Category Archives: D and C

Interfacing D with C: Strings Part One


This post is part of an ongoing series on working with both D and C in the same project. The previous two posts looked into interfacing D and C arrays. Here, we focus on a special kind of array: strings. Readers are advised to read Arrays Part One and Arrays Part Two before continuing with this one.

The same but different

D strings and C strings are both implemented as arrays of character types, but they have nothing more in common. Even that one similarity is only superficial. We’ve seen in previous blog posts that D arrays and C arrays are different under the hood: a C array is effectively a pointer to the first element of the array (or, in C parlance, C arrays decay to pointers, except when they don’t); a D dynamic array is a fat pointer, i.e., a length and pointer pair. A D array does not decay to a pointer, i.e., it cannot be implicitly assigned to a pointer or bound to a pointer parameter in an argument list. Example:

extern(C) void metamorphose(int* a, size_t len);

void main() {
    int[] a = [8, 4, 30];
    metamorphose(a, a.length);      // Error - a is not int*
    metamorphose(a.ptr, a.length);  // Okay
}

Beyond that, we’ve got further incompatibilities:

  • each of D’s three string types, string, wstring, and dstring, is encoded as Unicode: UTF-8, UTF-16, and UTF-32 respectively. The C char* can be encoded as UTF-8, but it isn’t required to be. Then there’s the C wchar_t*, which differs in bit size between implementations, never mind encoding.
  • all of D’s string types are dynamic arrays with immutable contents, i.e., string is an alias to immutable(char)[]. C strings are mutable by default.
  • the last character of every C string is required to be the NUL character (the escape character \0, which is encoded as 0 in most character sets); D strings are not required to be NUL-terminated.

It may appear at first blush as if passing D and C strings back and forth can be a major headache. In practice, that isn’t the case at all. In this and subsequent posts, we’ll see how easy it can be. In this post, we start by looking at how we can deal with NUL termination and wrap up by digging deeper into the related topic of how string literals are stored in memory.

NUL termination

Let’s get this out of the way first: when passing a D string to C, the programmer must ensure it is terminated with \0. std.string.toStringz, a simple utility function in the D standard library (Phobos), can be employed for this:

import core.stdc.stdio : puts;
import std.string : toStringz;

void main() {
    string s0 = "Hello C ";
    string s1 = s0 ~ "from D!";
    puts(s1.toStringz());
}

toStringz takes a single argument of type const(char)[] and returns immutable(char)* (there’s more about const vs. immutable in Part Two). The form s1.toStringz, known as UFCS (Uniform Function Call Syntax), is lowered by the compiler into toStringz(s1).
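As a quick sanity check of the points above, here is a minimal sketch (not part of the original example; the variable names are arbitrary) showing that the UFCS form and the plain call are interchangeable, that the result has the stated type, and that mutable buffers are accepted as well:

```d
import core.stdc.string : strcmp, strlen;
import std.string : toStringz;

void main() {
    string s0 = "Hello C ";
    string s1 = s0 ~ "from D!";

    // The UFCS form and the plain call are the same function call.
    assert(strcmp(s1.toStringz, toStringz(s1)) == 0);

    // The result is an immutable(char)*, terminated right after the last character.
    static assert(is(typeof(toStringz(s1)) == immutable(char)*));
    assert(strlen(s1.toStringz) == s1.length);

    // The parameter is const(char)[], so a mutable char[] works too.
    char[] buf = ['D'];
    assert(strlen(buf.toStringz) == 1);
}
```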

toStringz is the idiomatic approach, but it’s also possible to append "\0" manually. In that case, puts can be supplied with the string’s pointer directly:

import core.stdc.stdio : puts;

void main() {
    string s0 = "Hello C ";
    string s1 = s0 ~ "from D!" ~ "\0";
    puts(s1.ptr);
}

Forgetting to use .ptr will result in a compilation error, but forget to append the "\0" and who knows when someone will catch it (possibly after a crash in production and one of those marathon debugging sessions which can make some programmers wish they had never heard of programming). So prefer toStringz to avoid such headaches.

However, because strings in D are immutable, toStringz does allocate memory from the GC heap. The same is true when manually appending "\0" with the append operator. If there’s a requirement to avoid garbage collection at the point where the C function is called, e.g., in a @nogc function or when -betterC is enabled, it will have to be done in the same manner as in C, e.g., by allocating/reallocating space with malloc/realloc (or some other allocator) and copying the NUL terminator. (Also note that, in some situations, passing pointers to GC-managed memory from D to C can result in unintended consequences. We’ll dig into what that means, and how to avoid it, in Part Two.)
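For the GC-free route, something along these lines would do. This is only a sketch: toCString is a hypothetical helper name, not a Phobos function, and it assumes the caller is happy to own the buffer and release it with free.

```d
import core.stdc.stdio : puts;
import core.stdc.stdlib : free, malloc;
import core.stdc.string : memcpy;

// Hypothetical helper: copies `s` into a malloc'ed, NUL-terminated buffer.
// The caller owns the result and must release it with free().
char* toCString(const(char)[] s) @nogc nothrow {
    auto buf = cast(char*) malloc(s.length + 1);
    if (buf is null)
        return null;                 // allocation failed
    if (s.length)
        memcpy(buf, s.ptr, s.length);
    buf[s.length] = '\0';            // the terminator C expects
    return buf;
}

void main() @nogc nothrow {
    string s = "Hello C from D, no GC involved!";

    // Slicing does not allocate, and the slice is not NUL-terminated.
    auto cstr = toCString(s[0 .. 14]);
    if (cstr is null)
        return;
    scope(exit) free(cstr);

    puts(cstr);
}
```

Because both functions are marked @nogc, the compiler rejects any accidental GC allocation, such as an append, inside them.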

None of this applies when we’re dealing directly with string literals, as they get a bit of special treatment from the compiler that makes puts("Hello D from C!".toStringz) redundant. Let’s see why.

String literals in D are special

D programmers very often find themselves passing string literals to C functions. Walter Bright recognized early on how common this would be and decided that it needed to be just as seamless in D as it is in C. So he implemented string literals in a way that mitigates the two major incompatibilities that arise from NUL terminators and differences in array internals:

  1. D string literals are implicitly NUL-terminated.
  2. D string literals are implicitly convertible to const(char)*.

These two features may seem minor, but they are quite major in terms of convenience. That’s why I didn’t pass a literal to puts in the toStringz example. With a literal, it would look like this:

import core.stdc.stdio : puts;

void main() {
    puts("Hello C from D!");
}

No need for toStringz. No need for manual NUL termination or .ptr. It just works.

I want to emphasize that this only applies to string literals (of type string, wstring, and dstring) and not to string variables; once a string literal is included in an expression, the NUL-termination guarantee goes out the window. Also, no other array literal type is implicitly convertible to a pointer, so the .ptr property must be used to bind them to a pointer function parameter, e.g., giveMeIntPointer([1, 2, 3].ptr).
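These rules can be checked at compile time. The following is a small sketch using __traits(compiles); firstOf is a hypothetical stand-in for a C function taking an int pointer, written in D so the example builds by itself.

```d
import core.stdc.stdio : puts;
import core.stdc.string : strlen;

// Hypothetical stand-in for a C function that takes an int pointer.
extern(C) int firstOf(const(int)* p) { return p[0]; }

void main() {
    string s = "a string variable";

    // A string literal converts implicitly to const(char)* ...
    static assert( __traits(compiles, puts("a literal")));
    // ... but a string variable does not, even when initialized from a literal.
    static assert(!__traits(compiles, puts(s)));
    static assert( __traits(compiles, puts(s.ptr)));

    // wstring and dstring literals get the same treatment.
    const(wchar)* w = "wide"w;
    const(dchar)* d = "wider"d;
    assert(w[4] == 0 && d[5] == 0);

    // Other array literals need .ptr.
    static assert(!__traits(compiles, firstOf([1, 2, 3])));
    assert(firstOf([1, 2, 3].ptr) == 1);

    // The literal's terminator is not part of its length.
    assert(strlen("Hello C from D!") == 15);
}
```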

But there is a little more to this story.

String literals in memory

Normal array literals will usually trigger a GC allocation (unless the compiler can elide the allocation, such as when assigning the literal to a static array). Let’s do a bit of digging to see what happens with a D string literal:

import std.stdio;

void main() {
    writeln("Where am I?");
}

To make use of a command-line tool particularly convenient for this example, I compiled the above on 64-bit Linux with all three major compilers using the following command lines:

dmd -ofdmd-memloc memloc.d
gdc -o gdc-memloc memloc.d
ldc2 -ofldc-memloc memloc.d

If we were compiling C or C++, we could expect to find string literals in the read-only data segment, .rodata, of the binary. So let’s look there via the readelf command, which allows us to extract specific data from binaries in the ELF object file format, to see if the same thing happens with D. The following is abbreviated output for each binary:

readelf -x .rodata ./dmd-memloc | less
Hex dump of section '.rodata':
  0x0008e000 01000200 00000000 00000000 00000000 ................
  0x0008e010 04100000 00000000 6d656d6c 6f630000 ........memloc..
  0x0008e020 57686572 6520616d 20493f00 2f757372 Where am I?./usr
  0x0008e030 2f696e63 6c756465 2f646d64 2f70686f /include/dmd/pho
...

readelf -x .rodata ./gdc-memloc | less
Hex dump of section '.rodata':
  0x00003000 01000200 00000000 57686572 6520616d ........Where am
  0x00003010 20493f00 00000000 2f757372 2f6c6962  I?...../usr/lib
...

readelf -x .rodata ./ldc-memloc | less
Hex dump of section '.rodata':
  0x00001e40 57686572 6520616d 20493f00 00000000 Where am I?.....
  0x00001e50 2f757372 2f6c6962 2f6c6463 2f783836 /usr/lib/ldc/x86

In all three cases, the string is right there in the read-only data segment. The D spec explicitly avoids specifying where a string literal will be stored, but in practice, we can bank on the following: it might be in the binary’s read-only segment, or it might be in the normal data segment, but it won’t trigger a GC allocation, and it won’t be allocated on the stack.

Wherever it is, there’s a positive consequence that we can sometimes take advantage of. Notice in the readelf output that there is a dot (.) immediately following the question mark at the end of each string. That represents the NUL terminator. It is not counted in the string’s .length (so "Where am I?".length is 11 and not 12), but it’s still there. So when we initialize a string variable with a string literal or assign a string literal to a variable, the lack of an allocation also means there’s no copying, which in turn means the variable is pointing to the literal’s location in memory. And that means we can safely do this:

import core.stdc.stdio: puts;

void main() {
    string s = "I'm NUL-terminated.";
    puts(s.ptr);
    s = "And so am I.";
    puts(s.ptr);
}

If you’ve read the GC series on this blog, you are aware that the GC can only have a chance to run a collection if an attempt is made to allocate from the GC heap. More allocations mean a higher chance to trigger a collection and more memory that needs to be scanned when a collection runs. Many applications may never notice, but it’s a good policy to avoid GC allocations when it’s easy to do so. The above is a good example of just that: toStringz allocates, and we don’t need it in either call to puts because we can trust that s is NUL-terminated, so we don’t use it.

To be very clear: this is only true for string variables that have been directly initialized with a string literal or assigned one. If the value of the variable was the result of any other operation, then it cannot be considered NUL-terminated. Examples:

string s1 = s ~ "...I'm Unreliable!!";
string s2 = s ~ s1;
string s3 = format("I'm %s!!", "Unreliable");

None of these strings can be considered NUL-terminated. Each case will trigger a GC allocation. The runtime pays no mind to the NUL terminator of any of the literals during the append operations or in the format function, so the programmer can’t trust it will be part of the result. Pass any one of these strings to C without first terminating it and trouble will eventually come knocking.
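To put the two cases side by side, here is a minimal sketch (the variable names follow the snippets above; the assertions are my own illustration). The literal-initialized variable has its terminator one slot past .length, while the computed strings are terminated explicitly with toStringz before anything C-like touches them:

```d
import core.stdc.string : strlen;
import std.format : format;
import std.string : toStringz;

void main() {
    string s = "I'm NUL-terminated.";
    string s1 = s ~ "...I'm Unreliable!!";
    string s3 = format("I'm %s!!", "Unreliable");

    // Initialized directly from a literal: the terminator sits just past the end.
    assert(s.ptr[s.length] == '\0');
    assert(strlen(s.ptr) == s.length);

    // Results of runtime operations carry no such guarantee,
    // so terminate them explicitly before handing them to C.
    const(char)* c1 = s1.toStringz;
    const(char)* c3 = s3.toStringz;
    assert(strlen(c1) == s1.length);
    assert(strlen(c3) == s3.length);
}
```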

But hold on…

Given that you’re reading a D blog, you’re probably adventurous or like experimenting. That may lead you to discover another case that looks reliable:

import core.stdc.stdio: puts;

void main() {
    string s = "Am I " ~ "reliable?";
    puts(s.ptr);
}

The above very much looks like appending multiple string literals in an initialization or assignment is just as reliable as using a single string literal. We can strengthen that assumption with the following:

import std.stdio : writeln;

void main() {
    writeln("Am I reliable?".ptr);

    string s = "Am I " ~ "reliable?";
    writeln(s.ptr);
}

writeln is a templated function that recognizes when it’s being given a pointer; rather than treating it as a string and printing what it points to, it prints the pointer’s value. So we can print memory addresses in D without a format string.

Compiling the above, again on 64-bit Linux:

dmd -ofdmd-rely rely.d
gdc -o gdc-rely rely.d
ldc2 -ofldc-rely rely.d

Now let’s execute them all:

./dmd-rely
562363F63010
562363F63030

./gdc-rely
5566145E0008
5566145E0008

./ldc-rely
55C63CFB461C
55C63CFB461C

We see that dmd-rely prints two different addresses, but they’re very close together. Both gdc-rely and ldc-rely print a single address in both cases. And if we make use of readelf as we did with the memloc example above, we’ll find that, in every case, the literals are in the read-only data segment. Case closed!

Well, not so fast.

What’s happening is that all three compilers are performing an optimization known as constant folding. In short, they can recognize when all operands of an append expression are compile-time constants, so they can perform the append at compile-time to produce a single string literal. In this case, the net effect is the same as s = "Am I reliable?". LDC and GDC go further and recognize that the resulting literal is identical to the one used earlier, so they reuse the existing literal’s address (a.k.a. string interning). (Note that DMD also performs string interning, but currently it only kicks in when a string literal appears more than twice.)

To be clear: this only works because all of the operands are string literals. No matter how many string literals are involved in an operation, if even one operand is a variable, then the operation triggers a GC allocation.
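One way to observe the difference without inspecting binaries is to lean on the @nogc attribute, which rejects any runtime append. This is a sketch that assumes the behavior of the compilers discussed here, where the all-literal append is folded before the @nogc check runs:

```d
void main() {
    // All operands are literals: the append is folded at compile time,
    // so nothing is left for the @nogc check to reject.
    static assert( __traits(compiles, () @nogc {
        string s = "Am I " ~ "reliable?";
    }));

    // One variable operand is enough to require a runtime (GC) append.
    static assert(!__traits(compiles, () @nogc {
        string v = "Am I ";
        string s = v ~ "reliable?";
    }));
}
```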

Although we see that the result of an append operation involving string literals can be passed directly to C just fine, and we’ve proven that it’s stored in read-only memory alongside its NUL terminator, this is not something we should consider reliable. It’s an optimization that no compiler is required to perform. Though it’s unlikely that any of the three major D compilers will suddenly stop constant folding string literals, a future D compiler could possibly be released without this particular optimization and instead trigger a GC allocation.

In short: rely on this at your own risk.

Addendum: Compile rely.d on Windows with dmd and the binary will yield some very different output:

dmd -m64 -ofwin-rely.exe rely.d
./win-rely
7FF76701D440
7FF76702BB30

There is a much bigger difference in the memory addresses here than in the dmd binary on Linux. We’re dealing with the PE/COFF format in this case, and I’m not familiar with anything similar to readelf for that format on Windows. But I do know a little something about Agner Fog’s objconv utility. Not only does it convert between object file formats, but it can also disassemble them:

objconv -fasm win-rely.obj

This produces a file, win-rely.asm. Open it in a text editor and search for a portion of the string, e.g., "I rel". You’ll find the two entries aren’t too far apart, but one is located in a block of text under this heading:

rdata SEGMENT PARA 'CONST' ; section number 4

And the other under this heading:

.data$B SEGMENT PARA 'DATA' ; section number 6

In other words, one of them is in the read-only data segment (rdata SEGMENT PARA 'CONST'), and the other is in the regular data segment. This goes back to what I mentioned earlier about the D spec being explicitly silent on where string literals are stored. Regardless, the behavior of the program on Windows is the same as it is on Linux; passing the appended literal to puts, as in the earlier example, doesn’t blow anything up because the NUL terminator is still there, one slot past the last character. But it doesn’t change the fact that constant folding of appended string literals is an optimization and only to be relied upon at your own risk.

Conclusion

This post provides all that’s needed for many of the use cases encountered with strings when interacting with C from D, but it’s not the complete picture. In Part Two, we’ll look at how mutability, immutability, and constness come into the picture, how to avoid a potential problem spot that can arise when passing GC-allocated D strings to C, and how to get D strings from C strings. We’ll save encoding for Part Three.

Thanks to Walter Bright, Ali Çehreli, and Iain Buclaw for their valuable feedback on this article.

Interfacing D with C: Arrays and Functions (Arrays Part 2)


This post is part of an ongoing series on working with both D and C in the same project. The previous post explored the differences in array declaration and initialization. This post takes the next step: declaring and calling C functions that take arrays as parameters.

Arrays and C function declarations

Using C libraries in D is extremely easy. Most of the time, things work exactly as one would expect, but as we saw in the previous article there can be subtle differences. When working with C functions that expect arrays, it’s necessary to fully understand these differences.

The most straightforward and common way of declaring a C function that accepts an array as a parameter is to use a pointer in the parameter list. For example, this hypothetical C function:

void f0(int *arr);

In C, any array of int can be passed to this function no matter how it was declared. Given int a[], int b[3], or int *c, the function calls f0(a), f0(b), and f0(c) are all the same: a pointer to the first element of each array is passed to the function. Or using the lingo of C programmers, arrays decay to pointers.

Typically, in a function like f0, the implementer will expect the array to have been terminated with a marker appropriate to the context. For example, strings in C are arrays of char that are terminated with the \0 character (we’ll look at D strings vs. C strings in a future post). This is necessary because, without that character, the implementation of f0 has no way to know which element in the array is the last one. Sometimes, a function is simply documented to expect a certain length, either in comments or in the function name, e.g., a vector3f_add(float *vec) will expect that vec points to exactly 3 elements. Another option is to require the length of the array as a separate argument:

void f1(int *arr, size_t len);

None of these approaches is foolproof. If f0 receives an array with no end marker or which is shorter than documented, or if f1 receives an array with an actual length shorter than len, then the door is open for memory corruption. D arrays take this possibility into account, making it much easier to avoid such problems. But again, even D’s safety features aren’t 100% foolproof when calling C functions from D.
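One way D makes this easier is to hide the pointer/length pair behind a wrapper that takes a slice, so the two values can never disagree. The sketch below is an illustration only: sumInts is a hypothetical function with an f1-style signature, implemented in D here so the example links on its own.

```d
// Hypothetical stand-in for a C function with an f1-style signature.
extern(C) int sumInts(const(int)* arr, size_t len) {
    int total = 0;
    foreach (i; 0 .. len)
        total += arr[i];
    return total;
}

// D-side wrapper: the slice carries its own length,
// so the caller cannot pass a mismatched one.
int sum(const(int)[] arr) {
    return sumInts(arr.ptr, arr.length);
}

void main() {
    int[] a = [8, 4, 30];
    assert(sum(a) == 42);
    assert(sum(a[1 .. $]) == 34);   // a sub-slice brings the right length along

    int[3] fixed = [1, 2, 3];
    assert(sum(fixed[]) == 6);      // static arrays can be sliced too
}
```

The wrapper only protects the D side. If the C function itself reads past len, nothing here can stop it.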

There are other, less common, ways array parameters may be declared in C:

void f2(int arr[]);
void f3(int arr[9]);
void f4(int arr[static 9]);

Although these parameters are declared using C’s array syntax, they boil down to the exact same function signature as f0 because of the aforementioned pointer decay. The [9] in f3 triggers no special enforcement by the compiler; arr is still effectively a pointer to int with unknown length. The [9] serves as documentation of what the function expects, and the implementation cannot rely on the array having nine elements.

The only potential difference is in f4. The static added to the declaration tells the compiler that the function must take an array of, in this case, at least nine elements. It could have more than nine, but it can’t have fewer. That also rules out null pointers. The problem is, this isn’t necessarily enforced. Depending on which C compiler you use, if you shortchange the function and send it fewer than nine elements you might see warnings if they are enabled, but the compiler might not complain at all. (I haven’t tested current compilers for this article to see if any are actually reporting errors for this, or which ones provide warnings.)

The behavior of C compilers doesn’t matter from the D side. All we need be concerned with is declaring these functions appropriately so that we can call them from D such that there are no crashes or unexpected results. Because they are all effectively the same, we could declare them all in D like so:

extern(C):
void f0(int* arr);
void f1(int* arr, size_t len);
void f2(int* arr);
void f3(int* arr);
void f4(int* arr);

But just because we can do a thing doesn’t mean we should. Consider these alternative declarations of f2, f3, and f4:

extern(C):
void f2(int[] arr);
void f3(int[9] arr);
void f4(int[9] arr);

Are there any consequences of taking this approach? The answer is yes, but that doesn’t mean we should default to int* in each case. To understand why, we need first to explore the innards of D arrays.

The anatomy of a D array

The previous article showed that D makes a distinction between dynamic and static arrays:

int[] a0;
int[9] a1;

a0 is a dynamic array and a1 is a static array. Both have the properties .ptr and .length. Both may be indexed using the same syntax. But there are some key differences between them.

Dynamic arrays

Dynamic arrays are usually allocated on the heap (though that isn’t a requirement). In the above case, no memory for a0 has been allocated. It would need to be initialized with memory allocated via new or malloc, or some other allocator, or with an array literal. Because a0 is uninitialized, a0.ptr is null and a0.length is 0.

A dynamic array in D is an aggregate type that contains the two properties as members. Something like this:

struct DynamicArray {
    size_t length;  // number of elements
    int* ptr;       // address of the first element (int* for an int[])
}

In other words, a dynamic array is essentially a reference type, with the pointer/length pair serving as a handle that refers to the elements in the memory address contained in the ptr member. Every built-in D type has a .sizeof property, so if we take a0.sizeof, we’ll find it to be 8 on 32-bit systems, where size_t is a 4-byte uint, and 16 on 64-bit systems, where size_t is an 8-byte ulong. In short, it’s the size of the handle and not the cumulative size of the array elements.

Static arrays

Static arrays are generally allocated on the stack. In the declaration of a1, stack space is allocated for nine int values, all of which are initialized to int.init (which is 0) by default. Because a1 is initialized, a1.ptr points to the allocated space and a1.length is 9. Although these two properties are the same as those of the dynamic array, the implementation details differ.

A static array is a value type, with the value being all of its elements. So given the declaration of a1 above, its nine int elements indicate that a1.sizeof is 9 * int.sizeof, or 36. The .length property is a compile-time constant that never changes, and the .ptr property, though not readable at compile time, is also a constant that never changes (it’s not even an lvalue, which means it’s impossible to make it point somewhere else).
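The differences described above can be verified directly. Here is a short sketch (the extra variable names are mine) that checks the sizes and contrasts the reference-like behavior of dynamic arrays with the value semantics of static arrays:

```d
void main() {
    int[] a0;
    int[9] a1;

    // Dynamic array: a length/pointer handle, empty until given memory.
    static assert(a0.sizeof == 2 * size_t.sizeof);
    assert(a0.ptr is null && a0.length == 0);

    // Static array: the elements themselves, default-initialized to int.init.
    static assert(a1.sizeof == 9 * int.sizeof);
    static assert(a1.length == 9);
    assert(a1.ptr !is null && a1[8] == 0);

    // Copying a dynamic array copies the handle; both refer to the same elements.
    int[] d0 = [1, 2, 3];
    int[] d1 = d0;
    d1[0] = 99;
    assert(d0[0] == 99);

    // Copying a static array copies every element.
    int[3] s0 = [1, 2, 3];
    int[3] s1 = s0;
    s1[0] = 99;
    assert(s0[0] == 1);
}
```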

These implementation details are why we must pay attention when we cut and paste C array declarations into D source modules.

Passing D arrays to C

Let’s go back to the declaration of f2 in C and give it an implementation:

#include <stdio.h>

void f2(int arr[]) {
    for(int i=0; i<3; ++i)
        printf("%d\n", arr[i]);
}

A naïve declaration in D:

extern(C) void f2(int[]);

void main() {
    int[] a = [10, 20, 30];
    f2(a);
}

I say naïve because this is never the right answer. Compiling f2.c with df2.d on Windows (cl /c f2.c in the “x64 Native Tools” command prompt for Visual Studio, followed by dmd -m64 df2.d f2.obj), then running df2.exe, shows me the following output:

3
0
1970470928

There is no compiler error because the declaration of f2 is perfectly valid D. The extern(C) indicates that this function uses the cdecl calling convention. Calling conventions affect the way arguments are passed to functions and how the function’s symbol is mangled. In this case, the symbol will be either _f2 or f2 (other calling conventions, like stdcall, which is extern(Windows) in D, have different mangling schemes). The declaration still has to be valid D. (In fact, any D function can be marked as extern(C), something which is necessary when creating a D library that will be called from other languages.)

There is also no linker error. DMD is calling out to the system linker (in this case, Microsoft’s link.exe), the same linker used by the system’s C and C++ compilers. That means the linker has no special knowledge about D functions. All it knows is that there is a call to a symbol, f2 or _f2, that needs to be linked with the implementation. Since the type and number of parameters are not mangled into the symbol name, the linker will happily link with any matching symbol it finds (which, by the way, is the same thing it would do if a C program tried to call a C function which was declared with an incorrect parameter list).

The C function is expecting a single pointer as an argument, but it’s instead receiving two values: the array length followed by the array pointer.
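To see why the first two numbers printed were 3 and 0, it helps to look at the slice handle itself as raw 32-bit values. The sketch below assumes a 64-bit little-endian target and the length-then-pointer layout described earlier; firstWords is a hypothetical helper, and exactly how the handle reaches the C function depends on the platform’s calling convention.

```d
import std.stdio : writeln;

// Hypothetical helper: reads the first two 32-bit words of the slice
// handle itself (not of the elements it refers to).
int[2] firstWords(ref int[] a) {
    auto raw = cast(const(int)*) &a;
    return [raw[0], raw[1]];
}

void main() {
    int[] a = [10, 20, 30];

    static if (size_t.sizeof == 8) {
        version (LittleEndian) {
            // The 64-bit length, 3, occupies the first two 32-bit words;
            // the words after that hold the two halves of a.ptr.
            assert(firstWords(a) == [3, 0]);
        }
    }

    writeln(firstWords(a));
}
```

A C function that treats this memory as an array of int will read the length and then pieces of the pointer, never the elements 10, 20, and 30.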

The moral of this story is that any C function with array parameters declared using array syntax, like int[], should be declared to accept pointers in D. Change the D source to the following and recompile using the same command line as before (there’s no need to recompile the C file):

extern(C) void f2(int*);

void main() {
    int[] a = [10, 20, 30];
    f2(a.ptr);
}

Note the use of a.ptr. It’s an error to try to pass a D array argument where a pointer is expected (with one very special exception, string literals, which I’ll cover in the next article in this series), so the array’s .ptr property must be used instead.

The story for f3 and f4 is similar:

void f3(int arr[9]);
void f4(int arr[static 9]);

Remember, int[9] in D is a static array, not a dynamic array. The following do not match the C declarations:

void f3(int[9]);
void f4(int[9]);

Try it yourself. The C implementation:

#include <stdio.h>

void f3(int arr[9]) {
    for(int i=0; i<9; ++i)
        printf("%d\n", arr[i]);
}

And the D implementation:

extern(C) void f3(int[9]);

void main() {
    int[9] a = [10, 20, 30, 40, 50, 60, 70, 80, 90];
    f3(a);
}

This is likely to crash, depending on the system. Rather than passing a pointer to the array, this code is instead passing all nine array elements by value! Now consider a C library that does something like this:

typedef float mat4f[16];
void do_stuff(mat4f mat);

Generally, when writing D bindings to C libraries, it’s a good idea to keep the same interface as the C library. But if the above is translated like the following in D:

alias mat4f = float[16];
extern(C) void do_stuff(mat4f);

The sixteen floats will be passed to do_stuff every time it’s called. The same for all functions that take a mat4f parameter. One solution is just to do the same as in the int[] case and declare the function to take a pointer. However, that’s no better than C, as it allows the function to be called with an array that has fewer elements than expected. We can’t do anything about that in the int[] case, but that will usually be accompanied by a length parameter on the C side anyway. C functions that take typedef’d types like mat4f usually don’t have a length parameter and rely on the caller to get it right.

In D, we can do better:

extern(C) void do_stuff(ref mat4f);

Not only does this match the API implementor’s intent, the compiler will guarantee that any arrays passed to do_stuff are static float arrays with 16 elements. Since a ref parameter is just a pointer under the hood, all is as it should be on the C side.

With that, we can rewrite the f3 example:

extern(C) void f3(ref int[9]);

void main() {
    int[9] a = [10, 20, 30, 40, 50, 60, 70, 80, 90];
    f3(a);
}
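Both claims, that ref passes an address rather than a copy and that the compiler enforces the element count, can be checked without a C compiler. In this sketch, sameAddress is a hypothetical function written in D purely for illustration:

```d
// Hypothetical function, defined in D for illustration: reports whether the
// ref parameter refers to the memory at `expected`.
extern(C) bool sameAddress(ref const int[9] arr, const(int)* expected) {
    return arr.ptr is expected;
}

void main() {
    int[9] a = [10, 20, 30, 40, 50, 60, 70, 80, 90];
    int[3] tooShort = [1, 2, 3];
    int[] slice = a[];

    // The callee sees the caller's array, not a copy of its nine elements.
    assert(sameAddress(a, a.ptr));

    // Arrays of the wrong size, and dynamic arrays, are rejected at compile time.
    static assert(!__traits(compiles, sameAddress(tooShort, null)));
    static assert(!__traits(compiles, sameAddress(slice, null)));
}
```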

Conclusion

Most of the time, when interfacing with C from D, the C API declarations and any example code can be copied verbatim in D. But most of the time is not all of the time, so care must be taken to account for those exceptional cases. As we saw in the previous article, carelessness when declaring array variables can usually be caught by the compiler. As this article shows, the same is not the case for C function declarations. Interfacing D with C requires the same care as when writing C code.

In the next article in this series, we’ll look at mixing D strings and C strings in the same program and some of the pitfalls that may arise. In the meantime, Steven Schveighoffer’s excellent article, “D Slices”, is a great place to start for more details about D arrays.

Thanks to Walter Bright and Átila Neves for their valuable feedback on this article.

D For Data Science: Calling R from D

D is a good language for data science. The advantages include a pleasant syntax, interoperability with C (in many cases as simple as adding an #include directive to import a C header file via the dpp tool), C-like speed, a large standard library, static typing, built-in unit tests and documentation generation, and a garbage collector that’s there when you want it but can be avoided when you don’t.

Library selection for data science is a different story. Although there are some libraries available, such as those provided by the mir project, the available functionality is extremely limited compared with languages like R and Python. The good news is that it’s possible to call functions in either language from D.

This article shows how to embed an R interpreter inside a D program, pass data between the two languages, execute arbitrary R code from within a D program, and call the R interface to C, C++, and Fortran libraries from D. Although I only provide examples for Linux, the same steps apply for Windows if you’re using WSL, and with minor modifications to the DUB package file, everything should work on macOS. Although it is possible to do so, I don’t talk about calling D functions from R, and I don’t include any discussion of interoperability with Python. (This is normally done using pyd.)

Dependencies

The following three dependencies should be installed:

  • R
  • R package RInsideC
  • R package embedr

It’s assumed that anyone reading this post already has R installed or can install it if they don’t. The RInsideC package is a slightly modified version of the excellent RInside project of Dirk Eddelbuettel and Romain Francois. RInside provides a C++ interface to R. The modifications provide a C interface so that R can be called from any language capable of calling C functions. Install the package using devtools:

library(devtools)
install_bitbucket("bachmeil/rinsidec")

The embedr package provides the necessary functions to work with R from within D. That package is also installed with devtools:

install_bitbucket("bachmeil/embedr")

A First Program

The easiest way to do the compilation is to use D’s package manager, called DUB. From within your project directory, open R and create a project skeleton:

library(embedr)
dubNew()

This will create a /src subdirectory to hold your project’s source code if it doesn’t already exist, add a file called r.d to /src and create a dub.sdl file in the project directory. Create a file in the /src directory called hello.d, containing the following program:

import embedr.r;

void main() {
  evalRQ(`print("Hello, World!")`);
}

From the terminal, in the project directory (the one holding dub.sdl, not the /src subdirectory), enter

dub run

This will print out “Hello, World!”. The important thing to realize is that even though you just used DUB to compile and run a D program, it was R that printed “Hello, World!” to the screen.

Executing R Code From D

There are two ways to execute R code from a D program. evalR executes a string in R and returns the output to D, while evalRQ does the same thing but suppresses the output. evalRQ also accepts an array of strings that are executed sequentially.

Create a new project directory and run dubNew inside it, as you did for the first example. In the src/ subdirectory, add a file named reval.d:

import embedr.r;
import std.stdio;

void main() {
  // Example 1
  evalRQ(`print(3+2)`); // evaluates to 5 in R, R prints the output [1] 5 to the screen

  // Example 2
  writeln(evalR(`3+2`).scalar); // evaluates to 5 in R, output is 5

  // Example 3
  evalRQ(`3+2`); // evaluates to 5 in R, but there is no output

  // Example 4
  evalRQ([`x <- 3`, `y <- 2`, `z <- x+y`, `print(z)`]); // evaluates this code in R
}

Example 1 tells R to print the sum of 3 and 2. Because we use evalRQ, no output is returned to D, but R is able to print to the screen. Example 2 evaluates 3+2 in R and returns the output to D in the form of an Robj. evalR(`3+2`).scalar executes 3+2 inside R, captures the output in an Robj, and converts the Robj into a double holding the value 5. This value is passed to the writeln function and printed to the screen. Example 3 doesn’t output anything, because evalRQ does not return any output, and R isn’t being told to print anything to the screen. Example 4 executes the four strings in the array sequentially, returning nothing to D, but the last tells R to print the value of z to the screen.

There’s not much more to say about executing R code from D. You can execute any valid R code from D, and if there’s an error, it will be caught and printed to the screen. Graphical output is automatically captured in a PDF file. To work interactively with R, or if it’s sufficient to save the results to a text file and read them into D, this is all you need to know. The more interesting cases involve passing data between D and R, and for the times when there is no alternative, using the R interface to call directly into C, C++, or Fortran libraries.

Passing Data Between D and R

A little background is needed to understand how to pass data between D and R. Everything in R is represented as a C struct named SEXPREC, and a pointer to a SEXPREC struct is called a SEXP in the R source code. Those names reflect R’s origin as a Scheme dialect, where code takes the form of s-expressions. In order to avoid misunderstanding, embedr uses the name Robj instead of SEXP.

It’s necessary to let R allocate the memory for any data passed to R. For instance, you cannot tell D to allocate a double[] array and then pass a pointer to that array to R. You would instead do something like this:

auto v = RVector(100);
foreach(ii; 0..100) {
  v[ii] = 1.5*ii;
}
v.toR("vv");
evalRQ(`print(vv)`);

The first line tells R to allocate a vector with room for 100 elements. v is a D struct holding a pointer to the memory allocated by R plus additional information that allows you to read and change the elements of the vector. Behind the scenes, the RVector struct protects the vector from R’s garbage collector. R is a garbage collected language, and if the only reference to the data is in your D program, there’s nothing to prevent the R garbage collector from freeing that memory. The RVector struct uses the reference counting mechanism described in Adam Ruppe’s D Cookbook to protect objects from R’s garbage collector and unprotect them when they’re no longer in use.

After filling in all 100 elements of v, the toR function creates a new variable in R called vv, and associates it with the vector held inside v. The final line tells R to print out the variable vv.

In practice, no data is ever passed between D and R. The only thing that’s passed around is a single pointer to the memory allocated by R. That means it’s practical to call R functions from D even for very large datasets.
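The reference counting idea mentioned above can be sketched in a few lines of plain D. This is not embedr's code: protectObj and unprotectObj are hypothetical stand-ins for whatever calls protect and release an R object, and here they only maintain a counter so the behavior can be observed.

```d
import std.stdio;

// Hypothetical stand-ins for the calls that protect/unprotect an R object.
int protectedCount = 0;
void protectObj(void* p)   { ++protectedCount; }
void unprotectObj(void* p) { --protectedCount; }

struct ProtectedHandle {
    private void* ptr;
    private int* refs;          // count shared by all copies of this handle

    this(void* p) {
        ptr = p;
        refs = new int;
        *refs = 1;
        protectObj(ptr);        // protect once, when the first owner appears
    }
    this(this) {                // postblit: runs after every copy
        if (refs) ++*refs;
    }
    ~this() {                   // unprotect only when the last copy dies
        if (refs && --*refs == 0)
            unprotectObj(ptr);
    }
}

// Returns the number of objects still protected after all copies are gone.
int demo() {
    int dummy;                  // stands in for memory owned by R
    {
        auto a = ProtectedHandle(&dummy);
        auto b = a;             // second reference, same object
        assert(protectedCount == 1);
    }
    return protectedCount;
}

void main() {
    writeln(demo());
}
```

Copies share one counter, so the object stays protected for as long as any D-side reference to it is alive.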

Calling the R API

The R API provides a convenient (by C standards) interface to some of R’s functions and constants, including the numerical optimization routines underlying optim, distribution functions, and random number generators. This example shows how to solve an unconstrained nonlinear optimization problem using the Nelder-Mead algorithm, which is the default when calling optim in R.

The objective function is

f = x^2 + y^2

We want to choose x and y to minimize f. The obvious solution is x=0 and y=0.

Create a new project directory and initialize DUB from within R, with the one additional step to add the wrapper for R’s optimization libraries:

library(embedr)
dubNew()
dubOptim()

dubOptim() adds the file optim.d to the src/ directory. Create a file called nelder.d inside the src directory with the following program:

import embedr.r, embedr.optim;
import std.stdio;

extern(C) {
  double f(int n, double * par, void * ex) {
    return par[0]*par[0] + par[1]*par[1];
  }
}

void main() {
  auto nm = NelderMead(&f);
  OptimSolution sol = nm.solve([3.5, -5.5]);
  sol.print;
}

First we define the objective function, f, using the C calling convention so it can be passed to various C functions. We then create a new struct called NelderMead, passing a pointer to f to its constructor. Finally, we call the solve method, using [3.5, -5.5] as the array of starting values, and print out the solution. You’ll want to confirm that the failure code in the output is false (implying the convergence criterion was met). The most common reason Nelder-Mead fails to converge is that it ran out of iterations. To change the maximum number of iterations to 10,000, you’d add nm.maxit = 10_000; to your program before the call to nm.solve.
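To see what "using the C calling convention" buys us without involving embedr at all, here is a minimal sketch. The Objective alias and the evalAt helper are assumptions made for illustration; only f comes from the program above.

```d
import std.stdio;

extern(C) {
    // Function pointer type matching the signature of f (name is an assumption).
    alias Objective = double function(int n, double* par, void* ex);

    double f(int n, double* par, void* ex) {
        return par[0]*par[0] + par[1]*par[1];
    }
}

// Hypothetical helper: calls an objective the way a C routine would,
// with a length, a raw pointer, and no extra data.
double evalAt(Objective fn, double[] x) {
    return fn(cast(int) x.length, x.ptr, null);
}

void main() {
    writeln(evalAt(&f, [3.5, -5.5]));   // value at the starting point
    writeln(evalAt(&f, [0.0, 0.0]));    // value at the known minimum
}
```

Because f has C linkage, &f has exactly the pointer type a C optimizer expects, so no adapter is needed between the D function and the C routine.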

There’s no overhead associated with calling an interpreted language in this example. We’re calling a C shared library directly, and at no point does the R interpreter get involved. As in the previous example, since there’s no copying of data, this approach is efficient even for large datasets. Finally, if you’re not comfortable with garbage collection, the inner loops of the optimization are done entirely in C. We nonetheless do take advantage of the convenience and safety of D’s garbage collector when allocating the nm and sol structs, as the performance advantages of manual memory management (to the extent that there are any) are irrelevant.

Calling R Interfaces from D

The purpose of many R packages is to provide a convenient interface to a C, C++, or Fortran library. The term “R interface” normally means one of two things. For modern C or C++ code, it’s a function taking Robj structs as arguments and returning one Robj struct as the output. For Fortran code and older C or C++ code, it’s a void function taking pointers as arguments. In either case, you can call the R interface directly from D code, meaning any library with an R interface also has a D interface.

An example of an R interface to Fortran code is found in the popular glmnet package. Lasso estimation using the elnet function is done by passing 28 pointers to the function elnet in libglmnet.so with this interface:

.Fortran("elnet", ka, parm=alpha, nobs, nvars, as.double(x), y,
                  weights, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,
                  isd, intr, maxit, lmu=integer(1), a0=double(nlam),
                  ca=double(nx*nlam), ia=integer(nx), nin=integer(nlam),
                  rsq=double(nlam), alm=double(nlam), nlp=integer(1),
                  jerr=integer(1), PACKAGE="glmnet")

You might want to work with the R interface directly if you’re calling elnet inside a loop in your D program. Most of the time it’s better to pass the data to R and then call the R function that calls elnet. Calling Fortran functions can be error-prone, leading to hard-to-debug segmentation faults.

Conclusion

D was designed from the beginning to be compatible with the C ABI. The intention was to facilitate the integration of new D code into existing C code bases. The practical result has been that, due to C’s lingua franca status, D can be used in combination with myriad languages. Data scientists looking for alternatives to C and C++ when working with R may find benefit in giving D a close look.

Lance Bachmeier is an associate professor of economics at Kansas State University and co-editor of the journal Energy Economics. He does research on macroeconomics and energy economics. He has been using the D programming language in his research since 2013.

Writing a D Wrapper for a C Library

In porting to D a program I created for a research project, I wrote a D wrapper of a C library in an object-oriented manner. I want to share my experience with other programmers. This article provides some D tips and tricks for writers of D wrappers around C libraries.

I initially started my research project using the Ada 2012 programming language (see my article “Experiences on Writing Ada Bindings for a C Library” in Ada User Journal, Volume 39, Number 1, March 2018). Due to a number of bugs that I was unable to overcome, I started looking for another programming language. After some unsatisfying experiments with Java and Python, I settled on the D programming language.

The C Library

We have a C library, written in an object-oriented style (C structure pointers serve as objects, and C functions taking such structure pointers serve as methods). Fortunately for us, there is no inheritance in that C library.

The particular libraries we will deal with are the Redland RDF Libraries, a set of libraries which parse Resource Description Framework (RDF) files or other RDF resources, manage them, enable RDF queries, etc. Don’t worry if you don’t know what RDF is; it is not really relevant for this article.

The first stage of this project was to write a D wrapper over librdf. I modeled it on the Ada wrapper I had already written. One advantage I found in D over Ada is that template instantiation is easier—there’s no need in D to instantiate every single template invocation with a separate declaration. I expect this to substantially simplify the code of XML Boiler, my program which uses this library.

I wrote both raw bindings and a wrapper. The bindings translate the C declarations directly into D, and the wrapper is a new API which is a full-fledged D interface. For example, it uses D types with constructors and destructors to represent objects. It also uses some other D features which are not available in C. This is a work in progress and your comments are welcome.

The source code of my library (forked from Dave Beckett’s original multi-language bindings of his libraries) is available at GitHub (currently only in the dlang branch). Initially, I tried some automatic parsers of C headers which generate D code. I found these unsatisfactory, so I wrote the necessary bindings myself.

Package structure

I put my entire API into the rdf.* package hierarchy. I also have the rdf.auxiliary package and its subpackages for things used by or with my bindings. I will discuss some particular rdf.auxiliary.* packages below.

My mixins

In Ada I used tagged types, which are a rough equivalent of D classes, and derived _With_Finalization types from _Without_Finalization types (see below). However, tagged types increase variable sizes and execution time.

In D I use structs instead of classes, mainly for efficiency reasons. D structs do not support inheritance, and therefore have no virtual method table (vtable), but do provide constructors and destructors, making classes unnecessary for my use case (however, see below). To simulate inheritance, I use template mixins (defined in the rdf.auxiliary.handled_record module) and the alias this construct.

As I’ve said above, C objects are pointers to structures. All C pointers to structures have the same format and alignment (ISO/IEC 9899:2011 section 6.2.5 paragraph 28). This allows the representation of any pointer to a C structure as a pointer to an opaque struct (in the below example, URIHandle is an opaque struct declared as struct URIHandle;).

Using the mixins shown below, we can declare the public structs of our API this way (you should look into the actual source for real examples):

struct URIWithoutFinalize {
    mixin WithoutFinalize!(URIHandle,
                           URIWithoutFinalize,
                           URI,
                           raptor_uri_copy);
    // …
}
struct URI {
    mixin WithFinalize!(URIHandle,
                        URIWithoutFinalize,
                        URI,
                        raptor_free_uri);
}

The difference between the WithoutFinalize and WithFinalize mixins is explained below.

About finalization and related stuff

The main challenge in writing object-oriented bindings for a C library is finalization.

In the C library under consideration (as well as in many other C libraries), every object is represented as a pointer to a dynamically allocated C structure. The corresponding D object can be a struct holding the pointer (aka handle), but oftentimes a C function returns a so-called “shared handle”: a pointer to a C struct which we should not free because it is part of a larger C object and will be freed by the C library only when that larger C object goes away.

As such, I first define both (for example) URIWithoutFinalize and URI. Only URI has a destructor. For URIWithoutFinalize, a shared handle is not finalized. As D does not support inheritance for structs, I do it with template mixins instead. Below is a partial listing. See the above URI example on how to use them:

mixin template WithoutFinalize(alias Dummy,
                               alias _WithoutFinalize,
                               alias _WithFinalize,
                               alias copier = null)
{
    private Dummy* ptr;
    private this(Dummy* ptr) {
        this.ptr = ptr;
    }
    @property Dummy* handle() const {
        return cast(Dummy*)ptr;
    }
    static _WithoutFinalize fromHandle(const Dummy* ptr) {
        return _WithoutFinalize(cast(Dummy*)ptr);
    }
    static if(isCallable!copier) {
        _WithFinalize dup() {
            return _WithFinalize(copier(ptr));
        }
    }
    // ...
}


mixin template WithFinalize(alias Dummy,
                            alias _WithoutFinalize,
                            alias _WithFinalize,
                            alias destructor,
                            alias constructor = null)
{
    private Dummy* ptr;
    @disable this();
    @disable this(this);
    // Use fromHandle() instead
    private this(Dummy* ptr) {
        this.ptr = ptr;
    }
    ~this() {
        destructor(ptr);
    }
    /*private*/ @property _WithoutFinalize base() { // private does not work in v2.081.2
        return _WithoutFinalize(ptr);
    }
    alias base this;
    @property Dummy* handle() const {
        return cast(Dummy*)ptr;
    }
    static _WithFinalize fromHandle(const Dummy* ptr) {
        return _WithFinalize(cast(Dummy*)ptr);
    }
    // ...
}

I’ve used template alias parameters here, which allow a template to be parameterized with more than just types. The Dummy argument is the type of the handle instance (usually an opaque struct). The destructor and copier arguments are self-explanatory. For the usage of the constructor argument, see the real source (here it is omitted).

The _WithoutFinalize and _WithFinalize template arguments should specify the structs we define, allowing them to reference each other. Note that the alias this construct makes _WithoutFinalize essentially a base of _WithFinalize, allowing us to use all methods and properties of _WithoutFinalize in _WithFinalize.

Also note that instances of the _WithoutFinalize type may become invalid, i.e., they may contain dangling pointers. It seems that there is no easy way to deal with this problem because of the way the C library works. We may not know when an object is destroyed by the C library. Or we may know but be unable to appropriately “explain” it to the D compiler. Just be careful when using this library not to use objects which are already destroyed.
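The ownership split can be tried out in isolation. The sketch below deliberately leaves out the mixins and the real C library: Handle, newHandle, and freeHandle are hypothetical stand-ins, and the two structs are written out by hand to show only the owning/non-owning distinction and the alias this forwarding.

```d
import core.stdc.stdlib : malloc, free;
import std.stdio;

struct Handle { int value; }    // stand-in for an opaque C struct

int liveHandles = 0;            // handles allocated and not yet freed

// Hypothetical stand-ins for the C library's allocation and free functions.
Handle* newHandle(int v) {
    auto h = cast(Handle*) malloc(Handle.sizeof);
    h.value = v;
    ++liveHandles;
    return h;
}
void freeHandle(Handle* h) { free(h); --liveHandles; }

// Non-owning view: has no destructor, so it never frees the handle.
struct ThingWithoutFinalize {
    private Handle* ptr;
    int value() const { return ptr.value; }
}

// Owning wrapper: frees the handle exactly once, in its destructor.
struct Thing {
    private Handle* ptr;
    @disable this();
    @disable this(this);        // no copies, so no double free
    private this(Handle* p) { ptr = p; }
    ~this() { freeHandle(ptr); }
    @property ThingWithoutFinalize base() { return ThingWithoutFinalize(ptr); }
    alias base this;            // view methods are callable on Thing
    static Thing create(int v) { return Thing(newHandle(v)); }
}

// Returns the number of live handles after the owner has left scope.
int demo() {
    {
        auto t = Thing.create(42);
        assert(t.value == 42);  // resolved through alias this
        assert(liveHandles == 1);
    }
    return liveHandles;
}

void main() {
    writeln(demo());
}
```

A ThingWithoutFinalize obtained from t.base and kept past the end of t's scope would be exactly the kind of dangling view the caveat above warns about.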

Dealing with callbacks

To deal with C callbacks (particularly when accepting a void* argument for additional data) in an object-oriented way, we need a way to convert between C void pointers and D class objects (we pass D objects as C “user data” pointers). D structs are enough (and are very efficient) to represent C objects like librdf library objects, but for conveniently working with callbacks, classes are more useful because they provide good callback machinery in the form of virtual functions.

First, a D object which is passed as a callback parameter to C should not unexpectedly be moved in memory by the D garbage collector. So I make such objects descendants of this class:

import core.memory : GC;

class UnmovableObject {
    this() {
        // Ask the GC never to relocate the block holding this object.
        GC.setAttr(cast(void*)this, GC.BlkAttr.NO_MOVE);
    }
}

Moreover, I add the property context() to pass it as a void* pointer to C functions which register callbacks:

abstract class UserObject : UnmovableObject {
    final @property void* context() const { return cast(void*)this; }
}

When we create a callback we need to pass a D object as a C pointer and an extern(C) function defined by us as the callback. The callback receives the pointer previously passed by us and in the callback code we should (if we want to stay object-oriented) convert this pointer into a D object pointer.

What we need is a bijective (“back and forth”) mapping between D pointers and C void* pointers. This is trivial in D: just use the cast() operator.

How to do this in practice? The best way to explain is with an example. We will consider how to create an I/O stream class which uses the C library callbacks to implement it. For example, when the user of our wrapper requests to write some information to a file, our class receives a write message. To handle this message, our implementation calls our virtual function doWriteBytes(), which actually handles the user’s request.

private immutable DispatcherType Dispatch =
    { version_: 2,
      init: null,
      finish: null,
      write_byte : &raptor_iostream_write_byte_impl,
      write_bytes: &raptor_iostream_write_bytes_impl,
      write_end  : &raptor_iostream_write_end_impl,
      read_bytes : &raptor_iostream_read_bytes_impl,
      read_eof   : &raptor_iostream_read_eof_impl };


class UserIOStream : UserObject {
    IOStream record;
    this(RaptorWorldWithoutFinalize world) {
        IOStreamHandle* handle = raptor_new_iostream_from_handler(world.handle,
                                                                  context,
                                                                  &Dispatch);
        record = IOStream.fromNonnullHandle(handle);
    }
    void doWriteByte(char byte_) {
        if(doWriteBytes(&byte_, 1, 1) != 1)
            throw new IOStreamException();
    }
    abstract int doWriteBytes(char* data, size_t size, size_t count);
    abstract void doWriteEnd();
    abstract size_t doReadBytes(char* data, size_t size, size_t count);
    abstract bool doReadEof();
}

And for example:

int raptor_iostream_write_bytes_impl(void* context, const void* ptr, size_t size, size_t nmemb) {
    try {
        return (cast(UserIOStream)context).doWriteBytes(cast(char*)ptr, size, nmemb);
    }
    catch(Exception) {
        return -1;
    }
}
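The whole round trip (object to void*, through a C-style callback, and back to a virtual call) can be exercised without librdf. In this sketch, Sink, Collector, writeImpl, and fakeCLibraryWrite are all hypothetical names; fakeCLibraryWrite merely plays the part of a C library that knows nothing but a function pointer and a user-data pointer.

```d
import std.stdio;

abstract class Sink {
    // The object reference itself serves as the C "user data" pointer.
    final void* context() { return cast(void*) this; }
    abstract int doWrite(const(char)* data, size_t len);
}

class Collector : Sink {
    string collected;
    override int doWrite(const(char)* data, size_t len) {
        collected ~= data[0 .. len].idup;   // copy the bytes we were handed
        return cast(int) len;
    }
}

extern(C) {
    alias WriteFn = int function(void* context, const(void)* data, size_t len);

    // Trampoline: turns the void* back into an object and dispatches virtually.
    int writeImpl(void* context, const(void)* data, size_t len) {
        try {
            return (cast(Sink) context).doWrite(cast(const(char)*) data, len);
        } catch (Exception) {
            return -1;                      // never let exceptions cross into C
        }
    }
}

// Stand-in for the C library: it sees only the callback and the void*.
int fakeCLibraryWrite(WriteFn fn, void* context, string text) {
    return fn(context, text.ptr, text.length);
}

string demo() {
    auto c = new Collector;
    fakeCLibraryWrite(&writeImpl, c.context, "hello");
    return c.collected;
}

void main() {
    writeln(demo());
}
```

Note that the cast from void* to a class type is unchecked, so the trampoline is only safe when the pointer really came from context() on an object of that class.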

More little things

I “encode” C strings (which can be null) as a D template instance, Nullable!string. If the string is null, the holder is empty. However, it is often enough to transform an empty D string into a null C string (this can work only if we don’t differentiate between empty and null strings). See rdf.auxiliary.nullable_string for actually useful code.
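A minimal sketch of such a conversion, using only Phobos, might look like the following. The helper names toCString and fromCString are assumptions for this example and are not taken from rdf.auxiliary.nullable_string.

```d
import std.stdio;
import std.string : toStringz, fromStringz;
import std.typecons : Nullable, nullable;

// Empty holder -> null pointer; otherwise a NUL-terminated copy.
const(char)* toCString(Nullable!string s) {
    return s.isNull ? null : s.get.toStringz;
}

// Null pointer -> empty holder; otherwise a D string copied from the C string.
Nullable!string fromCString(const(char)* p) {
    if (p is null)
        return Nullable!string.init;
    return nullable(p.fromStringz.idup);
}

void main() {
    auto missing = fromCString(null);
    auto present = fromCString(toCString(nullable("abc")));
    writeln(missing.isNull);    // the null C string stays distinguishable
    writeln(present.get);
}
```

With this encoding, an empty string and a null string remain different values on the D side, which the simpler "empty means null" shortcut cannot express.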

I could write a lot more advice on how to write D bindings for a C library, but you can just follow my source, which can serve as an example.

Static if

One thing which can be done in D but not in Ada is compile-time comparison via static if. This is a D construct (similar to but more advanced than C conditional preprocessor directives) which allows conditional compilation based on compile-time values. I use static if with my custom Version type to enable/disable features of my library depending on the available features of the version of the base C library in use. In the following example, rasqalVersionFeatures is a D constant defined in my rdf.config package, created by the GNU configure script from the config.d.in file.

static if(Version(rasqalVersionFeatures) >= Version("0.9.33")) {
    private extern extern(C)
    QueryResultsHandle* rasqal_new_query_results_from_string(RasqalWorldHandle* world,
                                                             QueryResultsType type,
                                                             URIHandle* base_uri,
                                                             const char* string,
                                                             size_t string_len);
    static create(RasqalWorldWithoutFinalize world,
                  QueryResultsType type,
                  URITypeWithoutFinalize baseURI,
                  string value)
    {
        return QueryResults.fromNonnullHandle(
            rasqal_new_query_results_from_string(world.handle,
                                                 type,
                                                 baseURI.handle,
                                                 value.ptr, value.length));
    }
}
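For readers who want to experiment with this pattern, here is a self-contained sketch. The Version struct below is a simplified stand-in written for this example (the real one lives in the library's source), and libraryVersion is an assumed value playing the role of rasqalVersionFeatures.

```d
import std.stdio;

// Simplified three-part version number, usable at compile time.
struct Version {
    int[3] parts;

    this(string s) {
        size_t i = 0;
        foreach (c; s) {
            if (c == '.') {
                if (++i >= parts.length) break;     // ignore extra components
            } else {
                parts[i] = parts[i] * 10 + (c - '0');
            }
        }
    }

    int opCmp(const Version other) const {
        foreach (i; 0 .. parts.length) {
            if (parts[i] != other.parts[i])
                return parts[i] < other.parts[i] ? -1 : 1;
        }
        return 0;
    }
}

enum libraryVersion = "0.9.35";     // assumed; normally generated by configure

// The comparison runs entirely at compile time.
static if (Version(libraryVersion) >= Version("0.9.33")) {
    enum hasNewApi = true;
} else {
    enum hasNewApi = false;
}

void main() {
    writeln("new API available: ", hasNewApi);
}
```

Comparing numeric components rather than strings matters here: as text, "0.10.0" sorts before "0.9.33", while the struct orders them correctly.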

Comparisons

Order comparisons between structs can be easily done with this mixin:

mixin template CompareHandles(alias equal, alias compare) {
    import std.traits;
    bool opEquals(const ref typeof(this) s) const {
        static if(isCallable!equal) {
          return equal(handle, s.handle) != 0;
        } else {
          return compare(handle, s.handle) == 0;
        }
    }
    int opCmp(const ref typeof(this) s) const {
      return compare(handle, s.handle);
    }
}

Sadly, this mixin has to be mixed into both the _WithoutFinalize and the _WithFinalize structs. I found no way to write it only once.

Conclusion

I’ve found that D is a great language for writing object-oriented wrappers around C libraries. There are some small annoyances like using class wrappers around structs for callbacks, but generally, D wraps around C well.


Victor Porton is an open source developer, a math researcher, and a Christian writer. He earns his living as a programmer.

Interfacing D with C: Arrays Part 1

This post is part of an ongoing series on working with both D and C in the same project. The previous post showed how to compile and link C and D objects. This post is the first in a miniseries focused on arrays.

When interacting with C APIs, it’s almost a given that arrays are going to pop up in one way or another (perhaps most often as strings, a subject of a future article in the “D and C” series). Although D arrays are implemented in a manner that is not directly compatible with C, the fundamental building blocks are the same. This makes compatibility between the two relatively painless as long as the differences are not forgotten. This article is the first of a few exploring those differences.

When using a C API from D, it’s sometimes necessary to translate existing code from C to D. A new D program can benefit from existing examples of using the C API, and anyone porting a program from C that uses the API would do well to keep the initial port as close to the original as possible. It’s on that basis that we’re starting off with a look at the declaration and initialization syntax in both languages and how to translate between them. Subsequent posts in this series will cover multidimensional arrays, the anatomy of a D array, passing D arrays to and receiving C arrays from C functions, and how the GC fits into the picture.

My original concept for covering this topic was much smaller in scope; my intent was to brush over the boring details and assume that readers would know enough of the basics of C to derive the why from the what and the how. That was before I gave a D tutorial presentation to a group among whom only one person had any experience with C. I’ve also become more aware that there are regular users of the D forums who have never touched a line of C. As such, I’ll be covering a lot more ground than I otherwise would have (hence a two-part article has morphed into at least three). I urge those for whom much of said ground is old hat not to get complacent in their skimming of the page! A comfortable experience with C is more apt than none at all to obscure some of the pitfalls I describe.

Array declarations

Let’s start with a simple declaration of a one-dimensional array:

int c0[3];

This declaration reserves enough memory to hold three int values, on the stack when c0 is a non-static local variable. The values are stored contiguously in memory, one right after the other. c0 may or may not be initialized, depending on where it’s declared. Global variables and static local variables are default initialized to 0, as the following C program demonstrates.

definit.c

#include <stdio.h>

// global (can also be declared static)
int c1[3];

int main(int argc, char** argv)
{
    static int c2[3];       // static local
    int c3[3];              // non-static local

    printf("one: %i %i %i\n", c1[0], c1[1], c1[2]);
    printf("two: %i %i %i\n", c2[0], c2[1], c2[2]);
    printf("three: %i %i %i\n", c3[0], c3[1], c3[2]);
}

For me, this prints:

one: 0 0 0
two: 0 0 0
three: -1 8 0

The values for c3 just happened to be lying around at that memory location. Now for the equivalent D declaration:

int[3] d0;

Here we can already find the first gotcha.

A general rule of thumb in D is that C code pasted into a D source file should either work as it does in C or fail to compile. For a long while, C array declaration syntax fell into the former category and was a legal alternative to the D syntax. It has since been deprecated and subsequently removed from the language, meaning int d0[3] will now cause the compiler to scold you:

Error: instead of C-style syntax, use D-style int[3] d0

It may seem an arbitrary restriction, but it really isn’t. At its core, it’s about consistency at a couple of different levels.

One is that we read declarations in D from right to left. In the declaration of d0, everything flows from right to left in the same order that we say it: “(d0) is an (array of three) (integers)”. The same is not true of the C-style declaration.

Another is that the type of d0 is actually int[3]. Consider the following pointer declarations:

int* p0, p1;

The type of both p0 and p1 is int* (in C, only p0 would be a pointer; p1 would simply be an int). It’s the same as all type declarations in D—type on the left, symbol on the right. Now consider this:

int d1[3], d2[3];
int[3] d4, d5;

Having two different syntaxes for array declarations, with one that splits the type like an infinitive, sets the stage for the production of inconsistent and potentially confusing code. By making the C-style syntax illegal, consistency is enforced. Code readability is a key component of maintainability.
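The claim that the whole type sits on the left can be checked by the compiler itself. The following small sketch uses static assert, so it fails to build if any of the stated types is wrong.

```d
import std.stdio;

void main() {
    // Unlike in C, both variables are pointers.
    int* p0, p1;
    static assert(is(typeof(p0) == int*));
    static assert(is(typeof(p1) == int*));

    // The same rule applies to arrays: both are int[3].
    int[3] d4, d5;
    static assert(is(typeof(d4) == int[3]));
    static assert(is(typeof(d5) == int[3]));

    writeln("all declarations have the expected types");
}
```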

Another difference between d0 and c0 is that the elements of d0 will be default initialized no matter where or how it’s declared. Module scope, local scope, static local… it doesn’t matter. Unless the compiler is told otherwise, variables in D are always default initialized to the predefined value specified by the init property of each type. Array elements are initialized to the init property of the element type. As it happens, int.init == 0. Translate definit.c to D and see it for yourself (open up run.dlang.io and give it a go).
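One possible translation of definit.c is sketched below. The helper functions staticCopy and localCopy are additions for this example; they exist only so that each array can be inspected from outside the scope that declares it.

```d
import std.stdio;

int[3] d1;                      // module scope

int[3] staticCopy() {
    static int[3] d2;           // static local
    return d2;
}

int[3] localCopy() {
    int[3] d3;                  // non-static local, still default initialized
    return d3;
}

void main() {
    writeln("one: ", d1);           // one: [0, 0, 0]
    writeln("two: ", staticCopy()); // two: [0, 0, 0]
    writeln("three: ", localCopy());// three: [0, 0, 0]
}
```

Unlike the C version, the third line is not left to chance: the non-static local is initialized to int.init like the others.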

When translating C to D, this default initialization business is a subtle gotcha. Consider this innocently contrived C snippet:

// static variables are default initialized to 0 in C
static float vertex[3];
some_func_that_expects_inited_vert(vertex);

A direct translation straight to D will not produce the expected result, as float.init is float.nan, not 0.0f!

When translating between the two languages, always be aware of which C variables are not explicitly initialized, which are expected to be initialized, and the default initialization value for each of the basic types in D. Failure to account for the subtleties may well lead to debugging sessions of the hair-pulling variety.
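The float case can be demonstrated directly. In this sketch, directTranslation mirrors the C declaration without an initializer, and fixedTranslation adds the explicit 0.0f that C supplied implicitly; both function names and the allNaN helper are made up for the example.

```d
import std.math : isNaN;
import std.stdio;

// True when every element is NaN. Note that x == float.nan is always
// false, so isNaN is the right test.
bool allNaN(float[3] xs) {
    foreach (x; xs)
        if (!isNaN(x)) return false;
    return true;
}

float[3] directTranslation() {
    float[3] vertex;            // every element is float.init, i.e. NaN
    return vertex;
}

float[3] fixedTranslation() {
    float[3] vertex = 0.0f;     // explicit initializer restores the C behavior
    return vertex;
}

void main() {
    writeln(allNaN(directTranslation()));
    writeln(fixedTranslation());
}
```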

Default initialization can easily be disabled in D with = void in the declaration. This is particularly useful for arrays that are going to be loaded with values before they’re read, or that contain elements with an init value that isn’t very useful as anything other than a marker of uninitialized variables.

float[16] matrix = void;
setIdentity(matrix);
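To make the snippet above runnable, we need some definition of setIdentity. The one below is an assumption for illustration: it treats the 16 floats as a 4x4 matrix and writes every element, which is precisely the situation where skipping default initialization is safe.

```d
import std.stdio;

// Hypothetical implementation: fills all 16 elements of a 4x4 matrix.
void setIdentity(ref float[16] m) {
    foreach (i, ref e; m)
        e = (i % 5 == 0) ? 1.0f : 0.0f;    // indexes 0, 5, 10, 15 form the diagonal
}

float[16] makeIdentity() {
    float[16] matrix = void;    // contents are garbage at this point
    setIdentity(matrix);        // every element is written before any read
    return matrix;
}

void main() {
    writeln(makeIdentity());
}
```

Reading matrix before the call to setIdentity would yield arbitrary values, just like the non-static local in the C program earlier.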

On a side note, the purpose of default initialization is not to provide a convenient default value, but to make uninitialized variables stand out (a fact you may come to appreciate in a future debugging session). A common mistake is to assume that types like float and char, with their “not a number” (float.nan) and invalid UTF-8 (0xFF) initializers, are the oddball outliers. Not so. Those values are great markers of uninitialized memory because they aren’t useful for much else. It’s the integer types (and bool) that break the pattern. For these types, the entire range of values has potential meaning, so there’s no single value that universally shouts “Hey! I’m uninitialized!”. As such, integer and bool variables are often left with their default initializer since 0 and false are frequently the values one would pick for explicit initialization for those types. Floating point and character values, however, should generally be explicitly initialized or assigned to as soon as possible.

Explicit array initialization

C allows arrays to be explicitly initialized in different ways:

int ci0[3] = {0, 1, 2};  // [0, 1, 2]
int ci1[3] = {1};        // [1, 0, 0]
int ci2[]  = {0, 1, 2};  // [0, 1, 2]
int ci3[3] = {[2] = 2, [0] = 1}; // [1, 0, 2]
int ci4[]  = {[2] = 2, [0] = 1}; // [1, 0, 2]

What we can see here is:

  • elements are initialized sequentially with the constant values in the initializer list
  • if there are fewer values in the list than array elements, then all remaining elements are initialized to 0 (as seen in ci1)
  • if the array length is omitted from the declaration, the array takes the length of the initializer list (ci2)
  • designated initializers, as in ci3, allow specific elements to be initialized with [index] = value pairs, and indexes not in the list are initialized to 0
  • when the length is omitted from the declaration and a designated initializer is used, the array length is based on the highest index in the initializer and elements at all unlisted indexes are initialized to 0, as seen in ci4

Initializers aren’t supposed to be longer than the array (gcc gives a warning and initializes a three-element array to the first three initializers in the list, ignoring the rest).

Note that it’s possible to mix the designated and non-designated syntaxes in a single initializer:

// [0, 1, 0, 5, 0, 0, 0, 8, 44]
int ci5[] = {0, 1, [3] = 5, [7] = 8, 44};

Each value without a designation is applied in sequential order as normal. If there is a designated initializer immediately preceding it, then it becomes the value for the next index, and all other elements are initialized to 0. Here, 0 and 1 go to indexes ci5[0] and ci5[1] as normal, since they are the first two values in the list. Next comes a designator for ci5[3], so ci5[2] has no corresponding value in this list and is initialized to 0. Next comes the designator for ci5[7]. We have skipped ci5[4], ci5[5], and ci5[6], so they are all initialized to 0. Finally, 44 lacks a designator, but immediately follows [7], so it becomes the value for the element at ci5[8]. In the end, ci5 is initialized to a length of 9 elements.

Also note that designated array initializers were added to C in C99. Some C compiler versions either don’t support the syntax or require a special command line flag to enable it. As such, it’s probably not something you’ll encounter very much in the wild, but still useful to know about when you do.

Translating all of these to D opens the door to more gotchas. Thankfully, the first one is a compiler error and won’t cause any heisenbugs down the road:

int[3] wrong = {0, 1, 2};
int[3] right = [0, 1, 2];

Array initializers in D are array literals. The same syntax can be used to pass anonymous arrays to functions, as in writeln([0, 1, 2]). For the curious, the declaration of wrong produces the following compiler error:

Error: a struct is not a valid initializer for a int[3]

The {} syntax is used for struct initialization in D (not to be confused with struct literals, which can also be used to initialize a struct instance).

The next surprise comes in the translation of ci1.

// int ci1[3] = {1};
int[3] di1 = [1];

This actually produces a compiler error:

Error: mismatched array lengths, 3 and 1

What gives? First, take a look at the translation of ci2:

// int ci2[] = {0, 1, 2};
int[] di2 = [0, 1, 2];

In the C code, there is no difference between ci1 and ci2. They both are fixed-length, three-element arrays allocated on the stack. In D, this is one case where that general rule of thumb about pasting C code into D source modules breaks down.

Unlike C, D actually makes a distinction between arrays of types int[3] and int[]. The former is, like C, a fixed-length array, commonly referred to in D as a static array. The latter, unlike C, is a dynamic-length array, commonly referred to as a dynamic array or a slice. Its length can grow and shrink as needed.

Initializers for static arrays must have the same length as the array. D simply does not allow initializers shorter than the declared array length. Dynamic arrays take the length of their initializers. di2 is initialized with three elements, but more can be appended. Moreover, the initializer is not required for a dynamic array. In C, int foo[]; is illegal, as the length can only be omitted from the declaration when an initializer is present.

// gcc says "error: array size missing in 'illegalC'"
// int illegalC[];
int[] legalD;
legalD ~= 10;

legalD is an empty array, with no memory allocated for its elements. Elements can be added via the append operator, ~=.

Memory for dynamic arrays is allocated at the point of declaration only when an explicit initializer is provided, as with di2. If no initializer is present, memory is allocated when the first element is appended. By default, dynamic array memory is allocated from the GC heap (though the compiler may determine that it’s safe to allocate on the stack as an optimization). Space for more elements than needed is allocated in order to reduce the need for future allocations; the reserve function can be used to allocate a large block in one go, without initializing any elements. Appended elements go into the preallocated slots until none remain, then the next append triggers a new allocation. Steven Schveighoffer’s excellent array article goes into the details, and also describes array features we’ll touch on in the next part.

Often, when translating a declaration like ci2 to D, the difference between the fixed-length, stack-allocated C array and the dynamic-length, GC-allocated D array isn’t going to matter one iota. One case where it does matter is when the D array is declared inside a function marked @nogc:

@nogc void main()
{
    int[] di2 = [0, 1, 2];
}

The compiler ain’t letting you get away with that:

Error: array literal in @nogc function D main may cause a GC allocation

The same error isn’t triggered when the array is static, since it’s allocated on the stack and the literal elements are just shoved right in there. C programmers coming to D for the first time tend to reach for @nogc almost as if it goes against their very nature not to, so this is something they will bump into until they eventually come to the realization that the GC is not the enemy of the people.

To wrap this up, that big paragraph on designated array initializers in C is about to pull double duty. D also supports designated array initializers, just with a different syntax.

// [0, 1, 0, 5, 0, 0, 0, 8, 44]
// int ci5[] = {0, 1, [3] = 5, [7] = 8, 44};
int[] di5 = [0, 1, 3:5, 7:8, 44];
int[9] di6 = [0, 1, 3:5, 7:8, 44];

It works with both static and dynamic arrays, following the same rules and producing the same initialization values as in C.
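
To compare the two languages side by side, here is a minimal C sketch built around the ci5 declaration quoted above. The function copy_ci5 and its parameters are hypothetical names chosen for this example, and the code assumes a compiler with C99 designated initializers enabled.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies ci5 into dst when dst can hold it and
   returns the number of elements in ci5 either way.
   A matching D declaration would presumably be:
       extern(C) size_t copy_ci5(int* dst, size_t cap);            */
size_t copy_ci5(int *dst, size_t cap)
{
    /* [3] = 5 sets index 3, [7] = 8 sets index 7, and 44 lands on
       index 8; every index not mentioned is zero. */
    int ci5[] = {0, 1, [3] = 5, [7] = 8, 44};
    size_t count = sizeof ci5 / sizeof ci5[0];

    if (dst != NULL && cap >= count)
        memcpy(dst, ci5, sizeof ci5);
    return count;
}
```

Called with a nine-element buffer, it fills the buffer with the same values that di5 and di6 hold on the D side.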

The main takeaways from this section are:

  • there is a distinction in D between static and dynamic arrays; in C there is not
  • static arrays are allocated on the stack
  • dynamic arrays are allocated on the GC heap
  • uninitialized static arrays are default initialized to the init property of the array elements
  • dynamic arrays can be explicitly initialized and take the length of the initializer
  • dynamic arrays cannot be explicitly initialized in @nogc scopes
  • uninitialized dynamic arrays are empty

This is the time on the D Blog when we dance

There are a lot more words in the preceding sections than I had originally intended to write about array declarations and initialization, and I still have quite a bit more to say about arrays. In the next post, we’ll look at the anatomy of a D array and dig into the art of passing D arrays across the language divide.

Interfacing D with C: Getting Started

One of the early design goals behind the D programming language was the ability to interface with C. To that end, it provides ABI compatibility, allows access to the C standard library, and makes use of the same object file formats and system linkers that C and C++ compilers use. Most built-in D types, even structs, are directly compatible with their C counterparts and can be passed freely to C functions, provided the functions have been declared in D with the appropriate linkage attribute. In many cases, one can copy a chunk of C code, paste it into a D module, and compile it with minimal adjustment. Conversely, appropriately declared D functions can be called from C.

That’s not to say that D carries with it all of C’s warts. It includes features intended to eliminate, or more easily avoid, some of the errors that are all too easy to make in C. For example, bounds checking of arrays is enabled by default, and a safe subset of the language provides compile-time enforcement of memory safety. D also changes or avoids some things that C got wrong, such as what Walter Bright sees as C’s biggest mistake: conflating pointers with arrays. It’s in these differences of implementation that surprises lurk for the uninformed.

This post is the first in a series exploring the interaction of D and C in an effort to inform the uninformed. I’ve previously written about the basics of this topic in an article at GameDev.net, and in my book, ‘Learning D’, where the entirety of Chapter 9 covers it in depth.

This blog series will focus on those aforementioned corner cases so that it’s not necessary to buy the book or to employ trial and error in order to learn them. As such, I’ll leave the basics to the GameDev.net article and recommend that anyone interfacing D with C (or C++) give it a read along with the official documentation.

The C and D code that I provide to highlight certain behavior is intended to be compiled and linked by the reader. The code demonstrates both error and success conditions. Recognizing and understanding compiler errors is just as important as knowing how to fix them, and seeing them in action can help toward that end. That implies some prerequisite knowledge of compiling and linking C and D source files. Happily, that’s the focus of the next section of this post.

For the C code, we’ll be using the Digital Mars C/C++ and Microsoft C/C++ compilers on Windows, and GCC and Clang elsewhere. On the D side, we’ll be working exclusively with the D reference compiler, DMD. Windows users unfamiliar with setting up DMD to work with the Microsoft tools will be well served by the post on this blog titled, ‘DMD, Windows, and C’.

We’ll finish the post with a look at one of the corner cases, one that is likely to rear its head early on in any exploration of interfacing D with C, particularly when creating bindings to existing C libraries.

Compiling and linking

The articles in this series will present example C source code that is intended to be saved and compiled into object files for linking with D programs. The command lines for generating the object files look pretty much the same on every platform, with a couple of caveats. We’ll look first at Windows, then lump all the other supported systems together in a single section.

In the next two sections, we’ll be working with the following C and D source files. Save them in the same directory (for convenience) and make sure to keep the names distinct. If both files have the same name in the same directory, then the object files created by the C compiler and DMD will also have the same name, causing the latter to overwrite the former. There are compiler switches to get around this, but for a tutorial we’re better off keeping the command lines simple.

chello.c

#include <stdio.h>
void say_hello(void) 
{
    puts("Hello from C!");
}

hello.d

extern(C) void say_hello();
void main() 
{
    say_hello();
}

The extern(C) bit in the declaration of the C function in the D code is a linkage attribute. That’s covered by the other material I referenced above, but it’s a potential gotcha we’ll look at later in this series.

Windows

The official DMD packages for Windows, available at dlang.org as a zip archive and an installer, are the only released versions of DMD that do not require any additional tooling to be installed as a prerequisite to compile D files. These packages ship with everything they need to compile 32-bit executables in the OMF format (again, I refer you to ‘DMD, Windows, and C’ for the details).

When linking any foreign object files with a D program, it’s important that the object file format and architecture match the D compiler output. The former is an issue primarily on Windows, while attention must be paid to the latter on all platforms.

Compiling C source to a format compatible with vanilla DMD on Windows requires the Digital Mars C/C++ compiler. It’s a free download and ships with some of the same tools as DMD. It outputs object files in the OMF format. With both it and DMD installed and on the system path, the above source files can be compiled, linked, and executed like so:

dmc -c chello.c
dmd hello.d chello.obj
hello

The -c option tells DMC to forgo linking, so that it only compiles the C source and writes out the object file chello.obj.

To get 64-bit output on Windows, DMC is not an option. In that case, DMD requires the Microsoft build tools. Once the MS build tools are installed and set up, open the preconfigured x64 Native Tools Command Prompt from the Start menu and execute the following commands (again, see ‘DMD, Windows, and C’ on this blog for information on how to get the Microsoft build tools and open the preconfigured command prompt, which may have a slightly different name depending on the version of Visual Studio or the MS Build Tools installed):

cl /c chello.c
dmd -m64 hello.d chello.obj
hello

Again, the /c option tells the compiler not to link. To produce 32-bit output with the MS compiler, open a preconfigured x86 Native Tools Command Prompt and execute these commands:

cl /c chello.c
dmd -m32mscoff hello.d chello.obj
hello

DMD recognizes the -m32 switch on Windows, but that tells it to produce 32-bit OMF output (the default), which is not compatible with Microsoft’s linker, so we must use -m32mscoff here instead.

Other platforms

On the other platforms D supports, the system C compiler is likely going to be GCC or Clang, one of which you will already have installed if you have a functioning dmd command. On macOS, clang can be installed via Xcode in the App Store. Most Linux and BSD systems have a GCC package available, such as via the often recommended command line, apt-get install build-essential, on Debian and Debian-based systems. Please see the documentation for your system for details.

On these systems, the environment variable CC is often set to the system compiler command. Feel free to substitute either gcc or clang for CC in the lines below as appropriate for your system.

CC -c chello.c
dmd hello.d chello.o
./hello

This will produce either 32-bit or 64-bit output, depending on your system configuration. If you are on a 64-bit system and have 32-bit developer tools installed, you can pass -m32 to both CC and dmd to generate 32-bit binaries.
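
If you are unsure which architecture a given object file was built for, one quick sanity check is to have the C side report its pointer width. This is only a sketch; pointer_bits is a hypothetical name made up for this example, not something the compilers provide.

```c
#include <limits.h>

/* Hypothetical helper: width in bits of a pointer in this object file.
   A matching D declaration would presumably be:
       extern(C) uint pointer_bits();                              */
unsigned pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}
```

On the platforms discussed here, the result is 32 for 32-bit output and 64 for 64-bit output, which can be compared against (void*).sizeof * 8 in the D program.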

The long way

Now that we’re configured to compile and link C and D source in the same binary, let’s take a look at a rather common gotcha. To fully appreciate this one, it helps to compile it on both Windows and another platform.

One of the features of D is that all of the integral types have a fixed size. A short is always 2 bytes and an int is always 4. This never changes, no matter the underlying system architecture. This is quite different from C, where the spec only imposes relative requirements on the size of each integral type and leaves the specifics to the implementation. Even so, there are wide areas of agreement across modern compilers such that, on every platform D currently supports, the sizes of almost all the C integral types match those of their D counterparts. The exceptions are long and ulong.

In D, long and ulong are always 8 bytes across all platforms. This never changes. It lines up with the corresponding C types just fine on most 64-bit systems under the version(Posix) umbrella, where the C long and unsigned long are also 8 bytes. However, they are 4 bytes on 32-bit architectures. Moreover, they’re always 4 bytes on Windows, even on a 64-bit architecture.

Most C code these days will account for these differences either by using the preprocessor to define custom integral types or by making use of the C99 stdint.h where types such as int32_t and int64_t are unambiguously defined. Yet, it’s still possible to encounter C libraries using long in the wild.
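
As a rough illustration of the stdint.h approach, consider the sketch below. Both function names are hypothetical, made up for this example. The first reports the size of the C long for whatever compiler and target built the file; the second uses a fixed-width type, so its size never varies.

```c
#include <stddef.h>
#include <stdint.h>

/* Size in bytes of the C long for this compiler and target. */
size_t long_size(void)
{
    return sizeof(long);
}

/* uint64_t is 8 bytes wherever it is defined, so a D declaration
   using ulong, such as
       extern(C) ulong largest_u64();
   should line up with it on every platform. */
uint64_t largest_u64(void)
{
    return UINT64_MAX;
}
```

On the platforms discussed in this post, long_size returns 4 or 8 depending on the platform and architecture, while largest_u64 has the same return type size everywhere.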

Consider the following C function:

maxval.c

#include <limits.h>
unsigned long max_val(void)
{
    return ULONG_MAX;
}

The naive D implementation looks like this:

showmax1.d

extern(C) ulong max_val();
void main()
{
    import std.stdio : writeln;
    writeln(max_val());
}

What this does depends on the C compiler and architecture. For example, on Windows with dmc I get 7316910580432895, with x86 cl I get 59663353508790271, and 4294967295 with x64 cl. The last one is actually the correct value, even though the size of the unsigned long on the C side is still 4 bytes as it is in the other two scenarios. I assume this is because the x64 ABI stores return values in the 8-byte RAX register, so it can be read into the 8-byte ulong on the D side with no corruption. The important point here is that the two values in the x86 code are garbage because the D side is expecting a 64-bit return value from 32-bit registers, so it’s reading more than it’s being given.
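
The two garbage values are not entirely random, which a little arithmetic makes visible. The sketch below uses hypothetical helper names and works only on the numbers quoted above; it splits a 64-bit value into its two 32-bit halves. My reading, stated as an assumption rather than something verified on the author's machine, is that the low half is the real return value and the high half is whatever happened to be sitting in the second register the D side read.

```c
#include <stdint.h>

/* Lower 32 bits of a 64-bit value. */
uint32_t low_half(uint64_t value)
{
    return (uint32_t)(value & 0xFFFFFFFFu);
}

/* Upper 32 bits of a 64-bit value. */
uint32_t high_half(uint64_t value)
{
    return (uint32_t)(value >> 32);
}
```

For 7316910580432895, low_half gives 4294967295 and high_half gives 1703600. For 59663353508790271, low_half again gives 4294967295 and high_half gives 13891456. In both cases the correct ULONG_MAX is sitting in the low 32 bits, with leftover data above it.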

Thankfully, DRuntime provides a way around this in core.stdc.config, where you’ll find c_long and c_ulong. Both of these are conditionally configured to match the compile-time C runtime implementation and architecture configuration. With this, all that’s needed is to change the declaration of max_val in the D module, like so:

showmax2.d

import core.stdc.config : c_ulong;
extern(C) c_ulong max_val();

void main()
{
    import std.stdio : writeln;
    writeln(max_val());
}

Compile and run with this and you’ll find it does the right thing everywhere. On Windows, it’s 4294967295 across the board.

Though less commonly encountered, core.stdc.config also declares a portable c_long_double type to match any long double that might pop up in a C library to which a D module must bind.

Looking ahead

In this post, we’ve gotten set up to compile and link C and D in the same executable and have looked at the first of several potential problem spots. We used DMD here, but it should be possible to substitute one of the other D compilers (ldc or gdc) without changing the command line (with the exception of -m32mscoff, which is specific to DMD). The next post in this series will focus entirely on getting D arrays and C arrays to cooperate. See you there!