{"id":2878,"date":"2021-05-24T13:48:44","date_gmt":"2021-05-24T13:48:44","guid":{"rendered":"https:\/\/dlang.org\/blog\/?p=2878"},"modified":"2021-09-30T13:30:11","modified_gmt":"2021-09-30T13:30:11","slug":"interfacing-d-with-c-strings-part-one","status":"publish","type":"post","link":"https:\/\/dlang.org\/blog\/2021\/05\/24\/interfacing-d-with-c-strings-part-one\/","title":{"rendered":"Interfacing D with C: Strings Part One"},"content":{"rendered":"<p><img loading=\"lazy\" src=\"https:\/\/dlang.org\/blog\/wp-content\/uploads\/2016\/08\/d6.png\" alt=\"Digital Mars D logo\" width=\"200\" height=\"200\" class=\"alignleft size-full wp-image-181\" \/><\/p>\n<p>This post is part of <a href=\"https:\/\/dlang.org\/blog\/the-d-and-c-series\/\">an ongoing series<\/a> on working with both D and C in the same project. The previous two posts looked into interfacing D and C arrays. Here, we focus on a special kind of array: strings. Readers are advised to read <a href=\"https:\/\/dlang.org\/blog\/2018\/10\/17\/interfacing-d-with-c-arrays-part-1\/\">Arrays Part One<\/a> and <a href=\"https:\/\/dlang.org\/blog\/2020\/04\/28\/interfacing-d-with-c-arrays-and-functions-arrays-part-two\/\">Arrays Part Two<\/a> before continuing with this one.<\/p>\n<h3 id=\"thesamebutdifferent\">The same but different<\/h3>\n<p>D strings and C strings are both implemented as arrays of character types, but they have nothing more in common. Even that one similarity is only superficial. 
We&#8217;ve seen in previous blog posts that D arrays and C arrays are different under the hood: a C array is effectively a pointer to the first element of the array (or, in C parlance, <a href=\"https:\/\/stackoverflow.com\/questions\/1461432\/what-is-array-to-pointer-decay\">C arrays decay to pointers<\/a>, except <a href=\"https:\/\/stackoverflow.com\/questions\/17752978\/exceptions-to-array-decaying-into-a-pointer\">when they don&#8217;t<\/a>); a D dynamic array is a <em>fat pointer<\/em>, i.e., a length and pointer pair. A D array does not decay to a pointer, i.e., it cannot be implicitly assigned to a pointer or bound to a pointer parameter in an argument list. Example:<\/p>\n<pre class=\"prettyprint lang-d\">extern(C) void metamorphose(int* a, size_t len);\n\nvoid main() {\n    int[] a = [8, 4, 30];\n    metamorphose(a, a.length);      \/\/ Error - a is not int*\n    metamorphose(a.ptr, a.length);  \/\/ Okay\n}<\/pre>\n<p>Beyond that, we&#8217;ve got further incompatibilities:<\/p>\n<ul>\n<li>each of D&#8217;s three string types, <code>string<\/code>, <code>wstring<\/code>, and <code>dstring<\/code>, is encoded as Unicode: UTF-8, UTF-16, and UTF-32 respectively. The C <code>char*<\/code> <em>can<\/em> be encoded as UTF-8, but it isn&#8217;t required to be. Then there&#8217;s the C <code>wchar_t*<\/code>, which differs in bit size between implementations, never mind encoding.<\/li>\n<li>all of D&#8217;s string types are dynamic arrays with immutable contents, i.e., <code>string<\/code> is an alias to <code>immutable(char)[]<\/code>. C strings are mutable by default.<\/li>\n<li>the last character of every C string is required to be the NUL character (the escape character <code>\\0<\/code>, which is encoded as <code>0<\/code> in most character sets); D strings are not required to be NUL-terminated.<\/li>\n<\/ul>\n<p>It may appear at first blush as if passing D and C strings back and forth can be a major headache. In practice, that isn&#8217;t the case at all. 
In this and subsequent posts, we&#8217;ll see how easy it can be. In this post, we start by looking at how we can deal with NUL termination and wrap up by digging deeper into the related topic of how string literals are stored in memory.<\/p>\n<h3 id=\"nultermination\">NUL termination<\/h3>\n<p>Let&#8217;s get this out of the way first: when passing a D string to C, the programmer must ensure it is terminated with <code>\\0<\/code>. <a href=\"https:\/\/dlang.org\/phobos\/std_string.html#.toStringz\"><code>std.string.toStringz<\/code><\/a>, a simple utility function in <a href=\"https:\/\/dlang.org\/phobos\/index.html\">the D standard library (Phobos)<\/a>, can be employed for this:<\/p>\n<pre class=\"prettyprint lang-d\">import core.stdc.stdio : puts;\nimport std.string : toStringz;\n\nvoid main() {\n    string s0 = &quot;Hello C &quot;;\n    string s1 = s0 ~ &quot;from D!&quot;;\n    puts(s1.toStringz());\n}<\/pre>\n<p><code>toStringz<\/code> takes a single argument of type <code>const(char)[]<\/code> and returns <code>immutable(char)*<\/code> (there&#8217;s more about <code>const<\/code> vs. <code>immutable<\/code> in Part Two). The form <code>s1.toStringz<\/code>, known as <a href=\"https:\/\/tour.dlang.org\/tour\/en\/gems\/uniform-function-call-syntax-ufcs\">UFCS (Uniform Function Call Syntax)<\/a>, is lowered by the compiler into <code>toStringz(s1)<\/code>.<\/p>\n<p><code>toStringz<\/code> is the idiomatic approach, but it&#8217;s also possible to append <code>&quot;\\0&quot;<\/code> manually. 
In that case, <code>puts<\/code> can be supplied with the string&#8217;s pointer directly:<\/p>\n<pre class=\"prettyprint lang-d\">import core.stdc.stdio : puts;\n\nvoid main() {\n    string s0 = &quot;Hello C &quot;;\n    string s1 = s0 ~ &quot;from D!&quot; ~ &quot;\\0&quot;;\n    puts(s1.ptr);\n}<\/pre>\n<p>Forgetting to use <code>.ptr<\/code> will result in a compilation error, but forget to append the <code>&quot;\\0&quot;<\/code> and who knows when someone will catch it (possibly after a crash in production and one of those marathon debugging sessions which can make some programmers wish they had never heard of programming). So prefer <code>toStringz<\/code> to avoid such headaches.<\/p>\n<p>However, because strings in D are immutable, <code>toStringz<\/code> does allocate memory from the GC heap. The same is true when manually appending <code>\"\\0\"<\/code> with the append operator. If there&#8217;s a requirement to avoid garbage collection at the point where the C function is called, e.g., in a <code>@nogc<\/code> function or when <code>-betterC<\/code> is enabled, it will have to be done in the same manner as in C, e.g., by allocating\/reallocating space with <code>malloc\/realloc<\/code> (or some other allocator) and copying the NUL terminator. (Also note that, in some situations, passing pointers to GC-managed memory from D to C can result in unintended consequences. We&#8217;ll dig into what that means, and how to avoid it, in Part Two.)<\/p>\n<p>None of this applies when we&#8217;re dealing directly with string literals, as they get a bit of special treatment from the compiler that makes <code>puts(&quot;Hello D from C!&quot;.toStringz)<\/code> redundant. Let&#8217;s see why.<\/p>\n<h4 id=\"stringliteralsindarespecial\">String literals in D are special<\/h4>\n<p>D programmers very often find themselves passing string literals to C functions. 
Walter Bright recognized early on how common this would be and decided that it needed to be just as seamless in D as it is in C. So he implemented string literals in a way that mitigates the two major incompatibilities that arise from NUL terminators and differences in array internals:<\/p>\n<ol>\n<li>D string literals are implicitly NUL-terminated.<\/li>\n<li>D string literals are implicitly convertible to <code>const(char)*<\/code>.<\/li>\n<\/ol>\n<p>These two features may seem minor, but they are quite major in terms of convenience. That&#8217;s why I didn&#8217;t pass a literal to <code>puts<\/code> in the <code>toStringz<\/code> example. With a literal, it would look like this:<\/p>\n<pre class=\"prettyprint lang-d\">import core.stdc.stdio : puts;\n\nvoid main() {\n    puts(&quot;Hello C from D!&quot;);\n}<\/pre>\n<p>No need for <code>toStringz<\/code>. No need for manual NUL termination or <code>.ptr<\/code>. It just works.<\/p>\n<p>I want to emphasize that this only applies to string <em>literals<\/em> (of type <code>string<\/code>, <code>wstring<\/code>, and <code>dstring<\/code>) and not to string variables; once a string literal is included in an expression, the NUL-termination guarantee goes out the window. Also, no other array literal type is implicitly convertible to a pointer, so the <code>.ptr<\/code> property must be used to bind them to a pointer function parameter, e.g., <code>giveMeIntPointer([1, 2, 3].ptr)<\/code>.<\/p>\n<p>But there is a little more to this story.<\/p>\n<h4 id=\"stringliteralsinmemory\">String literals in memory<\/h4>\n<p>Normal array literals will usually trigger a GC allocation (unless the compiler can elide the allocation, such as when assigning the literal to a static array). 
Let&#8217;s do a bit of digging to see what happens with a D string literal:<\/p>\n<pre class=\"prettyprint lang-d\">import std.stdio;\n\nvoid main() {\n    writeln(&quot;Where am I?&quot;);\n}<\/pre>\n<p>To make use of a command-line tool particularly convenient for this example, I compiled the above on 64-bit Linux with all three major compilers using the following command lines:<\/p>\n<pre>dmd -ofdmd-memloc memloc.d\ngdc -o gdc-memloc memloc.d\nldc2 -ofldc-memloc memloc.d<\/pre>\n<p>If we were compiling C or C++, we could expect to find string literals in the read-only data segment, <code>.rodata<\/code>, of the binary. So let&#8217;s look there via the <code>readelf<\/code> command, which allows us to extract specific data from binaries in the ELF object file format, to see if the same thing happens with D. The following is abbreviated output for each binary:<\/p>\n<pre>readelf -x .rodata .\/dmd-memloc | less\nHex dump of section '.rodata':\n  0x0008e000 01000200 00000000 00000000 00000000 ................\n  0x0008e010 04100000 00000000 6d656d6c 6f630000 ........memloc..\n  0x0008e020 57686572 6520616d 20493f00 2f757372 Where am I?.\/usr\n  0x0008e030 2f696e63 6c756465 2f646d64 2f70686f \/include\/dmd\/pho\n...\n\nreadelf -x .rodata .\/gdc-memloc | less\nHex dump of section '.rodata':\n  0x00003000 01000200 00000000 57686572 6520616d ........Where am\n  0x00003010 20493f00 00000000 2f757372 2f6c6962  I?.....\/usr\/lib\n...\n\nreadelf -x .rodata .\/ldc-memloc | less\nHex dump of section '.rodata':\n  0x00001e40 57686572 6520616d 20493f00 00000000 Where am I?.....\n  0x00001e50 2f757372 2f6c6962 2f6c6463 2f783836 \/usr\/lib\/ldc\/x86<\/pre>\n<p>In all three cases, the string is right there in the read-only data segment. 
The D spec explicitly avoids specifying where a string literal will be stored, but in practice, we can bank on the following: it might be in the binary&#8217;s read-only segment, or it might be in the normal data segment, but it won&#8217;t trigger a GC allocation, and it won&#8217;t be allocated on the stack.<\/p>\n<p>Wherever it is, there&#8217;s a positive consequence that we can sometimes take advantage of. Notice in the <code>readelf<\/code> output that there is a dot (<code>.<\/code>) immediately following the question mark at the end of each string. That represents the NUL terminator. It is not counted in the string&#8217;s <code>.length<\/code> (so <code>&quot;Where am I?&quot;.length<\/code> is 11 and not 12), but it&#8217;s still there. So when we initialize a string variable with a string literal or assign a string literal to a variable, the lack of an allocation also means there&#8217;s no copying, which in turn means the variable is pointing to the literal&#8217;s location in memory. And that means we can safely do this:<\/p>\n<pre class=\"prettyprint lang-d\">import core.stdc.stdio: puts;\n\nvoid main() {\n    string s = &quot;I'm NUL-terminated.&quot;;\n    puts(s.ptr);\n    s = &quot;And so am I.&quot;;\n    puts(s.ptr);\n}<\/pre>\n<p>If you&#8217;ve read <a href=\"https:\/\/dlang.org\/blog\/the-gc-series\/\">the GC series on this blog<\/a>, you are aware that the GC can only have a chance to run a collection if an attempt is made to allocate from the GC heap. More allocations mean a higher chance to trigger a collection and more memory that needs to be scanned when a collection runs. Many applications may never notice, but it&#8217;s a good policy to avoid GC allocations when it&#8217;s easy to do so. 
The above is a good example of just that: <code>toStringz<\/code> allocates, we don&#8217;t need it in either call to <code>puts<\/code> because we can trust that <code>s<\/code> is NUL-terminated, so we don&#8217;t use it.<\/p>\n<p>To be very clear: this is only true for string variables that have been directly initialized with a string literal or assigned one. If the value of the variable was the result of any other operation, then it cannot be considered NUL-terminated. Examples:<\/p>\n<pre class=\"prettyprint lang-d\">string s1 = s ~ &quot;...I'm Unreliable!!&quot;;\nstring s2 = s ~ s1;\nstring s3 = format(&quot;I'm %s!!&quot;, &quot;Unreliable&quot;);<\/pre>\n<p>None of these strings can be considered NUL-terminated. Each case will trigger a GC allocation. The runtime pays no mind to the NUL terminator of any of the literals during the append operations or in the <code>format<\/code> function, so the programmer can&#8217;t trust it will be part of the result. Pass any one of these strings to C without first terminating it and trouble will eventually come knocking.<\/p>\n<h4 id=\"butholdon...\">But hold on&#8230;<\/h4>\n<p>Given that you&#8217;re reading a D blog, you&#8217;re probably adventurous or like experimenting. That may lead you to discover another case that looks reliable:<\/p>\n<pre class=\"prettyprint lang-d\">import core.stdc.stdio: puts;\n\nvoid main() {\n    string s = &quot;Am I &quot; ~ &quot;reliable?&quot;;\n    puts(s.ptr);\n}<\/pre>\n<p>The above very much looks like appending multiple string literals in an initialization or assignment is just as reliable as using a single string literal. 
We can strengthen that assumption with the following:<\/p>\n<pre class=\"prettyprint lang-d\">import std.stdio : writeln;\n\nvoid main() {\n    writeln(&quot;Am I reliable?&quot;.ptr);\n\n    string s = &quot;Am I &quot; ~ &quot;reliable?&quot;;\n    writeln(s.ptr);\n}<\/pre>\n<p><code>writeln<\/code> is a templated function that recognizes when it&#8217;s being given a pointer; rather than treating it as a string and printing what it points to, it prints the pointer&#8217;s value. So we can print memory addresses in D without a format string.<\/p>\n<p>Compiling the above, again on 64-bit Linux:<\/p>\n<pre>dmd -ofdmd-rely rely.d\ngdc -o gdc-rely rely.d\nldc2 -ofldc-rely rely.d<\/pre>\n<p>Now let&#8217;s execute them all:<\/p>\n<pre>.\/dmd-rely\n562363F63010\n562363F63030\n\n.\/gdc-rely\n5566145E0008\n5566145E0008\n\n.\/ldc-rely\n55C63CFB461C\n55C63CFB461C<\/pre>\n<p>We see that <code>dmd-rely<\/code> prints two different addresses, but they&#8217;re very close together. Both <code>gdc-rely<\/code> and <code>ldc-rely<\/code> print a single address in both cases. And if we make use of <code>readelf<\/code> as we did with the <code>memloc<\/code> example above, we&#8217;ll find that, in every case, the literals are in the read-only data segment. Case closed!<\/p>\n<p>Well, not so fast.<\/p>\n<p>What&#8217;s happening is that all three compilers are performing an optimization <a href=\"https:\/\/en.wikipedia.org\/wiki\/Constant_folding\">known as <em>constant folding<\/em><\/a>. In short, they can recognize when all operands of an append expression are compile-time constants, so they can perform the append at compile-time to produce a single string literal. In this case, the net effect is the same as <code>s = &quot;Am I reliable?&quot;<\/code>. LDC and GDC go further and recognize that the resulting literal is identical to the one used earlier, so they reuse the existing literal&#8217;s address (<a href=\"https:\/\/en.wikipedia.org\/wiki\/String_interning\">a.k.a. 
string interning<\/a>). (Note that DMD also performs string interning, but currently it only kicks in when a string literal appears more than twice.)<\/p>\n<p>To be clear: this only works because all of the operands are string literals. No matter how many string literals are involved in an operation, if even one operand is a variable, then the operation triggers a GC allocation.<\/p>\n<p>Although we see that the result of an append operation involving string literals can be passed directly to C just fine, and we&#8217;ve proven that it&#8217;s stored in read-only memory alongside its NUL terminator, <em>this is not something we should consider reliable<\/em>. It&#8217;s an optimization that no compiler is required to perform. Though it&#8217;s unlikely that any of the three major D compilers will suddenly stop constant folding string literals, a future D compiler could possibly be released without this particular optimization and instead trigger a GC allocation.<\/p>\n<p>In short: rely on this at your own risk.<\/p>\n<p><strong>Addendum<\/strong>: Compile <code>rely.d<\/code> on Windows with dmd and the binary will yield some very different output:<\/p>\n<pre>dmd -m64 -ofwin-rely.exe rely.d\n.\/win-rely\n7FF76701D440\n7FF76702BB30<\/pre>\n<p>There is a much bigger difference in the memory addresses here than in the dmd binary on Linux. We&#8217;re dealing with the PE\/COFF format in this case, and I&#8217;m not familiar with anything similar to <code>readelf<\/code> for that format on Windows. But I do know a little something about <a href=\"https:\/\/www.agner.org\/optimize\/#objconv\">Agner Fog&#8217;s <code>objconv<\/code> utility<\/a>. Not only does it convert between object file formats, but it can also disassemble them:<\/p>\n<pre>objconv -fasm win-rely.obj<\/pre>\n<p>This produces a file, <code>win-rely.asm<\/code>. Open it in a text editor and search for a portion of the string, e.g., <code>&quot;I rel&quot;<\/code>. 
You&#8217;ll find the two entries aren&#8217;t too far apart, but one is located in a block of text under this heading:<\/p>\n<blockquote><p>\nrdata SEGMENT PARA &#8216;CONST&#8217;       ; section number 4\n<\/p><\/blockquote>\n<p>And the other under this heading:<\/p>\n<blockquote><p>\n.data$B SEGMENT PARA &#8216;DATA&#8217;        ; section number 6\n<\/p><\/blockquote>\n<p>In other words, one of them is in the read-only data segment (<code>rdata SEGMENT PARA 'CONST'<\/code>), and the other is in the regular data segment. This goes back to what I mentioned earlier about the D spec being explicitly silent on where string literals are stored. Regardless, the behavior of the program on Windows is the same as it is on Linux; the second call to <code>puts<\/code> doesn&#8217;t blow anything up because the NUL terminator is still there, one slot past the last character. But it doesn&#8217;t change the fact that constant folding of appended string literals is an optimization and only to be relied upon at your own risk.<\/p>\n<h3 id=\"conclusion\">Conclusion<\/h3>\n<p>This post provides all that&#8217;s needed for many of the use cases encountered with strings when interacting with C from D, but it&#8217;s not the complete picture. In Part Two, we&#8217;ll look at how mutability, immutability, and constness come into the picture, how to avoid a potential problem spot that can arise when passing GC-allocated D strings to C, and how to get D strings from C strings. We&#8217;ll save encoding for Part Three.<\/p>\n<p><em>Thanks to Walter Bright, Ali \u00c7ehreli, and Iain Buclaw for their valuable feedback on this article.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>This post is part of an ongoing series on working with both D and C in the same project. The previous two posts looked into interfacing D and C arrays. Here, we focus on a special kind of array: strings. 
Readers are advised to read Arrays Part One and Arrays Part Two before continuing with [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[26,29,30],"tags":[],"_links":{"self":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/2878"}],"collection":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/comments?post=2878"}],"version-history":[{"count":10,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/2878\/revisions"}],"predecessor-version":[{"id":2889,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/2878\/revisions\/2889"}],"wp:attachment":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/media?parent=2878"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/categories?post=2878"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/tags?post=2878"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}