Snowflake Strings

Posted on

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century.


Consider the following D code in file.d:

int foo(int i) {
    assert(i < 3);
    return i;
}

This is equivalent to the C code:

#include <assert.h>

int foo(int i) {
    assert(i < 3);
    return i;
}

The assert() in the D code is “lowered” (i.e. rewritten by the compiler) to the following:

(i < 3 || _d_assertp("file.d", 2))

We’re interested in how the compiler writes that string literal, "file.d" to the generated object file. The most obvious implementation is to write the characters into the data section and push the address of that to call _d_assertp().

Indeed, that does work, and it’s tempting to stop there. But since we’re professional compiler nerds obsessed with performance, details, etc., there’s a lot more oil in that olive (to borrow one of Andrei’s favorite sayings). Let’s put it in a press and start turning the screw, because assert()s are used a lot.

First off, string literals are immutable (they were originally mutable in C and C++, but are no longer, and D tries to learn from such mistakes). This suggests the string can
be put in read-only memory. What advantages does that deliver?

  • Read-only memory is, well, read only. Attempts to write to it are met with a seg fault exception courtesy of the CPUs memory management logic. Seg faults are a bit graceless, like a cop pulling you over for driving on the wrong side of the road, but at least there wasn’t a head-on collision or corruption of the program’s memory.
  • Read-only pages are never swapped out by the virtual memory system; they never get marked as “dirty” because they are never written to. They may get discarded and reloaded, but that’s much less costly.
  • Read-only memory is safe from corruption by malware (unless the malware infects the MMU, sigh).
  • Read-only memory in a shared library is shared – copies do not have to be made for each user of the shared library.
  • Read-only memory does not need to be scanned for pointers to the heap by the garbage collection system (if the application does GC).

Essentially, shoving as much as possible into read-only memory is good for performance, size and safety. There’s the first drop of oil.

The next issue crops up as soon as there’s more than one assert:

int foo(int i, int j) {
    assert(i < 3);
    assert(j & 1);
    return i + j;
}

The compiler emits two copies of "file.d" into the object file. Since they’re identical, and read-only, it makes sense to only emit one copy:

string TMP = "file.d";
int foo(int i, int j) {
    (i < 3 || _d_assertp(TMP, 2))
    (j & 1 || _d_assertp(TMP, 3))
    return i + j;
}

This is called string pooling and is fairly simple to implement. The compiler maintains a hash table of string literals and their associated symbol names (TMP in this case).

So far, this is working reasonably well. But when asserts migrate into header files, macros, and templates, the same string can appear in lots of object files, since the compiler doesn’t know what is happening in other object files (the separate compilation model). Other string literals can exhibit this behavior, too, when generic coding practices are used. There needs to be some way to present these in the object file so the linker can pool identical strings.

The dmd D compiler currently supports four different object file formats on different platforms:

  • Elf, for Linux and FreeBSD
  • Mach-O, for OSX
  • MS-COFF, for Win64
  • OMF, for Win32

Each does it in a different way, with different tradeoffs. The methods tend to be woefully under documented, and figuring this stuff out is why I get paid the big bucks.

Elf

Elf turns out to have a magic section just for this purpose. It’s named .rodata.strM.N where N is replace by the number of bytes a character has, and M is the alignment. For good old char strings, that would be .rodata.str1.1. The compiler just dumps the strings into that section, and the Elf linker looks through it, removing the redundant strings and adjusting the relocations accordingly. It’ll handle the usual string types – char, wchar, and dchar – with aplomb.

There’s just a couple flaws. The end of a string is determined by a nul character. This means that strings cannot have embedded nuls, or the linker will regard them as multiple strings and shuffle them about in unexpected ways. One cannot have relocations in those sections, either. This means it’s only good for C string literals, not other kinds of data.

This poses a problem for D, where the strings are length-delineated strings, not nul-terminated ones. Does this mean D is doomed to being unable to take advantage of the C-centric file formats and linker design? Not at all. The D compiler simply appends a nul when emitting string literals. If the string does have an embedded nul (allowed in D), it is not put it in these special sections (and the benefit is lost, but such strings are thankfully rare).

Mach-O

Mach-O uses a variant of the Elf approach, a special section named __cstring. It’s more limited in that it only works with single byte chars. No wchar_ts for you! If there ever was confirmation that UTF-16 and UTF-32 are dead end string types, this should be it.

MS-COFF

Microsoft invented MS-COFF by extending the old Unix COFF format. It has many magic sections, but none specifically for strings. Instead, it uses what are called COMDAT sections, one for each string. COMDATs are sections tagged with a unique name, and when the linker is faced with multiple COMDATs with the same name, one is picked and all references to the other COMDATs are rewritten to refer to the Anointed One. COMDATs first appeared in object formats with the advent of C++ templates, since template code generation tends to generate the same code over and over in separate files.

(Isn’t it interesting how object file formats are driven by the needs of C and C++?)

The COMDAT for "hello" would look something like this:

??_C@_05CJBACGMB@hello?$AA@:
db 'h', 'e', 'l', 'l', 'o', 0

The tty noise there is the mangled name of the COMDAT which is generated from the string literal’s contents. The algorithm must match across compilation units, as that is how the linker decides which ones are the same (experimenting with it will show that the substring CJBACGMB is some sort of hash). Microsoft’s algorithm for the mangling and hash is undocumented as far as I can determine, but it doesn’t matter anyway, it only has to have a 1:1 mapping between name and string literal. That screams “MD5 hash” to me, so that’s what dmd does. The name is an MD5 hash of the string literal contents, which also has the nice property that no matter how large the string gets, the identifier doesn’t grow.

COMDATs can have anything stuffed in them, so this is a feature that is usable for a lot more than just strings.

The downside of the COMDAT scheme is the space taken up by all those names, so shipping a program with the debug symbols in it could get much larger.

OMF

The caboose is OMF, an ancient format going back to the early 80’s. It was extended with a kludgy COMDAT system to support C++ just before most everyone abandoned it. DMD still emits it for Win32 programs. We’re stuck with it because it’s the only format the default linker (OPTLINK) understands, and so we find a way to press it into service.

Since it has COMDATs, that’s the mechanism used. The wrinkle is that COMDATs are code sections or data sections only; there are no other options. We want it to be read-only, so the string COMDATs are emitted as code sections (!). Hey, it works.

Conclusion

I don’t think we’ve pressed all the oil out of that olive yet. It may be like memcpy, where every new crop of programmers thinks of a way to speed it up.

I hope you’ve enjoyed our little tour of literals, and may all your string literals be unique snowflakes.

Thanks to Mike Parker for his help with this article.

A New Import Idiom

Posted on

Daniel Nielsen is an Embedded Software Engineer. He is currently using D in his spare time for an unpublished Roguelike and warns that he “may produce bursts of D Evangelism”.


I remember one day in my youth, before the dawn of Internet, telling my teachers about “my” new algorithm, only to learn it had been discovered by the ancient Greeks in ~300 BC. This is the story of my life and probably of many who are reading this. It is easy to “invent” something; being the first, not so much!

Anyway, this is what all the fuss is about this time:

template from(string moduleName)
{
  mixin("import from = " ~ moduleName ~ ";");
}

The TL;DR version: A new idiom to achieve even lazier imports.

Before the C programmers start running for the hills, please forget you ever got burned by C++ templates. The above snippet doesn’t look that complicated, now does it? If you enjoy inventing new abstractions, take my advice and give D a try. Powerful, yet an ideal beginner’s language. No need to be a template archwizard.

Before we proceed further, I’d like to call out Andrei Alexandrescu for identifying that there is a problem which needs solving. Please see his in depth motivation in DIP 1005. Many thanks also to Dominikus Dittes Scherkl, who helped trigger the magic spark by making his own counter proposal and questioning if there really is a need to change the language specification in order to obtain Dependency-Carrying Declarations (DIP 1005).

D, like many modern languages, has a fully fledged module system where symbols are directly imported (unlike the infamous C #include). This has ultimately resulted in the widespread use of local imports, limiting the scope as much as possible, in preference to the somewhat slower and less maintainable module-level imports:

// A module-level import
import std.datetime;
  
void fun(SysTime time)
{
  import std.stdio; // A local import
  ...
}

Similar lazy import idioms are possible in other languages, for instance Python.

The observant among you might notice that because SysTime is used as the type of a function parameter, std.datetime must be imported at module level. Which brings us to the point of this blog post (and DIP 1005). How can we get around that?

void fun(from!"std.datetime".SysTime time)
{
  import std.stdio;
  ...
}

There you have it, the Scherkl-Nielsen self-important lookup.

In order to fully understand what’s going on, you may need to learn some D-isms. Let’s break it down.

  1. When instantiating a template (via the ! operator), if the TemplateArgument is one token long, the parentheses can be omitted from the template parameters. So from!"std.datetime" is the same as from!("std.datetime"). It may seem trivial, but you’d be surprised how much readability is improved by avoiding ubiquitous punctuation noise.
  2. Eponymous templates. The declaration of a template looks like this:
    template y() {
        int x;
    }

    With that, you have to type y!().x in order to reach the int. Oh, ze horror! Is that a smiley? Give me x already! That’s exactly what eponymous templates accomplish:

    template x() {
        int x;
    }

    Now that the template and its only member have the same name, x!().x can be shortened to simply x.

  3. Renamed imports allow accessing an imported module via a user-specified namespace. Here, std.stdio is imported normally:
    void printSomething(string s) {
        import std.stdio;
        writeln(s);           // The normal way
        std.stdio.writeln(s)  // An alternative using the fully qualified 
                              // symbol name, for disambiguation
    }

    Now it’s imported and renamed as io:

    void printSomething(string s) {
        import io = std.stdio;
        io.writeln(s);         // Must be accessed like this.
        writeln(s);            // Error
        std.stdio.writeln(s);  // Error
    }

    Combining what we have so far:

    template dt() {
        import dt = std.datetime; 
    }
    void fun(dt!().SysTime time) {}

    It works perfectly fine. The only thing which remains is to make it generic.

  4. String concatenation is achieved with the ~ operator.
    string hey = "Hello," ~ " World!";
    assert(hey == "Hello, World!");
  5. String mixins put the power of a compiler writer at your fingertips. Let’s generate code at compile time, then compile it. This is typically used for domain-specific languages (see Pegged for one prominent use of a DSL in D), but in our simple case we only need to generate one single statement based on the name of the module we want to import. Putting it all together, we get the final form, allowing us to import any symbol from any module inline:
    template from(string moduleName)
    {
      mixin("import from = " ~ moduleName ~ ";");
    }

In the end, is it all really worth the effort? Using one comparison made by Jack Stouffer:

import std.datetime;
import std.traits;

void func(T)(SysTime a, T value) if (isIntegral!T)
{
    import std.stdio : writeln;
    writeln(a, value);
}

Versus:

void func(T)(from!"std.datetime".SysTime a, T value)
    if (from!"std.traits".isIntegral!T)
{
    import std.stdio : writeln;
    writeln(a, value);
}

In this particular case, the total compilation time dropped to ~30% of the original, while the binary size dropped to ~41% of the original.

What about the linker, I hear you cry? Sure, it can remove unused code. But it’s not always as easy as it sounds, in particular due to module constructors (think __attribute__((constructor))). In either case, it’s always more efficient to avoid generating unused code in the first place rather than removing it afterwards.

So this combination of D features was waiting there to be used, but somehow no one had stumbled on it before. I agreed with the need Andrei identified for Dependency-Carrying Declarations, yet I wanted even more. I wanted Dependency-Carrying Expressions. My primary motivation comes from being exposed to way too much legacy C89 code.

void foo(void)
{
#ifdef XXX /* needed to silence unused variable warnings */
  int x;
#endif
... lots of code ...
#ifdef XXX
  x = bar();
#endif
}

Variables or modules, in the end they’re all just symbols. For the same reason C99 allowed declaring variables in the middle of functions, one should be allowed to import modules where they are first used. D already allows importing anywhere in a scope, but not in declarations or expressions. It was with this mindset that I saw Dominikus Dittes Scherkl’s snippet:

fun.ST fun()
{
   import someModule.SomeType;
   alias ST = SomeType;
   ...
}

Clever, yet for one thing it doesn’t adhere to the DRY principle. Still, it was that tiny dot in fun.ST which caused the spark. There it was again, the Dependency-Carrying Expression of my dreams.

Criteria:

  • It must not require repeating fun, since that causes problems when refactoring
  • It must be lazy
  • It must be possible today with no compiler updates

Templates are the poster children of lazy constructs; they don’t generate any code until instantiated. So that seemed a good place to start.

Typically when using eponymous templates, you would have the template turn into a function, type, variable or alias. But why make the distinction? Once again, they’re all just symbols in the end. We could have used an alias to the desired module (see Scherkl’s snippet above); using the renamed imports feature is just a short-cut for import and alias. Maybe it was this simplified view of modules that made me see more clearly.

Now then, is this the only solution? No. As a challenge to the reader, try to figure out what this does and, more importantly, its flaw. Can you fix it?

static struct STD
{
  template opDispatch(string moduleName)
  {
    mixin("import opDispatch = std." ~ moduleName ~ ";");
  }
}