Monthly Archives: December 2017

D’s Newfangled Name Mangling

Rainer Schuetze is the creator and maintainer of Visual D, the D plugin for Visual Studio. Recently, he implemented a new name mangling algorithm for the D frontend, which was released in DMD 2.077.0. In this post, he explains why it was needed and what it does.


What is symbol name mangling?

D embraces the separate compilation model that compiles D source code to object files and uses a linker to bind the object files to an executable binary file. This allows the reuse of precompiled object files and libraries, speeding up the build process. As the linker is usually one that’s also used for other languages with the same compilation model, e.g. C/C++ or Fortran, mixing object files from different languages is straightforward.

In an object file, a symbol name is assigned to each function or global variable, both when it is defined and when it is used via a call or access. The linker uses these names to connect references to definitions of the same name with only very bare knowledge about the symbol. For example, the symbol for this C function declaration,

extern(C) const(char)* find(int ch, const(char)* str);

does not tell the linker anything about function arguments or return type, as the C language uses the plain function name find as the symbol name (some platforms prepend a _ to the symbol). If you later change the order of the arguments to

extern(C) const(char)* find(const(char)* str, int ch);

but fail to update and recompile all source files that use the new declarartion, the linker will happily bind the resulting object files. In that case, the program is likely to crash, since a character passed to the function will be interpreted as a string pointer and vice versa.

D and C++ avoid this problem by adding more information to the symbol name, i.e. they encode into a symbol name the scope in which the symbol is defined, the function argument types, and the return type. Even if the linker does not interpret this information, linking fails with an undefined symbol error if the definitions used to build the object files don’t match. For example, the D function declaration

module test;
extern(D) const(char)* find(int ch, const(char)* str);

has a symbol name _D4test4findFiPxaZPxa, where _D is a prefix to identify the symbol as being generated from a D source symbol, 4test4find encodes the “fully qualified name” find in module test, and FiPxaZPxa describes the function type with an integer argument (designated by i) and the C-style string pointer type Pxa by just concatenating the encodings for argument types. Z terminates the function argument list and is followed by the encoding for the return type, again Pxa for a C-style string pointer. In contrast,

extern(D) const(char)* find(const(char)* str, int ch);

is encoded as _D4test4findFPxaiZPxa, making it a different symbol with the argument types reversed. The encoding ensures a normalized representation of types and scopes while also providing shorter symbols than minimal source code. This encoding is called “name mangling”.

Ed: Note that extern(C) and extern(D) are linkage attributes. When a function is declared in D without an explicit linkage attribute, extern(D) is the default.

In D, some function attributes are also mangled into the symbol name, e.g. @safe, nothrow, pure and @nogc. In theory, mangling could also cover parameter names, user defined attributes, or even contracts, but that is currently considered excessive.

Please note that even though name mangling can detect some mismatches in the binary interface of functions (i.e. how arguments are passed in registers or on the stack), it won’t catch every error; for example, structs, classes and other user defined types are mangled by name only, so that a change to their definition will still pass unnoticed by the linker.

The mangled name of a symbol is also available during compilation using the .mangleof property. This used to be exploited to provide type reflection of the symbol at compile time. This should no longer be necessary due to the introduction of new __traits that make this information accessible faster and more convenient, for example,

__traits(getLinkage,symbol);

or

__traits(getFunctionAttributes, symbol);

Thus, usage of .mangleof is not recommended except for debugging purposes.

When reversing the mangling process in the “demangler”, all the encoded information is kept to make it available to the user, but that does not always yield correct D syntax. The first definition above demangles as

const(char)* test.find(int, const(char)*)

i.e. the module name test is added to the function name.

Template symbols

The two definitions of find shown above can coexist in D and C++, so name mangling is not only a way to detect errors at link time but also a necessity to represent overloads. It should at least contain enough information to distinguish different overloads of the same scoped identifier.

This becomes even more obvious when considering templates that usually instantiate different functions or variable definitions for each argument type. In D, the template instantiation information is added to the qualified name of a symbol.

Consider expression templates, a common example of meta programming used for delayed evaluation of expressions:

module expr;
struct Mul(X,Y)
{
    X x;
    Y y;
}
struct Add(X,Y)
{
    X x;
    Y y;
}

auto mul(X,Y)(X x, Y y) { return Mul!(X,Y)(x, y); }
auto add(X,Y)(X x, Y y) { return Add!(X,Y)(x, y); }

A function template is lowered by the compiler to an eponymous template:

template mul(X, Y)
{
    auto mul(X x, Y y) { return Mul!(X,Y)(x, y); }
}

The template name is part of the qualified function name, expr.mul!(X,Y).mul, and the auto return type is inferred to be Mul!(X,Y). This causes the symbol to reference the types X and Y three times. The demangled mangled name of an instantiation with types double and float of this template is

expr.Mul!(double,float) expr.mul!(double,float).mul(double,float)

The mangling process of DMD before version 2.077 walks the abstract syntax tree of the declaration and emits the mangled representation of the types whenever it is hit. Now consider stacking operations, e.g.

auto square(X)(X x) { return mul(x, x); }

auto len = square("var");
pragma(msg, len.square.mangleof);
// S4expr66__T3MulTS4expr16__T3MulTAyaTAyaZ3MulTS4expr16__T3MulTAyaTAyaZ3MulZ3Mul

pragma(msg, typeof(len).mangleof.length);
pragma(msg, len.square.mangleof.length);
pragma(msg, len.square.square.mangleof.length);
pragma(msg, len.square.square.square.mangleof.length);
pragma(msg, len.square.square.square.square.mangleof.length);
pragma(msg, len.square.square.square.square.square.mangleof.length);
pragma(msg, len.square.square.square.square.square.square.mangleof.length);

With DMD 2.076 or earlier, this displays 28u, 78u, 179u, 381u, 785u, 1594u, 3212u, showing exponential growth of the mangled symbol name length even though the expression in the source code just grows linearly. This happens because types like Mul!(Mul!(string, string), Mul!(string, string)) are combined and the mangling repeats their full representation every time they are referenced.

Create a chain of 12 calls to square above and the symbol length increases to 207,114. Even worse, the resulting object file for COFF/64-bit is larger than 15 MB and the time to compile increases from 0.1 seconds to about 1 minute. Most of that time is spent generating code for functions only used at compile time.

Voldemort types returned from template functions can be similar, as they carry the function signature including the template arguments as part of the type name. These can also show a dramatic increase in build times without generating as much code as in the example.

Symbol compression to the rescue

In early 2016, a couple of attempts were made to shorten these long symbols:

  • cut off symbol names if they exceed a given threshold, but append a checksum of the full symbol instead. This was already done with an MD5 hash when emitting symbols for the DigitalMars C compiler tool chain as the OMF object file format does not allow symbols longer than 255 characters. The downside to this is that these symbols can no longer be demangled, so that symbols in linker messages cannot be translated back into human digestible names.
  • apply binary compression to the symbol name. Standard techniques use part of the full symbol name as a dictionary to encode repetitions within the name. This is usually done by encoding a position-length pair using characters outside the normal identifier set. Again, this is already in use when DMD tries to fit symbols into the OMF limit of 255 characters (before applying the MD5 hash trick), but it also has shown some disadvantages: when using characters above the ASCII range, this interferes with UTF8 encoded characters that are also allowed as symbol characters in the D language. It can also break linker output as the console might misinterpret it as a locale-specific character encoding. Avoiding this by applying a binary to ASCII conversion like base64 to the symbols would obfuscate the actual symbol name even more.
  • extend the mangling grammar by allowing references to entities already encoded. This is similar to binary compression, but does not need to encode match length as the entities have terminators embedded into the grammar. The most prominent entities are types. This is the road C++ has taken, as it is also affected by the issues described here. In C++, name mangling is not standardised by the language, but by the compiler or the platform. GNU g++ uses Itanium C++ ABI mangling, which does a pretty good job with C++ code similar to the example above or in the Voldemort issue. Even though Microsoft’s Visual C++ can encode recurring types as well, it still generates very long names because it limits the encoding to the first 10 types of an argument list.

The first attempts at applying the latter scheme to the mangling of D symbols showed disappointing results. As it turned out, these implementations missed a subtle detail of the mangler in the DMD front end at that time; it reused cached representations of mangled type names to combine them to more complex types. This fails to find repetitions of the types from which a cached type name was built.

This is where I stepped in to create a proof-of-concept version of the mangling without these omissions. Early results were promising, so I looked for more opportunities to reduce symbol length:

  • with fully qualified names always containing the package and module names of a symbol, identifiers tend to appear often in a mangled name.
  • qualified names are likely to come from the same module or package, so it would be nice to encode them as a single entity.

The unit tests of the Phobos runtime library are benchmark candidates, as they contain a lot of symbols for template-heavy code. At the given time there were 127,172 symbols found in the map file of the Windows build. These were the results of the different manglings:

back references max length average length
none 416133 369
types 2095 157
types+identifiers 1263 128
types+identifiers+qualified names 1114 117

(This has been measured with the implementation at the time, which is not exactly the same as the final mangling; it used special characters for the different back reference types, but this turned out not to be a good idea. The D mangling is supposed to be the same on all platforms, but these characters will have a special meaning to the linker on one of them.)

It’s rather simple in DMD to determine the identity of identifiers and types, as the latter are merged according to their mangling anyway. Qualified names and their associated symbols turned out to introduce a number of complications, though. Namely, mangleFunc in core.demangle allows building a mangled name of a function from a fully qualified name given as a string function argument and a type specified as a template argument. Implementing this for run-time usage requires copying the full mangling machinery and introspection capabilities of the compiler, which is unrealistic. Considering the limited benefit shown by the above Phobos statistics, the idea of encoding qualified names was dropped.

Here are some details about the new mangling:

  • Back references are now encoded by the character Q, followed by the relative position of the original appearance of the same identifier or type. These positions are encoded with respect to base 26, with the last digit encoded by a lowercase letter and the other digits encoded by an uppercase letter. That way, most back references are 2 or 3 characters long, 4 in extreme cases. Using a different encoding for the last digit allows determining the end of a number without looking at the next character. This helps to avoid ambiguities. (The Itanium C++ ABI mangling uses base 36 encoding by combining numbers and letters, but need a termination character _.)
  • Counting encodable entities as in the C++ mangling would result in slightly shorter mangled names, but needs the mangler to keep a dynamic list of respective positions. The current demangler is designed not to allocate as long as the supplied output buffer is large enough.
  • Relative positions are chosen instead of absolute positions to allow prepending the _D prefix without having to re-encode the symbol. Some platforms also prepend an additional underscore, for which the relative positions are agnostic.
  • The mangling grammar sometimes allows types and identifiers at the same position, so a demangler needs to distinguish between the two even if given by a back reference. That’s why a lookup to the referenced position is necessary to continue demangling; an identifier always starts with a number, while a type always starts with a letter.
  • Using Q for back references grabs the last free letter used to encode types, but there is at least one type defined in the mangling grammar that is not supposed to appear in a mangling anyway (namely TypeIdent), so it can be resurrected if the necessity appears.

For example, the expression template type shown above now mangles as

pragma(msg, len.square.mangleof);
// S4expr__T3MulTSQo__TQlTAyaTQeZQvTQtZQBb
//                ^^   ^^     ^^ ^^ ^^ ^^^ decode to:
//                |    |      |  |  |  |
//                |    |      |  |  |  +- 3Mul
//                |    |      |  |  +---- S4expr__T3MulTAyaTAyaZ3Mul
//                |    |      |  +------- 3Mul
//                |    |      +---------- Aya
//                |    +----------------- 3Mul
//                +---------------------- 4expr

with a length of 39 instead of 78 without back references. The resulting sizes are 23, 39, 57, 76, 95, 114, 133 showing linear growth. The chain of 12 calls to square shrinks from 207,114 characters to 247, i.e. by a factor of more than 800.

Implementing mangleFunc mentioned above for the mangling with back referencing identifiers still is not obvious; while the fully qualified name is not supposed to contain any types (e.g. as a struct template argument) identifiers in the mangled name can appear again in the function type. This was solved by extending the demangler to use “Design by Introspection” (DbI) (as coined by Andrei Alexandrescu):

  • make the Demangle struct a template that parameterizes on a struct that supplies a couple of hooks
    struct NoHooks {}  // supports: static bool parseLName(ref Demangle); ...
    private struct Demangle(Hooks = NoHooks)
    {
    Hooks hooks;
        // ...
        void parseLName()
        {
            static if(__traits(hasMember, Hooks, "parseLName"))
                if (hooks.parseLName(this))
                    return;
                // normal decode...
        }
    }
  • create a hook that replaces a reoccurring identifier with the appropriate back reference
    struct RemangleHooks
    {
        char[] result;
        size_t[const(char)[]] idpos;
        // ...
        bool parseLName(ref Demangler!RemangleHooks d)
        {
            // flush input so far to result[]
            if (d.front == 'Q')
            {
                // re-encode back reference...
            }
            else if (auto ppos = currentIdentifier in idpos)
            {
                // encode back reference to identifier at *ppos
            }
            else
            {
                idpos[currentIdentifier] = currentPos;
            }
            return true;
        }
    }
  • combine the qualified name and the type as before (core.demangle is still capable of decoding it) and run it through the hooked demangler
    char[] mangleFunc(FuncType)(const(char)[] qualifiedName)
    {
        const(char)mangledQualifiedName = encodeLNames(qualifiedName);
        const(char)mangled = mangledQualifiedName ~ FuncType.mangleof;
        auto d = Demangle!RemangleHooks(mangled, null);
        d.mute = true; // no demangled output
        d.parseMangledName();
        return d.hooks.result;
    }

Is the new mangling sound?

The back references encoded into the mangling extend the existing mangling. Unfortunately, the latter had ambiguities reported to the D issue tracking system, with more of these likely yet to be uncovered. The demangler in core.demangle rejected about 3% of the unmodified symbols from the Phobos unit tests, while 15% were demangled only partially.

It’s tough to verify the soundness of an addition to an already complex and fragile definition, as a change to the mangling would need an update to the tooling (demangler, debuggers). Anyway, it was a good opportunity to get rid of these, too.

So scrutiny of the existing definition was required. To do this mechanically, the mangling specification from the web site was converted into a grammar digestible by the bison parser generator. Bison can create LALR(1) parser tables, which basically means that, while scanning a mangled symbol, looking at a character and its successor is enough to determine whether the character adds to a previous entity or starts a new one. When conflicts are reported when processing a grammar, they might be resolvable with a larger context, but they can also hint at actual problems or undesirable complexity. Adding pseudo-tokens representing handcrafted parser control flow can avoid these conflicts.

This gist shows a grammar for the D mangling scheme without the back references. It still has a couple of conflicts when run through Bison, one of which was determined to be an actual ambiguity in the definition. Adding back references to
the grammar doesn’t add any conflicts.

In addition, core.demangle was fixed to work for all symbols but those exposing the known ambiguities.

Aftermath

Some of the implementations in std.traits used the mangling of a symbol to introspect compile-time properties, for example, to determine the linkage. This was done using a simplified demangler. With the introduction of back references, these
didn’t work any more except for simple symbol names. Using a solution as for core.mangleFunc is feasible, but can slow down compilation considerably as the demangling needs to be executed via CTFE. Fortunately, new __traits have been added which cover all information that can be found in the mangling.

While most users will not notice any changes to their programs other than smaller object and executable file sizes, the new mangling can be a breaking change to external tools like the linker or a debugger. These will continue to work, but until they are updated, be prepared to eventually see the new mangled names for a little while instead of the demangled ones.

Thanks to Mike Parker, Walter Bright and Steven Schveighoffer for review.

Interfacing D with C: Getting Started

One of the early design goals behind the D programming language was the ability to interface with C. To that end, it provides ABI compatibility, allows access to the C standard library, and makes use of the same object file formats and system linkers that C and C++ compilers use. Most built-in D types, even structs, are directly compatible with their C counterparts and can be passed freely to C functions, provided the functions have been declared in D with the appropriate linkage attribute. In many cases, one can copy a chunk of C code, paste it into a D module, and compile it with minimal adjustment. Conversely, appropriately declared D functions can be called from C.

That’s not to say that D carries with it all of C’s warts. It includes features intended to eliminate, or more easily avoid, some of the errors that are all too easy to make in C. For example, bounds checking of arrays is enabled by default, and a safe subset of the language provides compile-time enforcement of memory safety. D also changes or avoids some things that C got wrong, such as what Walter Bright sees as C’s biggest mistake: conflating pointers with arrays. It’s in these differences of implementation that surprises lurk for the uninformed.

This post is the first in a series exploring the interaction of D and C in an effort to inform the uninformed. I’ve previously written about the basics of this topic in an article at GameDev.net, and in my book, ‘Learning D’, where the entirety of Chapter 9 covers it in depth.

This blog series will focus on those aforementioned corner cases so that it’s not necessary to buy the book or to employ trial and error in order to learn them. As such, I’ll leave the basics to the GameDev.net article and recommend that anyone interfacing D with C (or C++) give it a read along with the official documentation.

The C and D code that I provide to highlight certain behavior is intended to be compiled and linked by the reader. The code demonstrates both error and success conditions. Recognizing and understanding compiler errors is just as important as knowing how to fix them, and seeing them in action can help toward that end. That implies some prerequisite knowledge of compiling and linking C and D source files. Happily, that’s the focus of the next section of this post.

For the C code, we’ll be using the Digital Mars C/C++ and Microsoft C/C++ compilers on Windows, and GCC and Clang elsewhere. On the D side, we’ll be working exclusively with the D reference compiler, DMD. Windows users unfamiliar with setting up DMD to work with the Microsoft tools will be well served by the post on this blog titled, ‘DMD, Windows, and C’.

We’ll finish the post with a look at one of the corner cases, one that is likely to rear its head early on in any exploration of interfacing D with C, particularly when creating bindings to existing C libraries.

Compiling and linking

The articles in this series will present example C source code that is intended to be saved and compiled into object files for linking with D programs. The command lines for generating the object files look pretty much the same on every platform, with a couple of caveats. We’ll look first at Windows, then lump all the other supported systems together in a single section.

In the next two sections, we’ll be working with the following C and D source files. Save them in the same directory (for convenience) and make sure to keep the names distinct. If both files have the same name in the same directory, then the object files created by the C compiler and DMD will also have the same name, causing the latter to overwrite the former. There are compiler switches to get around this, but for a tutorial we’re better off keeping the command lines simple.

chello.c

#include <stdio.h>
void say_hello(void) 
{
    puts("Hello from C!");
}

hello.d

extern(C) void say_hello();
void main() 
{
    say_hello();
}

The extern(C) bit in the declaration of the C function in the D code is a linkage attribute. That’s covered by the other material I referenced above, but it’s a potential gotcha we’ll look at later in this series.

Windows

The official DMD packages for Windows, available at dlang.org as a zip archive and an installer, are the only released versions of DMD that do not require any additional tooling to be installed as a prerequisite to compile D files. These packages ship with everything they need to compile 32-bit executables in the OMF format (again, I refer you to ‘DMD, Windows, and C’ for the details).

When linking any foreign object files with a D program, it’s important that the object file format and architecture match the D compiler output. The former is an issue primarily on Windows, while attention must be paid to the latter on all platforms.

Compiling C source to a format compatible with vanilla DMD on Windows requires the Digital Mars C/C++ compiler. It’s a free download and ships with some of the same tools as DMD. It outputs object files in the OMF format. With both it and DMD installed and on the system path, the above source files can be compiled, linked, and executed like so:

dmc -c chello.c
dmd hello.d chello.obj
hello

The -c option tells DMC to forego linking, causing it to only compile the C source and write out the object file chello.obj.

To get 64-bit output on Windows, DMC is not an option. In that case, DMD requires the Microsoft build tools on Windows. Once the MS build tools are installed and set up, open the preconfigured x64 Native Tools Command Prompt from the Start menu and execute the following commands (again, see ‘D, Windows, and C’ on this blog for information on how to get the Microsoft build tools and open the preconfigured command prompt, which may have a slightly different name depending on the version of Visual Studio or the MS Build Tools installed):

cl /c chello.c
dmd -m64 hello.d chello.obj
hello

Again, the /c option tells the compiler not to link. To produce 32-bit output with the MS compiler, open a preconfigured x86 Native Tools Command Prompt and execute these commands:

cl /c hello.c
dmd -m32mscoff hello.c chello.obj
hello

DMD recognizes the -m32 switch on Windows, but that tells it to produce 32-bit OMF output (the default), which is not compatible with Microsoft’s linker, so we must use -m32mscoff here instead.

Other platforms

On the other platforms D supports, the system C compiler is likely going to be GCC or Clang, one of which you will already have installed if you have a functioning dmd command. On Mac OS, clang can be installed via XCode in the App Store. Most Linux and BSD systems have a GCC package available, such as via the often recommended command line, apt-get install build-essential, on Debian and Debian-based systems. Please see the documentation for your system for details.

On these systems, the environment variable CC is often set to the system compiler command. Feel free to substitute either gcc or clang for CC in the lines below as appropriate for your system.

CC -c chello.c
dmd hello.d chello.o
./hello

This will produce either 32-bit or 64-bit output, depending on your system configuration. If you are on a 64-bit system and have 32-bit developer tools installed, you can pass -m32 to both CC and dmd to generate 32-bit binaries.

The long way

Now that we’re configured to compile and link C and D source in the same binary, let’s take a look at a rather common gotcha. To fully appreciate this one, it helps to compile it on both Windows and another platform.

One of the features of D is that all of the integral types have a fixed size. A short is always 2 bytes and an int is always 4. This never changes, no matter the underlying system architecture. This is quite different from C, where the spec only imposes relative requirements on the size of each integral type and leaves the specifics to the implementation. Even so, there are wide areas of agreement across modern compilers such that on every platform D currently supports the sizes for almost all the integral types match those in D. The exceptions are long and ulong.

In D, long and ulong are always 8 bytes across all platforms. This never changes. It lines up with the corresponding C types just fine on most 64-bit systems under the version(Posix) umbrella, where the C long and unsigned long are also 8 bytes. However, they are 4 bytes on 32-bit architectures. Moreover, they’re always 4 bytes on Windows, even on a 64-bit architecture.

Most C code these days will account for these differences either by using the preprocessor to define custom integral types or by making use of the C99 stdint.h where types such as int32_t and int64_t are unambiguously defined. Yet, it’s still possible to encounter C libraries using long in the wild.

Consider the following C function:

maxval.c

#include <limits.h>
unsigned long max_val(void)
{
    return ULONG_MAX;
}

The naive D implementation looks like this:

showmax1.d

extern(C) ulong max_val();
void main()
{
    import std.stdio : writeln;
    writeln(max_val());
}

What this does depends on the C compiler and architecture. For example, on Windows with dmc I get 7316910580432895, with x86 cl I get 59663353508790271, and 4294967295 with x64 cl. The last one is actually the correct value, even though the size of the unsigned long on the C side is still 4 bytes as it is in the other two scenarios. I assume this is because the x64 ABI stores return values in the 8-byte RAX register, so it can be read into the 8-byte ulong on the D side with no corruption. The important point here is that the two values in the x86 code are garbage because the D side is expecting a 64-bit return value from 32-bit registers, so it’s reading more than it’s being given.

Thankfully, DRuntime provides a way around this in core.c.config, where you’ll find c_long and c_ulong. Both of these are conditionally configured to match the compile-time C runtime implementation and architecture configuration. With this, all that’s needed is to change the declaration of max_val in the D module, like so:

showmax2.d

import core.stdc.config : c_ulong;
extern(C) c_ulong max_val();

void main()
{
    import std.stdio : writeln;
    writeln(max_val());
}

Compile and run with this and you’ll find it does the right thing everywhere. On Windows, it’s 4294967295 across the board.

Though less commonly encountered, core.stdc.config also declares a portable c_long_double type to match any long double that might pop up in a C library to which a D module must bind.

Looking ahead

In this post, we’ve gotten set up to compile and link C and D in the same executable and have looked at the first of several potential problem spots. We used DMD here, but it should be possible to substitute one of the other D compilers (ldc or gdc) without changing the command line (with the exception of -m32mscoff, which is specific to DMD). The next post in this series will focus entirely on getting D arrays and C arrays to cooperate. See you there!