Reducing Template Compile Times

Templates have been enormously profitable for the D programming language. They allow the programmer to generate efficient and correct code at compile time. Long gone are the days of preprocessor macros or handwritten, per-type data structures. D templates, though designed in the shadow of C++ templates, were not made in their image. D makes templates cleaner and more expressive, and also enables patterns like “Design by Introspection”.

Here is a simple example of a template that would require the use of preprocessor macros in C or C++:

template sizeOfTypeByName(string name)
{
enum sizeOfTypeByName = mixin(name, ".sizeof");
}
unittest
{
assert(sizeOfTypeByName!"int" == 4);
}

D’s templates are powerful tools but should not be used unthinkingly. Carelessness could result in long compile times or excessive code generation.

In this blog post, I introduce some simple concepts that can help in writing templates that minimize resource usage. Deeper intuition can also lead to the discovery of new abstractions or increased confidence in existing ones.

The D compiler memoizes template instantiations: if I instantiate MyTemplate!int once, the compiler produces an AST for that instantiation; if I instantiate that exact template again, the previous computation is reused.

As a demonstration, let’s write a generic addition function and use pragma(msg, ...) to print the number of instantiations at compile time. I’m going to use it twice with integers and twice with floating point numbers.

auto genericAdd(T)(const T x, const T y)
{
pragma(msg, "genericAdd instantiated with ", T);
return x + y;
}

// Instantiate with int

// Now for the float type
writeln(genericAdd!float(0.0, float.nan));

Now let’s compile the code and look at what our pragma(msg, ) says about template instantiations in the compiler.

dmd -c generic_add.d

This yields the following output during compilation:

genericAdd instantiated with int
genericAdd instantiated with float

We can see int and float as we expected, but notice that each is only mentioned once. Newcomers to languages with templates or generics can sometimes mistakenly think that using a template requires a potentially expensive instantiation on every use in the source code. For the benefit of those new users of D, the above is categoric proof that this is not the case; you cannot pay twice for templates you have already asked the compiler to instantiate. (You can, however, convince yourself that you are asking the compiler to do something it’s already done when you in fact are not. We’ll go over contrived and real-world examples of this later in this article.)

The benefit of this feature should be obvious, but what may not be obvious is how it can be employed in writing templates. Within the bounds of our desire for ergonomics, we should design the interfaces of our templates to maximize the number of identical instantiations.

What’s in a name?

The following example, adapted from a real-world change to a large D project, yielded a reduction in compile time of a few percent for unit-test builds.

Let’s say we have an expensive template whose behavior we want to test over a simple type. Our type might be:

struct Vector3
{
float x;
float y;
float z;
}

To demonstrate the phenomenon, we don’t have to do anything fancy, so we’ll just declare a stub called send.

// Let's say this sends a value of type T to a database.
void send(T)(T x);

A note on syntax: Given a variable val of type int, this template could be explicitly instantiated as send!(int)(val). However, the compiler can infer the type T, so we can instantiate it as if it were a normal function call as send(val). Using D’s Uniform Function Call Syntax, we could alternatively call it like a property or member, as val.send() (the approach used in the following example), or even val.send, since parentheses are optional in function calls when there are no arguments.

Our test might then be something like:

struct Vector3
{
float x;
float y;
float z;
}

unittest
{
Vector3 value;
value.send();
}

This is reasonable so far. However, an issue arises when we start to write more than one test. Should we want to test different behaviors of a fancy template, but instantiate it with the same type, then we end up spending more time in compilation than we would have expected. A lot more time. And we see large growth in the number of symbols emitted in the resulting binary, resulting in a larger file size than one would expect. Why is that?

Despite our intuition that the compiler should consider multiple declarations of a type like Vector3 in multiple unittest blocks as identical, it actually does not. We can demonstrate this effect with an extremely simple example. We’ll provide an implementation of send that prints at compile time the type of each instantiation. Then we’ll use static foreach to generate five distinct implementations of a single unit test.

void send(T)(T x)
{
pragma(msg, T);
}

// Generate 5 unittest blocks
static foreach(_; 0..5)
{
unittest
{
struct JustInt
{
int x;
}
JustInt value;
value.send;
}
}

This results in the following output from the compiler:

JustInt
JustInt
JustInt
JustInt
JustInt

Huh? Doesn’t this violate our “you can’t pay twice” rule? If you were to take this output from the compiler as gospel, then yes, but there’s a more subtle truth here.

Fully qualified names

The name of a type as you would write it in your editor is not the complete name of a type. Let’s amend the implementation of send to print the return value of a template called fullyQualifiedName rather than printing T directly. The rest of the example remains the same.

void send(T)(T x)
{
import std.traits : fullyQualifiedName;
pragma(msg, fullyQualifiedName!T);
}

Assuming the module is named example, this yields something like:

example.__unittest_L13_C3_1.JustInt
example.__unittest_L13_C3_2.JustInt
example.__unittest_L13_C3_3.JustInt
example.__unittest_L13_C3_4.JustInt
example.__unittest_L13_C3_5.JustInt

This explains our previous conundrum. By declaring the type locally in each test, we have actually declared a new type per test, each of which results in a new instantiation.

A type’s fully qualified name includes the name of its enclosing scope ({package-name.}module-name.{scope-name(s).}TypeName). The compiler rewrites each unittest as a unique function with a generated name. We have five unique functions, each with its own local, distinct declaration of a JustInt type. And so we end up with five distinct types.

We want to ensure that one instantiation is reused across unittest blocks. We do that by moving the declaration of JustInt to module scope, outside of the unit tests.

struct JustInt
{
int x;
}

static foreach(_; 0..5)
{
unittest
{
JustInt value;
value.send;
}
}

The send template now prints:

example.JustInt

Much better.

Some hard data

To collect some anecdata about the usefulness of these changes, we’ll look at compilation times and the size of compiled binaries. Since this template is very trivial, let’s generate a hundred copies of the same unittest rather than five so we can see a trend.

On my system, timing the compilation of our programs shows the locally declared types took 243ms to compile, but the version with a single global type declaration took 159ms to compile. A difference of 84ms is not all that much, sure, but in a large codebase, there may be a lot of these speedups waiting to be found. Any reduction in compile times is to be embraced, especially when it’s cumulative.

As for binary size, I saw a savings of 69K on disk. The quantity of machine code generated by the compiler is worth keeping a close eye on. Larger binaries mean more work for the linker, which in turn means more time waiting for builds to complete. The easiest job is the one you don’t have to do.

A more complex example

The following example demonstrates a very simple but fundamental change to a template that yields an enormous improvement in compile times and other metrics.

Let’s say we have a fairly simple interpreter, and we want to expose functions in our D code to the scripts executed by our interpreter. We can do that with some sort of registration function, which we’ll call register.

The signature of the register function

To prove the point I’m discussing, we don’t need to implement this function—its interface is what can cause a big slow down.

Let’s say our register function looks like this:

// Context is something our hypothetical interpreter works with
void register(alias func, string registeredName)(Context x); 

It’s pretty reasonable, right? It takes a template alias parameter that specifies the function to call (a common idiom in D) and a template value parameter of type string that represents the name of the function as it is exposed to scripts. The implementation of register will presumably map the value of registeredName to the func alias, and then scripts can call the function using that value. Functions can be registered with, e.g., the following:

context = createAContext();
context.register!(writeln!string, "writeln")();

The scripts can call the Phobos writeln function template using the name writeln.

The compile-time performance of the register template

The interface for register looks harmless, but it turns out that it has a significant impact on compile time. We can test this by registering some random functions. The actual contents of the functions don’t matter—this article is about template compile times, so we just want a baseline figure for roughly how much time the infrastructure templates take to compile rather than the code they are hooking together.

Although we will pull the functions out of a hat, the thing that will drive our intuition is to realize that a small number of interfaces will likely be reused many times. We could start with a basket of interface stubs like this:

int stub1(string) { return 1; }

int stub2(string) { return 2; }

int stub3(string) { return 3; }

// etc.

More broadly, with a bunch of functions that have identical signatures and a bunch of functions with random parameter lists and return types, we can get a rough baseline. With the set of stubs I used, compile times ended up at roughly 5 seconds.

So what happens if we move the compile-time parameters to run time? Since registeredName is a template value parameter, we can just move it into the function parameter list with no change. We have to handle the func parameter differently. Almost any symbol can bind at compile time to a template alias parameter, but symbols can’t bind at run time to function parameters. We have to use a function pointer instead. In that case, we can use the type of the referenced function as a template parameter.

void register(FuncType)(Context x, FuncType ptr, string registeredName);

With this signature, the compile time drops to roughly 1 second.

What’s going on?

D is a fairly fast language to compile. Good decisions have been made over the lifetime of the language to make that possible. It is also the case that one can happily write slow-to-compile D code. Although we are choosing to ignore the compilation speeds of the non-infrastructure code to simplify the point being made, this can actually (in a certain sense) be the case in real projects, too. As such it is worthwhile to pay attention not to the quantity of metaprogramming being done semantically but rather the quantity of metaprogramming being performed by the compiler.

In this case, with the first interface used for register, the compiler had no opportunity to reuse any instantiations. Because it accepted the registered functions as symbols, each instantiation was unique. By shifting instead to take the type of the registered function as a template parameter and a pointer to the function as a function parameter, the compiler could reuse instantiations. stub1, stub2, and stub3 are distinct symbols, but they each have the same type of (int function(string)).

To be clear, this is not an indictment of template alias parameters. There are good use cases for them (the Phobos algorithms API is an example). The point of this example is to show how the compile-time costs of unique template instantiations can be hidden. A decision about the trade-offs between compile-time and run-time performance can only be made if the programmer is aware there is a decision to make. So when implementing a template, consider how it will be used. If it’s going to end up creating many unique instantiations, then you can weigh the benefits of keeping that interface versus redesigning it to maximize reuse.

A false friend

In linguistics, a false friend is a pair of words from two different languages that look the same but have different meanings. I’m going to abuse this term by using it to refer to a pair of programming patterns that actually result in the same program behavior, but via different routes through the language implementation, i.e., one of these patterns has, say, worse performance or compile times than the other.

A simple example:

Let’s say we have a library that exposes a template as part of its API, like this one that prints a string when a module is loaded during run-time initialization:

template FunTemplate(string op)
{
shared static this()
{
import std.stdio;
writeln(op);
}
}

// Use like this
mixin FunTemplate!"Hello from DLang";

Now let’s say we want to refactor the library in some way that it’s desirable to distinguish the name FunTemplate and its implementation. How would you go about doing that?

One way would be to tack Impl onto the implementation name, then declare an eponymous template that aliases the shortened name to the implementation name and forwards the template argument like this:

template FunTemplate(string op)
{
alias FunTemplate = FunTemplateImpl!op;
}

This does the job, but it also creates an additional instantiation for each different value of op, one instance of FunTemplate, and one of FunTemplateImpl. So if we instantiate it with, e.g., five different values, we end up with ten unique instantiations. Now imagine doing that with a template that’s heavily used throughout a program.

Since we only want to provide an alternate name for the implementation and aren’t doing anything to the parameter list, we can achieve the same result without adding another template into the mix: just use alias by itself.

alias FunTemplate = FunTemplateImpl;

Since FunTemplate is no longer a template, FunTemplate!"Foo" only creates the one instance of FunTemplateImpl.

Normalization of template arguments

Once we know what we want a template to look like, and we’re satisfied with the interface we want it to have, there are sometimes subtle ways to separate the interface and implementation of a template such that we can minimize the total amount of work the compiler has to do.

The definition of “work” in this context can be important to consider, as we can find ways to balance a tradeoff between compile times and the amount of object code generated for each instantiation. One technique to reduce these costs is by normalizing a given list of template arguments into something called a canonical form.

Canonical Forms

A canonical form, resulting from a process called canonicalization, is a mathematical structure that is intended to reduce multiple different-looking but identical objects into one form that we can then manipulate as we see fit. Using an automatic code formatter is an example of transforming input (in this case, source code) into a canonical form.

Application to templates

Consider a template like this one:

template Expensive(Args...)
{
/* Some kind of expensive metaprogramming or code generation */
}

If we can think of a useful canonical form that isn’t too hard to compute, we can then write a second template Reduce to implement it, then inject it like in the following example.

template Expensive(Args...)
{
// Reduce to some kind of canonical form
alias reduced = Reduce!(Args);

// Where ExpensiveImpl is the same as above but renamed
alias Expensive = ExpensiveImpl!(reduced);
}

To be worth doing, ExpensiveImpl must be significantly more expensive than the reduction operation (pay attention to differences in sys time when measuring this), where “significant” is meant statistically rather than informally, i.e., any win is good as long as you can rigorously prove it’s real.

An example: sorting template arguments

Take a templated aggregate like this:

struct ExposeMethods(Types...)
{
/* Some kind of internal state dependant on Types but not their order */

static foreach(Type; Types) {
bool test(Type x) {
/* Something slow to compile depending on Type */
}
}
}

If it’s instantiated with, e.g., five different types across a large codebase, we could spend a lot of time redoing semantically identical compilation. If all possible types are used as input many times, we could end up with a few permutations, and if not we will probably get a few identical subsets.

A canonical form that might come to mind (i.e., a potential definition of Reduce) is simply sorting the arguments by their names. This can be achieved via the use of staticSort.

Conclusion

D has powerful metaprogramming and code generation features. But like anything in programming, their use isn’t free. If you want to avoid the situation where you find yourself making coffee while your project builds, then it’s imperative to be aware of the cost vs. benefits of the metaprogramming features you use. Then you can make intelligent decisions about your compile-time interfaces and implementations.

Appendix – Tracing the D compiler to count template instantiations

Here’s a simple lesson in Linux userspace tracing: you can use a tool like bpftrace or DTrace to spy on the D compiler compiling other things, so we can get basic figures about the compilation of other D programs without either hacking the compiler or changing their build process.

You’ll need a bpftrace file like the following (saved as e.g., main.bt):

BEGIN
{
printf("Tracing a D file\n");
}
uprobe:/home/mhh/dlang/dmd-2.097.0/linux/bin64/dmd:_Dmain
{
printf("This is the main\n");
}
uprobe:/home/mhh/dlang/dmd-2.097.0/linux/bin64/dmd:_D3dmd10dsymbolsem24templateInstanceSemanticFCQBs9dtemplate16TemplateInstancePSQCz6dscope5ScopePSQDr4root5array__T5ArrayTCQEq10expression10ExpressionZQBkZv
{
//We do nothing with the knowledge here but if you write some code you can get info about the templates relatively easily
printf("Instantiating a template\n");
}

What am I hooking here? That big mangled name in the middle of the script is templateInstanceSemantic in the dsymbolsem module in the DMD source. By hooking it, we can get a rough idea of when a template is being worked on.

Running it with sudo bpftrace main.bt (eBPF tracing currently requires root) when building DMD, for example, I see there are about 50,800 template hits.

You can use a more complicated script in a system like bcc to reconstruct the compiler’s internal data structures. With that, we can get output a bit more like 09:05:17 69310 b'/home/mhh/d_dev/dmd/src/dmd/errors.d:85' and actually reconstruct the source/line info (alongside a timestamp and PID).

Teaching D from Scratch: Is it a viable first language?

About two-and-a-half years ago, one of my son’s homeschooling friends asked if I would teach a coding class. Both of my kids have shown interest in coding via Scratch, and I also need to pull my weight when it comes to homeschooling them (my wife doing most of the instruction). So I thought, surely, this can’t be too hard. The challenges I would encounter over the course of the last few years surprised me more often than I would have expected!

First, a bit about my background. I’ve been writing software professionally for over 20 years, about half using system languages and half doing web development. I’ve dabbled in a lot of programming projects, including embedded microcontrollers with 256 bytes of RAM, collaborative video systems using high-end Unix workstations, and automating a factory testing rack-mounted appliances. What I haven’t ever done to any official degree is teach programming. For sure, this is a challenge that is novel to me, and also one that I have found very satisfying.

A bit about the kids. I had ten students, aged 9 – 15, all of whom had very little experience writing real software. Since this class was going to be a side project for me on top of my normal job, we went rather slowly, meeting once every two weeks.

One of the goals of my class was to try not to dumb down coding into an abstraction. I wanted the kids to experience real programming without too much of a pre-built hand-holding framework, and I believed they could do it. I’ve seen many (and purchased some) online resources for teaching to code that hide most of the real work and just focus on using custom libraries or simplified languages to learn the “basics”, while not actually teaching to be productive in a standard environment. I also witnessed firsthand my own kids getting bored when the material was presented too slowly or too abstractly.

Note: you can see some of my homework assignments and review pages on a dedicated website that I created for this class

First Try: Javascript and Lua

The first language we worked with, because many of the kids didn’t have actual full-blown computers to work on (this was pre-Covid so it was at my house), was Javascript. Javascript is possible to develop in a browser, using a tablet or a Chromebook, so all the kids I was teaching could participate. My focus in these early days was basic concepts like loops, branching, and basic statements. I think Javascript can be a great first language, though the things you can do with it as a beginner are a bit bland, and to do anything exciting you have to start learning the DOM (including how objects and methods work).

After half a year we moved on to our second language: Lua, specifically in the video game system Roblox. I almost wish we had skipped this one because it was a bit too advanced for these students, and I was not familiar with the language and the system to begin with. The kids did have fun with the model-building UI. However, I can see an experienced teacher finding good ways to instruct using this environment. Roblox, with its pre-built rendering and physics engines, has the nice benefit of allowing one to generate a really cool game with relatively little code—something new coders quite enjoy.

Just use D

And that leads us to the second year, where I decided to just try teaching what I know best: the D programming language. Since all the kids have a keen interest in building games, I searched for a simple library I could use to build games. I have to give a shout-out to the library I discovered, Raylib, and the YouTuber Ki Rill, whose very straightforward and well-made videos gave me the confidence that this also would be something I could teach.

For a quick demo of what Raylib looks like, here is a program I wrote in a few minutes that bounces a Raylib-drawn D-man around the window.

Everyone in the class is using Windows, so the toolchain that I am having them use is the stock DMD compiler and the Visual Studio Code development environment with the code-d extension. Throughout the last year-and-a-half, I have had the kids build such classics as hangman and tic-tac-toe using text-based “graphics”. After that came the introduction to 2D graphics and Raylib, where we built a brick-breaking game similar to Arkanoid and others.

Finally, I let the students pick what they wanted for the last project, and their pick was a rock-paper-scissors online tournament game. I’m not sure of the lasting power of that one…

Keep in mind that my goal was to teach them programming, not necessarily D. Working through all these projects allowed me to introduce a lot of the typical programming concepts (branching, looping, functions, data types, arrays, etc.). This year we are focusing on aggregates (specifically arrays and structs), and how they can be useful to encapsulate your coding solution into pieces that are easier to deal with. The game we are writing this year is a snake clone. I’m hoping to wrap things up on that by the end of the year, and then next year add in networking to have the game playable between the students (all the kids these days are into online tournament-style games).

My Take on Teaching With D

So how is D as a beginner’s programming language? I can say with confidence that the kids understand the flavor of D that I have taught just as well as the other languages. What flavor is that you may ask? Why vanilla of course! As D was the first typed language they were exposed to, I had to first explain types and why they are important. I also had to pick and choose which concepts in D I did not want to confuse them with. To that end, I’ve left out a lot of the advanced features (so far), such as:

• Templates
• Type inference
• __traits
• Classes/Interfaces/OOP
• Pointers/references (mostly)
• Memory management in general (thank you GC!)

This means that a lot of language features are missing from “Vanilla D”, and a lot of those are pieces I like. I had to force myself to avoid those things while teaching so as not to confuse them. I also try to avoid shortcuts if I can, as I want them to master the basic concepts before learning how to do it quicker or less verbosely. I tried hard to always use curly braces for scope blocks, for instance, instead of just writing single statements without them.

The features that I found the kids handled well were basic types like ints and strings (though what a funny name “string” is, that took a while to sink in), if statements and foreach/while loops, structs, and functions. They had a harder time understanding arrays and associative arrays, which might be due to my lack of experience teaching. I’ll also note that teaching virtually has some significant challenges—there’s just no substitute to being able to walk over to a student’s computer and observe and help their progress or draw out on a whiteboard how data is laid out. One surprising thing (to me anyway) that they handled completely in stride, was nested functions. That’s a nice D feature that is not available in C, barely in C++, and something that I personally took a while to get used to.

What Could Go Wrong?

The challenges were numerous. First I had to try and remember what it’s like to know nothing about programming. I did not take any classes on teaching programming and did not look up the “proper” way to teach it, I just went with what I thought would be the next best thing to do based on their skills and experience. Having class only once every two weeks is a big drawback since the students’ brains tend to garbage-collect some of the stuff we worked on last time. For sure, a daily or maybe twice-weekly class schedule would improve the situation, but that would mean I’d have to generate material to teach that many classes, which I unfortunately don’t have the time for.

Secondly, I previously had no experience with games programming. Raylib made this pretty easy, and as long as you can understand event loops, it’s pretty straightforward (and fun—I wish I had learned it sooner).

Aside from my personal and environmental shortcomings, I have some suggestions for the D language and community that would make D a much better “first language”. In no particular order:

Error Messages

Error messages in D are sometimes esoteric, and I’m not just talking about templates. The most common error the students make is bad punctuation. This leads to some of the worst messages. Most of the time, I need to explain what is happening, but even when explaining it myself, I am sometimes thwarted by the compiler. For instance, if you forget a semicolon:

void main()
{
import std.stdio;
int x = 5
writeln(x); // line 5
}

The resulting error is:

foo.d(5): Error: semicolon expected, not writeln

But wait, there is a semicolon on line 5! So why is it having a problem? Technically, putting an extra semicolon just before the writeln call on line 5 would solve the error (after all, D doesn’t have significant whitespace). But really, the compiler shouldn’t be suggesting that.

I’d rather see something like:

foo.d(5): Error: previous statement not terminated, perhaps you need
a semicolon on line 4?

What if you are missing a brace or have an extra one? That spits out a lot of error vomit that is sometimes really difficult to decipher. I know this isn’t as easy a problem to solve, but since the compiler is not even to the tricky semantic parts, what if it used some other hints of the source to try and tease out a suggested solution? Like maybe taking cues from the indentation.

A great feature of the D compiler is the spelling correction suggestions. But sometimes it’s not easy to see what has changed in the suggested spelling correction (especially when it’s just different capitalization). Why not highlight the changed letters? Since D has gained colorized output for errors, things have become much clearer in my opinion.

I do have to relay a comment from one of my students, who said he appreciated the ability of VS Code to jump to the line that the compiler error is referencing. This helps put into perspective where their mindset is.

Debugger on VSCode for Windows

Another challenge I’ve faced is the lack of a good debugging experience for VS Code on Windows. The C debugger for VS Code displays, for example, a string as a length and a C string (which might have trailing garbage). I’d rather not confuse them with this. I have read that Visual D has a better debugging experience, but I need dub support for these projects, and Visual D focuses on Visual Studio integration, something I don’t necessarily want to deal with in teaching these kids. I’m hoping that a recent focus on debugging (e.g., the LLDB integration project) will help.

I think WebFreak has done a great job on the VS Code plugin, and for the most part, it works amazingly well. Thank you! I know debugging is a piece that is hard to get right.

My next challenge is using dub for linking. Dub is great if you are only doing things with D and OS-supplied libraries. As soon as you need to depend on an external C library that is not part of your OS (such as Raylib), now you are not just adding dependencies but downloading pre-built libraries, or trying to build them yourself. I’m not sure if I have great suggestions on this, but I would like some way for dub to be told where to download these libs or have a central place to put them where the linker can find them outside of specifying the location in your dub.json file. Many libraries are released on GitHub, which makes it a prime candidate for downloading and installing external libraries (maybe even when ImportC becomes viable, it can automatically generate the bindings as well).

I spend an inordinate amount of time helping the kids set up their build environments, and having this built-in to the IDE or dub would be immensely helpful for using D for learning.

Suggestions for the Future

I am confident that D can be a viable first language. I have students that, prior to my class, had never before written real code now creating 2D games using D and Raylib (with a lot of hand-holding, darn it!). What I think D needs for this success to be scalable is a set of resources that assumes no programming experience. Most of D’s learning resources are aimed at teaching experienced programmers how D works in terms of their existing knowledge. They answer the question, “How do I do <insert feature> from <insert language> in D?” What we need is to answer the question, “How do I write code?”, that just happens to be using D.

Editor’s Note:Some readers may be thinking at this point of Ali Çehreli’s Programming in D. Though written for beginner-level programmers, it’s focused more on teaching the language than teaching to program generally.

This shouldn’t be too hard to create, as there are a ton of resources on learning to program in C already. D is so similar you could almost just replace C with D and call it a day. In fact, I’m considering creating a YouTube series on the topic with the experience I’ve gained. One important tool that needs creating is a map of D features and where they fall on the difficulty scale. This allows one to write D tutorials that follow a gradual introduction of features. Broaching the subject of complex features is probably needed early, as they will experience some of them (you are using a template whenever you use writeln, or to), but explanation of the entire feature may be deferred to a later level.

And finally, such materials should be engaging! I can’t stress enough how important it is to be doing something fun when you are learning to program. The kids definitely are more engaged since we are creating games that they might actually want to play vs. simple hello world examples or even Javascript web pages. A nice way to proceed might be to write a game in Vanilla D and then introduce features that allow you to refine and improve the code.

I hope this becomes a reality because I think this area of development has been relegated to either frameworks that attempt to water down programming to get kids engaged or sterile hello world-style programming that is intended for rote instruction or reference. I think we can show the full potential of D and, at the same time, engage eager learners with interesting and fun projects to trick them into learning code with our favorite language!

Interfacing D with C: Strings Part One

This post is part of an ongoing series on working with both D and C in the same project. The previous two posts looked into interfacing D and C arrays. Here, we focus on a special kind of array: strings. Readers are advised to read Arrays Part One and Arrays Part Two before continuing with this one.

The same but different

D strings and C strings are both implemented as arrays of character types, but they have nothing more in common. Even that one similarity is only superficial. We’ve seen in previous blog posts that D arrays and C arrays are different under the hood: a C array is effectively a pointer to the first element of the array (or, in C parlance, C arrays decay to pointers, except when they don’t); a D dynamic array is a fat pointer, i.e., a length and pointer pair. A D array does not decay to a pointer, i.e., it cannot be implicitly assigned to a pointer or bound to a pointer parameter in an argument list. Example:

extern(C) void metamorphose(int* a, size_t len);

void main() {
int[] a = [8, 4, 30];
metamorphose(a, a.length);      // Error - a is not int*
metamorphose(a.ptr, a.length);  // Okay
}

Beyond that, we’ve got further incompatibilities:

• each of D’s three string types, string, wstring, and dstring, are encoded as Unicode: UTF-8, UTF-16, and UTF-32 respectively. The C char* can be encoded as UTF-8, but it isn’t required to be. Then there’s the C wchar_t*, which differs in bit size between implementations, never mind encoding.
• all of D’s string types are dynamic arrays with immutable contents, i.e., string is an alias to immutable(char)[]. C strings are mutable by default.
• the last character of every C string is required to be the NUL character (the escape character \0, which is encoded as 0 in most character sets); D strings are not required to be NUL-terminated.

It may appear at first blush as if passing D and C strings back and forth can be a major headache. In practice, that isn’t the case at all. In this and subsequent posts, we’ll see how easy it can be. In this post, we start by looking at how we can deal with NUL termination and wrap up by digging deeper into the related topic of how string literals are stored in memory.

NUL termination

Let’s get this out of the way first: when passing a D string to C, the programmer must ensure it is terminated with \0. std.string.toStringz, a simple utility function in the D standard library (Phobos), can be employed for this:

import core.stdc.stdio : puts;
import std.string : toStringz;

void main() {
string s0 = "Hello C ";
string s1 = s0 ~ "from D!";
puts(s1.toStringz());
}

toStringz takes a single argument of type const(char)[] and returns immutable(char)* (there’s more about const vs. immutable in Part Two). The form s1.toStringz, known as UFCS (Uniform Function Call Syntax), is lowered by the compiler into toStringz(s1).

toStringz is the idiomatic approach, but it’s also possible to append "\0" manually. In that case, puts can be supplied with the string’s pointer directly:

import core.stdc.stdio : puts;

void main() {
string s0 = "Hello C ";
string s1 = s0 ~ "from D!" ~ "\0";
puts(s1.ptr);
}

Forgetting to use .ptr will result in a compilation error, but forget to append the "\0" and who knows when someone will catch it (possibly after a crash in production and one of those marathon debugging sessions which can make some programmers wish they had never heard of programming). So prefer toStringz to avoid such headaches.

However, because strings in D are immutable, toStringz does allocate memory from the GC heap. The same is true when manually appending "\0" with the append operator. If there’s a requirement to avoid garbage collection at the point where the C function is called, e.g., in a @nogc function or when -betterC is enabled, it will have to be done in the same manner as in C, e.g., by allocating/reallocating space with malloc/realloc (or some other allocator) and copying the NUL terminator. (Also note that, in some situations, passing pointers to GC-managed memory from D to C can result in unintended consequences. We’ll dig into what that means, and how to avoid it, in Part Two.)

None of this applies when we’re dealing directly with string literals, as they get a bit of special treatment from the compiler that makes puts("Hello D from C!".toStringz) redundant. Let’s see why.

String literals in D are special

D programmers very often find themselves passing string literals to C functions. Walter Bright recognized early on how common this would be and decided that it needed to be just as seamless in D as it is in C. So he implemented string literals in a way that mitigates the two major incompatibilities that arise from NUL terminators and differences in array internals:

1. D string literals are implicitly NUL-terminated.
2. D string literals are implicitly convertible to const(char)*.

These two features may seem minor, but they are quite major in terms of convenience. That’s why I didn’t pass a literal to puts in the toStringz example. With a literal, it would look like this:

import core.stdc.stdio : puts;

void main() {
puts("Hello C from D!");
}

No need for toStringz. No need for manual NUL termination or .ptr. It just works.

I want to emphasize that this only applies to string literals (of type string, wstring, and dstring) and not to string variables; once a string literal is included in an expression, the NUL-termination guarantee goes out the window. Also, no other array literal type is implicitly convertible to a pointer, so the .ptr property must be used to bind them to a pointer function parameter, e.g., giveMeIntPointer([1, 2, 3].ptr).

But there is a little more to this story.

String literals in memory

Normal array literals will usually trigger a GC allocation (unless the compiler can elide the allocation, such as when assigning the literal to a static array). Let’s do a bit of digging to see what happens with a D string literal:

import std.stdio;

void main() {
writeln("Where am I?");
}

To make use of a command-line tool particularly convenient for this example, I compiled the above on 64-bit Linux with all three major compilers using the following command lines:

dmd -ofdmd-memloc memloc.d
gdc -o gdc-memloc memloc.d
ldc2 -ofldc-memloc memloc.d

If we were compiling C or C++, we could expect to find string literals in the read-only data segment, .rodata, of the binary. So let’s look there via the readelf command, which allows us to extract specific data from binaries in the elf object file format, to see if the same thing happens with D. The following is abbreviated output for each binary:

readelf -x .rodata ./dmd-memloc | less
Hex dump of section '.rodata':
0x0008e000 01000200 00000000 00000000 00000000 ................
0x0008e010 04100000 00000000 6d656d6c 6f630000 ........memloc..
0x0008e020 57686572 6520616d 20493f00 2f757372 Where am I?./usr
0x0008e030 2f696e63 6c756465 2f646d64 2f70686f /include/dmd/pho
...

readelf -x .rodata ./gdc-memloc | less
Hex dump of section '.rodata':
0x00003000 01000200 00000000 57686572 6520616d ........Where am
0x00003010 20493f00 00000000 2f757372 2f6c6962  I?...../usr/lib
...

readelf -x .rodata ./ldc-memloc | less
Hex dump of section '.rodata':
0x00001e40 57686572 6520616d 20493f00 00000000 Where am I?.....
0x00001e50 2f757372 2f6c6962 2f6c6463 2f783836 /usr/lib/ldc/x86

In all three cases, the string is right there in the read-only data segment. The D spec explicitly avoids specifying where a string literal will be stored, but in practice, we can bank on the following: it might be in the binary’s read-only segment, or it might be in the normal data segment, but it won’t trigger a GC allocation, and it won’t be allocated on the stack.

Wherever it is, there’s a positive consequence that we can sometimes take advantage of. Notice in the readelf output that there is a dot (.) immediately following the question mark at the end of each string. That represents the NUL terminator. It is not counted in the string’s .length (so "Where am I?".length is 11 and not 12), but it’s still there. So when we initialize a string variable with a string literal or assign a string literal to a variable, the lack of an allocation also means there’s no copying, which in turn means the variable is pointing to the literal’s location in memory. And that means we can safely do this:

import core.stdc.stdio: puts;

void main() {
string s = "I'm NUL-terminated.";
puts(s.ptr);
s = "And so am I.";
puts(s.ptr);
}

If you’ve read the GC series on this blog, you are aware that the GC can only have a chance to run a collection if an attempt is made to allocate from the GC heap. More allocations mean a higher chance to trigger a collection and more memory that needs to be scanned when a collection runs. Many applications may never notice, but it’s a good policy to avoid GC allocations when it’s easy to do so. The above is a good example of just that: toStringz allocates, we don’t need it in either call to puts because we can trust that s is NUL-terminated, so we don’t use it.

To be very clear: this is only true for string variables that have been directly initialized with a string literal or assigned one. If the value of the variable was the result of any other operation, then it cannot be considered NUL-terminated. Examples:

string s1 = s ~ "...I'm Unreliable!!";
string s2 = s ~ s1;
string s3 = format("I'm %s!!", "Unreliable");

None of these strings can be considered NUL-terminated. Each case will trigger a GC allocation. The runtime pays no mind to the NUL terminator of any of the literals during the append operations or in the format function, so the programmer can’t trust it will be part of the result. Pass any one of these strings to C without first terminating it and trouble will eventually come knocking.

But hold on…

Given that you’re reading a D blog, you’re probably adventurous or like experimenting. That may lead you to discover another case that looks reliable:

import core.stdc.stdio: puts;

void main() {
string s = "Am I " ~ "reliable?";
puts(s.ptr);
}

The above very much looks like appending multiple string literals in an initialization or assignment is just as reliable as using a single string literal. We can strengthen that assumption with the following:

import std.stdio : writeln;

void main() {
writeln("Am I reliable?".ptr);

string s = "Am I " ~ "reliable?";
writeln(s.ptr);
}

writeln is a templated function that recognizes when it’s being given a pointer; rather than treating it as a string and printing what it points to, it prints the pointer’s value. So we can print memory addresses in D without a format string.

Compiling the above, again on 64-bit Linux:

dmd -ofdmd-rely rely.d
gdc -o gdc-rely rely.d
ldc2 -ofldc-rely rely.d

Now let’s execute them all:

./dmd-rely
562363F63010
562363F63030

./gdc-rely
5566145E0008
5566145E0008

./ldc-rely
55C63CFB461C
55C63CFB461C

We see that dmd-rely prints two different addresses, but they’re very close together. Both gdc-rely and ldc-rely print a single address in both cases. And if we make use of readelf as we did with the memloc example above, we’ll find that, in every case, the literals are in the read-only data segment. Case closed!

Well, not so fast.

What’s happening is that all three compilers are performing an optimization known as constant folding. In short, they can recognize when all operands of an append expression are compile-time constants, so they can perform the append at compile-time to produce a single string literal. In this case, the net effect is the same as s = "Am I reliable?". LDC and GDC go further and recognize that the resulting literal is identical to the one used earlier, so they reuse the existing literal’s address (a.k.a. string interning). (Note that DMD also performs string interning, but currently it only kicks in when a string literal appears more than twice.)

To be clear: this only works because all of the operands are string literals. No matter how many string literals are involved in an operation, if only one operand is a variable, then the operation triggers a GC allocation.

Although we see that the result of an append operation involving string literals can be passed directly to C just fine, and we’ve proven that it’s stored in read-only memory alongside its NUL terminator, this is not something we should consider reliable. It’s an optimization that no compiler is required to perform. Though it’s unlikely that any of the three major D compilers will suddenly stop constant folding string literals, a future D compiler could possibly be released without this particular optimization and instead trigger a GC allocation.

In short: rely on this at your own risk.

Addendum: Compile rely.d on Windows with dmd and the binary will yield some very different output:

dmd -m64 -ofwin-rely.exe rely.d
./win-rely
7FF76701D440
7FF76702BB30

There is a much bigger difference in the memory addresses here than in the dmd binary on Linux. We’re dealing with the PE/COFF format in this case, and I’m not familiar with anything similar to readelf for that format on Windows. But I do know a little something about Abner Fog’s objconv utility. Not only does it convert between object file formats, but it can also disassemble them:

objconv -fasm win-rely.obj

This produces a file, win-rely.asm. Open it in a text editor and search for a portion of the string, e.g., "I rel". You’ll find the two entries aren’t too far apart, but one is located in a block of text under this heading:

rdata SEGMENT PARA ‘CONST’ ; section number 4

And the other under this heading:

.data$B SEGMENT PARA ‘DATA’ ; section number 6 In other words, one of them is in the read-only data segment (rdata SEGMENT PARA 'CONST'), and the other is in the regular data segment. This goes back to what I mentioned earlier about the D spec being explicitly silent on where string literals are stored. Regardless, the behavior of the program on Windows is the same as it is on Linux; the second call to puts doesn’t blow anything up because the NUL terminator is still there, one slot past the last character. But it doesn’t change the fact that constant folding of appended string literals is an optimization and only to be relied upon at your own risk. Conclusion This post provides all that’s needed for many of the use cases encountered with strings when interacting with C from D, but it’s not the complete picture. In Part Two, we’ll look at how mutability, immutability, and constness come into the picture, how to avoid a potential problem spot that can arise when passing GC-allocated D strings to C, and how to get D strings from C strings. We’ll save encoding for Part Three. Thanks to Walter Bright, Ali Çehreli, and Iain Buclaw for their valuable feedback on this article. Symphony of Destruction: Structs, Classes and the GC (Part One) This post is part of a broader series on garbage collection in D. The motivation is to explore how destructors and the GC interact. To do that, we first need a bit of background. We do not go into a broader discussion on the ins and outs of object destruction, only what is most relevant to the interaction of destructors and the GC. I’ve split the discussion into two blog posts. Here in Part One, we look at how deterministic and non-deterministic destruction differ, consider the consequences of having a single destructor for both scenarios, and finally establish two simple guidelines that will help us avoid those consequences. In Part Two, we’ll go further and explore how we can still write solid destructors when circumstances dictate that the guidelines don’t apply. Deterministic destruction Destruction is deterministic when it is predictable, meaning the programmer can, simply by following the flow of the code, point out where and when an object’s destructor is invoked. This is possible with struct instances allocated on the stack, as the compiler will insert calls to their destructors at well-defined points for automatic and deterministic destruction. There are two basic rules for automatic destruction: 1. The destructors of all stack-allocated structs in a given scope are invoked when the scope exits. 2. Destructors are invoked in reverse lexical order (i.e., the opposite of the order in which the declarations appear in the source). With these two rules in mind, we can examine the following example and accurately predict its output. import std.stdio; struct Predictable { int number; this(int n) { writeln("Constructor #", n); number = n; } ~this() { writeln("Destructor #", number); } } void main() { Predictable s0 = Predictable(0); { Predictable s1 = Predictable(1); } Predictable s2 = Predictable(2); } We see that both s0 and s2 are directly within the scope of the main function, so their destructors will run when main exits. Given that the declaration of s2 comes after that of s0, the destructor of s2 will run before that of s0. We also see that s1 is declared in an anonymous inner scope between the declarations of s0 and s2. This scope exits before s2 is constructed, so the destructor of s1 will execute before the constructor of s2. With that, we can expect the following output: Constructor #0 // declaration of s0 Constructor #1 // declaration of s1 Destructor #1 // anonymous scope exits, s1 destroyed Constructor #2 // declaration of s2 Destructor #2 // main exits, s2 then s0 destroyed Destructor #0 Compiling and executing the example proves us accurate seers. The programmer can implement deterministic destruction manually, as is necessary when destroying instances allocated on the non-GC heap, e.g., with malloc or std.experimental.allocator. In an earlier post, Go Your Own Way (Part Two: The Heap), I covered how to use std.conv.emplace to allocate instances on the non-GC heap and briefly mentioned that destructors can be invoked manually via destroy. That’s a function template declared in the automatically imported object module so that it’s always available. We won’t retread the allocation discussion, but an example of manual destruction isn’t out of bounds for this post. In the following example, we’ll reuse the definition of the Predictable struct and a destroyPredictable function to manually invoke the destructors. For completeness, I’ve included functions for allocating and deallocating Predictable instances from the non-GC heap: allocatePredictable and deallocatePredictable. If it isn’t clear to you what these two functions are doing, please read the blog post I mentioned above. void main() { Predictable* s0 = allocatePredictable(0); scope(exit) { destroyPredictable(s0); } { Predictable* s1 = allocatePredictable(1); scope(exit) { destroyPredictable(s1); } } Predictable* s2 = allocatePredictable(2); scope(exit) { destroyPredictable(s2); } } void destroyPredictable(Predictable* p) { if(p) { destroy(*p); deallocatePredictable(p); } } Predictable* allocatePredictable(int n) { import core.stdc.stdlib : malloc; import std.conv : emplace; auto p = cast(Predictable*)malloc(Predictable.sizeof); return emplace!Predictable(p, n); } void deallocatePredictable(Predictable* p) { import core.stdc.stdlib : free; free(p); } Running this program will result in precisely the same output as the previous example. In the destroyPredictable function, we dereference the struct pointer when calling destroy because there is no overload that takes a pointer. There are specializations for classes, interfaces, and structs passed by reference and a general catch-all that takes all other types by reference. Destructors are invoked on types that have them. Before exiting, the function sets the argument to its default .init value through the reference. Note that if we were to give destroy a pointer without first dereferencing it, the code would still compile. The pointer would be accepted by reference and simply set to null, the default .init value for pointers, but the struct’s destructor would not be invoked (i.e., the pointer is “destroyed”, not the struct instance). Inserting writeln(*p) immediately after destroy(*p) should print Predictable(0) for each destroyed instance. (The default .init state for a struct in D is the aggregate of the .init property of each of its members; in this case, the sole member, being of type int, has an .init property of 0, so the struct’s default .init state is Predictable(0). This can be changed in the struct definition, e.g., struct Predictable { int id = 1; }.) destroy is not restricted to instances allocated on the non-GC heap. Any aggregate type instance (struct, class, or interface) is a valid argument no matter where it was allocated. Non-deterministic destruction In languages with support for objects and a garbage collector, the responsibility for destroying object instances allocated on the GC heap falls to the GC. This is known as finalization. Before reclaiming an object’s memory, the GC finalizes the object by invoking its finalizer. Finalization, though convenient, comes with a price. In Java’s particular circumstances, the price was determined to be so high that its maintainers deprecated the Object.finalize method and left a scary warning about its use in the documentation. It’s worth quoting here: The finalization mechanism is inherently problematic. Finalization can lead to performance issues, deadlocks, and hangs. Errors in finalizers can lead to resource leaks; there is no way to cancel finalization if it is no longer necessary; and no ordering is specified among calls to finalize methods of different objects. Furthermore, there are no guarantees regarding the timing of finalization. The finalize method might be called on a finalizable object only after an indefinite delay, if at all. Finalization in D isn’t quite the bugbear it is in Java, but we do see a less dramatic warning about it in the D documentation: The garbage collector is not guaranteed to run the destructor for all unreferenced objects. Furthermore, the order in which the garbage collector calls destructors for unreferenced objects is not specified. Although there’s no mention of “finalization” or “finalizers” here, that’s precisely what the text is referring to. The core message is the same in both warnings: finalization is non-deterministic and cannot be relied on. Unlike structs, classes in D are reference types by default. Some consequences: the programmer never has direct access to the underlying class instance; instances declared uninitialized are null by default; the normal use case is to allocate instances via new. When a class is instantiated in D, it is usually going to be managed by the GC and its destructor will serve as a finalizer. As an experiment, let’s change the definition of struct Predictable in our first example to class Unpredictable and use new to allocate the instances like so: import std.stdio; class Unpredictable { int number; this(int n) { writeln("Constructor #", n); number = n; } ~this() { writeln("Destructor #", number); } } void main() { Unpredictable s0 = new Unpredictable(0); { Unpredictable s1 = new Unpredictable(1); } Unpredictable s2 = new Unpredictable(2); } We’ll see that the output is drastically different: Constructor #0 Constructor #1 Constructor #2 Destructor #0 Destructor #1 Destructor #2 Anyone familiar with the characteristics of the default DRuntime constructor can predict for this very simple program that all the destructors will be run when the GC’s cleanup function is executed as the D runtime shuts down, and that they will be executed in the order in which they were declared (an implementation detail; and note that destruction at shut down can be disabled via a command line argument). But in a more complex program, this ability to predict breaks down. Destructors can be invoked by the GC at almost any time and in any order. To be clear, the GC will only perform its cleanup duties if and when it finds more memory is needed to fulfill a specific allocation request. In other words, it isn’t constantly running in the background, marking objects unreachable and calling destructors willy nilly. To that extent, we can predict when the GC has the possibility to perform its duties. Beyond that, all bets are off. We cannot predict with accuracy if any destructors will be invoked during any given allocation request or the order in which they will be invoked. This uncertainty has ramifications for how one implements destructors for any GC-managed type. For starters, destructors of GC-managed objects should never perform any operation that can potentially result in a GC allocation request. Attempting to do so can result in an InvalidMemoryOperationError at run time. I use the word “potentially” because some operations can indirectly cause the error in certain circumstances, but not in others. Some examples: attempting to index an associative array can trigger an attempt to allocate a RangeError if the key is not present; a failed assert will result in allocation of an AssertError; calling any function not annotated with @nogc means GC operations are always possible in the call stack. These and any such operations should be avoided in the destructors of GC-managed objects. (The first seven items on the list of operations disallowed in @nogc functions are collectively a good guide.) A larger issue is that one cannot rely on any resource still being valid when a destructor is called by the GC. Consider a class that attempts to close a socket handle in its destructor; it’s quite possible that the destructor won’t be called until after the program has already shutdown the network interface. There is no scenario in which the runtime can catch this. In the best case, such circumstances will result in a silent failure, but they could also result in crashes during program shutdown or even sooner. What it comes down to is that GC-allocated objects should never be used to manage any resource that depends upon deterministic destruction for cleanup. Designing for destruction For the D neophyte, it can appear as if destructors in D are useless. Given that both struct and class instances can be allocated from memory that may or may not be managed by the GC, that destructors of GC-managed objects are not guaranteed to run, and that destructors are forbidden to perform GC operations during finalization, how can we ever rely on them? In practice, it’s not as bad as it may seem. Issues do arise for the unwary, but armed with a basic awareness of the nature of D destructors, it turns out that it’s pretty easy to avoid having problems. This is especially true if the programmer adopts two fairly simple rules. 1. Pretend class destructors don’t exist Class instances will nearly always be allocated with new. That means their destructors will nearly always be non-deterministic. Much of what one would want to do in a destructor is somehow dependent on program state: either the destructor itself expects a certain state (like writing to a log file that is expected to be open), or the program expects the destructor to have modified the program state in a specific manner (like releasing a resource handle). Non-deterministic destruction means that all expectations about program state are thrown out the window. That log file might have already been closed, so the message will never be written (I hope it wasn’t important). That system resource handle may never be released until the program ends (I hope that particular resource isn’t scarce). Even if it seems through testing that a class destructor is working the way it’s intended, it’s quite likely down to the fact that the testing has not uncovered the case where it breaks. In a long-running program, that case will inevitably pop up at some point. Have fun debugging when your production game server starts randomly crashing. So when using classes in D, pretend they have no destructors. Pretend that they are Java classes with a deprecated finalize method. 2. Don’t allocate structs on the GC heap when they have destructors Since we’re pretending classes have no destructors, then we’re going to turn to structs for all of our destructive needs. Allocating structs as value objects on the stack will cover many use cases, but sometimes we may need to allocate them on the heap. When that situation arises, do not allocate any destructor-bearing struct with new. If we allocate a struct that has a destructor on the GC heap, it completely defeats our purpose of avoiding class destructors in the first place. That destructor we intended to be deterministic is now non-deterministic, so we may as well have just used a class. As we have seen, struct instances can be allocated on the non-GC heap (e.g., with malloc) and their destructors manually invoked with the destroy function. If we need deterministic destruction and we absolutely must have a heap-allocated struct, then we cannot allocate that struct on the GC heap. Guidelines schmuidelines I’m sure someone is reading the above guidelines and thinking, “If I have to pretend that classes have no destructors, then why do classes have destructors?” Well, you don’t have to. There is no One True Path to follow when deciding if an object should be implemented as a class or struct. Personally, I will always prefer structs over classes, and I will only reach for a class if I need something structs can’t give me easily (like a hierarchy) or efficiently. Other people will consider if the object they need to represent has an identity, e.g., an Actor in a simulation versus the Vertex that defines its 3D coordinates. POD (Plain Old Data) types should always be structs, but beyond that it’s largely a matter of preference. My two guidelines are based on my experience and that of others with whom I’ve spoken. They are intended to help you keep the full implications of D’s distinction between classes and structs at the forefront of your thoughts when architecting your program. They are not commandments that every D programmer must follow. Realistically, most D programmers will encounter circumstances at one time or another in which the guidelines do not apply. For example, when mixing GC-managed memory and manually-managed memory in the same program, it’s quite possible for a struct intended for stack use to wind up on the GC heap if the programmer is unaware of the circumstances. And some D programmers will always prefer classes over structs because that’s just the way they want it, and so will simply choose to ignore the guidelines. That’s no problem as long as they fully understand the consequences. So what does that mean? How do you get over the non-deterministic nature of class destructors if your Actor class absolutely must have a destructor, or if you prefer to always follow The Way of the Class? How do you prevent structs intended for the stack from being GC-allocated? These are things we’ll be looking at in Part Two. See you there. Thanks to Ali Çehreli and Max Haughton for their feedback. And to Adam D. Ruppe for his conversation on the topic in Discord and the title suggestion (it fits in more nicely with the series than the ‘Appetite For Destruction’ I had intended to go with) Function Generation in D: The Good, the Bad, the Ugly, and the Bolt Introduction A while ago, Andrei Alexandrescu started a thread in the D Programming Language forums, titled “Perfect forwarding”, about a challenge which came up during the July 2020 beerconf: Write an idiomatic template forward that takes an alias fun and defines (generates) one overload for each overload of fun. Several people proposed solutions. In the course of the discussion, it arose that there is sometimes a need to alter a function’s properties, like adding, removing, or hanging user-defined attributes or parameter storage classes. This was exactly the problem I struggled with while trying to support all the bells and whistles of D functions in my openmethods library. Eventually, I created a module that helps with this problem and contributed it to the bolts meta-programming library. But before we get to that, let’s first take a closer look at function generation in D. The Good I speculate that many a programmer who is a moderately advanced beginner in D would quickly come up with a mostly correct solution to the “Perfect forwarding” challenge, especially those with a C++ background who have an interest in performing all sorts of magic tricks by means of template meta-programming. The solution will probably look like this: template forward(alias fun) { import std.traits: Parameters; static foreach ( ovl; __traits(getOverloads, __traits(parent, fun), __traits(identifier, fun))) { auto forward(Parameters!ovl args) { return ovl(args); } } } ... int plus(int a, int b) { return a + b; } string plus(string a, string b) { return a ~ b; } assert(forward!plus(1, 2) == 3); // pass assert(forward!plus("a", "b") == "ab"); // pass This solution is not perfect, as we shall see, but it is not far off either. It covers many cases, including some that a beginner may not even be aware of. For example, forward handles the following function without dropping function attributes or parameter storage classes: class Matrix { ... } Matrix times(scope const Matrix a, scope const Matrix b) pure @safe { return ...; } pragma(msg, typeof(times)); // pure @safe Matrix(scope const(Matrix) a, scope const(Matrix) b) pragma(msg, typeof(forward!times)); // pure @safe Matrix(scope const(Matrix) _param_0, scope const(Matrix) _param_1) It even handles user-defined attributes (UDAs) on parameters: struct testParameter; void testPlus(@testParameter int a, @testParameter int b); pragma(msg, typeof(testPlus)); // void(@(testParameter) int a, @(testParameter) int b) pragma(msg, typeof(forward!testPlus)); // void(@(testParameter) int a, @(testParameter) int b) Speaking of UDAs, that’s one of the issues with the solution above: it doesn’t carry function UDAs. It also doesn’t work with functions that return a reference. Both issues are easy to fix: template forward(alias fun) { import std.traits: Parameters; static foreach (ovl; __traits(getOverloads, __traits(parent, fun), __traits(identifier, fun))) { @(__traits(getAttributes, fun)) // copy function UDAs auto ref forward(Parameters!ovl args) { return ovl(args); } } } This solution is still not 100% correct though. If the forwardee is @trusted, the forwarder will be @safe: @trusted void useSysCall() { ... } pragma(msg, typeof(&useSysCall)); // void function() @trusted pragma(msg, typeof(&forward!useSysCall)); // void function() @safe This happens because the body of the forwarder consists of a single statement calling the useSysCall function. Since calling a trusted function is safe, the forwarder is automatically deemed safe by the compiler. The Bad However, Andrei’s challenge was not exactly what we discussed in the previous section. It came with a bit of pseudocode that suggested the template should not be eponymous. In other words, I believe that the exact task was to write a template that would be used like this: forward!fun.fun(...). Here is the pseudocode: // the instantiation of forward!myfun would be (stylized): template forward!myfun { void myfun(int a, ref double b, out string c) { return myfun(a, b, c); } int myfun(in string a, inout double b) { return myfun(a, b); } } Though this looks like a small difference, if we want to implement exactly this, a complication arises. In the eponymous forward, we did not need to create a new identifier; we simply used the template name as the function name. Thus, the function name was fixed. Now we need to create a function with a name that depends on the forwardee’s name. And the only way to do this is with a string mixin. The first time I had to do this, I tried the following: template forward(alias fun) { import std.format : format; import std.traits: Parameters; enum name = __traits(identifier, fun); static foreach (ovl; __traits(getOverloads, __traits(parent, fun), name)) { @(__traits(getAttributes, fun)) auto ref mixin(name)(Parameters!ovl args) { return ovl(args); } } } This doesn’t work because a string mixin can only be used to create expressions or statements. Therefore, the solution is to simply expand the mixin to encompass the entire function definition. The token-quote operator q{} is very handy for this: template forward(alias fun) { import std.format : format; import std.traits: Parameters; enum name = __traits(identifier, fun); static foreach (ovl; __traits(getOverloads, __traits(parent, fun), name)) { mixin(q{ @(__traits(getAttributes, fun)) auto ref %s(Parameters!ovl args) { return ovl(args); } }.format(name)); } } Though string mixins are powerful, they are essentially C macros. For many D programmers, resorting to a string mixin can feel like a defeat. Let us now move on to a similar, yet significantly more difficult, challenge: Write a class template that mocks an interface. For example: interface JsonSerializable { string asJson() const; } void main() { auto mock = new Mock!JsonSerializable(); } Extrapolating the techniques acquired during the previous challenge, a beginner would probably try this first: class Mock(alias Interface) : Interface { import std.format : format; import std.traits: Parameters; static foreach (member; __traits(allMembers, Interface)) { static foreach (fun; __traits(getOverloads, Interface, member)) { mixin(q{ @(__traits(getAttributes, fun)) auto ref %s(Parameters!fun args) { // record call static if (!is(ReturnType!fun == void)) { return ReturnType!fun.init; } } }.format(member)); } } } Alas, this fails to compile, throwing errors like: Error: function challenge.Mock!(JsonSerializable).Mock.asJson return type inference is not supported if may override base class function In other words, auto cannot be used here. We have to fall back to explicitly specifying the return type: class Mock(alias Interface) : Interface { import std.format : format; import std.traits: Parameters, ReturnType; static foreach (member; __traits(allMembers, Interface)) { static foreach (fun; __traits(getOverloads, Interface, member)) { mixin(q{ @(__traits(getAttributes, fun)) ReturnType!fun %s(Parameters!fun args) { // record call static if (!is(ReturnType!fun == void)) { return ReturnType!fun.init; } } }.format(member)); } } } This will not handle ref functions though. What about adding a ref in front of the return type, like we did in the first challenge? // as before ref ReturnType!fun %s(Parameters!fun args) ... This will fail with all the functions in the interface that do not return a reference. The reason why everything worked almost magically in the first challenge is that we called the wrapped function inside the template. It enabled the compiler to deduce almost all of the characteristics of the original function and copy them to the forwarder function. But we have no model to copy from here. The compiler will copy some of the aspects of the function (pure, @safe, etc.) to match those of the overriden function, but not some others (ref, const, and the other modifiers). Then, there is the issue of the function modifiers: const, immutable, shared, and static. These are yet another category of function “aspects”. At this point, there is no other option than to analyze some of the function attributes by means of traits, and convert them to a string to be injected in the string mixin:  mixin(q{ @(__traits(getAttributes, fun)) %sReturnType!fun %s(Parameters!fun args) { // record call static if (!is(ReturnType!fun == void)) { return ReturnType!fun.init; } } }.format( (functionAttributes!fun & FunctionAttribute.const_ ? "const " : "") ~ (functionAttributes!fun & FunctionAttribute.ref_ ? "ref " : "") ~ ..., member)); } If you look at the implementation of std.typecons.wrap, you will see that part of the code deals with synthesizing bits of a string mixin for the storage classes and modifiers. The Ugly So far, we have looked at the function storage classes, modifiers, and UDAs, but we have merely passed the parameter list as a single, monolithic block. However, sometimes we need to perform adjustments to the parameter list of the generated function. This may seem far-fetched, but it does happen. I encountered this problem in my openmethods library. During the “Perfect forwarding” discussion, it appeared that I was not the only one who wanted to do this. I won’t delve into the details of openmethods here (see an older blog post for an overview of the module); for the purpose of this article, it suffices to say that, given a function declaration like this one: Matrix times(virtual!Matrix a, double b); openmethods generates this function: Matrix dispatcher(Matrix a, double b) { return resolve(a)(a, b); } The virtual template is a marker; it indicates which parameters should be taken into account (i.e., passed to resolve) when picking the appropriate specialization of times. Note that only a is passed to the resolve function—that is because the first parameter uses the virtual! marker and the second does not. Bear in mind, though, that dispatcher is not allowed to use the type of the parameters directly. Inside the openmethods module, there is no Matrix type. Thus, when openmethods is handed a function declaration, it needs to synthesize a dispatcher function that refers to the declaration’s parameter types exclusively via the declaration. In other words, it needs to use the ReturnType and Parameters templates from std.traits to extract the types involved in the declaration – just like we did in the examples above. Let’s put aside function attributes and UDAs – we already discussed those in the previous section. The obvious solution then seems to be: ReturnType!times dispatcher( RemoveVirtual!(Parameters!times[0]) a, Parameters!times[1] b) { return resolve(a)(a, b); } pragma(msg, typeof(&dispatcher)); // Matrix function(Matrix, double) where RemoveVirtual is a simple template that peels off the virtual! marker from the type. Does this preserve parameter storage classes and UDAs? Unfortunately, it does not: @nogc void scale(ref virtual!Matrix m, lazy double by); @nogc ReturnType!scale dispatcher(RemoveVirtual!(Parameters!scale[0]) a, Parameters!scale[1] b) { return resolve(a)(a, b); } pragma(msg, typeof(&dispatcher)); // void function(Matrix a, double b) We lost the ref on the first parameter and the lazy on the second. What happened to them? The culprit is Parameters. This template is a wrapper around an obscure feature of the is operator used in conjunction with the __parameters type specialization. And it is quite a strange beast. We used it above to copy the parameter list of a function, as a whole, to another function, and it worked perfectly. The problem is what happens when you try to process the parameters separately. Let’s look at a few examples: pragma(msg, Parameters!scale.stringof); // (ref virtual!(Matrix), lazy double) pragma(msg, Parameters!scale[0].stringof); // virtual!(Matrix) pragma(msg, Parameters!scale[1].stringof); // double We see that accessing a parameter individually returns the type… and discards everything else! There is actually a way to extract everything about a single parameter: use a slice instead of an element of the paramneter pack (yes, this is getting strange): pragma(msg, Parameters!scale[0..1].stringof); // (ref virtual!(Matrix)) pragma(msg, Parameters!scale[1..2].stringof); // (lazy double) So this gives us a solution for handling the second parameter of scale: ReturnType!scale dispatcher(???, Parameters!scale[1..2]) { ... } But what can we put in place of ???. RemoveVirtual!(Parameters!scale[0..1]) would not work. RemoveVirtual expects a type, and Parameters!scale[1..2] is not a type—it is a sort of conglomerate that contains a type, and perhaps storage classes, type constructors, and UDAs. At this point, we have no other choice but to construct a string mixin once again. Something like this: mixin(q{ %s ReturnType!(scale) dispatcher( %s RemoveVirtual!(Parameters!(scale)[1]) a, Parameters!(scale)[1..2] b) { resolve(a)(a, b); } }.format( functionAttributes!scale & FunctionAttribute.nogc ? "@nogc " : "" /* also handle other function attributes */, __traits(getParameterStorageClasses, scale, 0))); pragma(msg, typeof(dispatcher)); // @nogc void(ref double a, lazy double) This is not quite sufficient though, because it still doesn’t take care of parameter UDAs. To Boltly Refract… openmethods once contained kilometers of mixin code like the above. Such heavy use of string mixins was too ugly and messy, so much so that the project began to feel less like fun and more like work. So I decided to sweep all the ugliness under a neat interface, once and for all. The result was a “refraction” module, which I later carved out of openmethods and donated to Ali Akhtarzada’s excellent bolts library. bolts attempts to fill in the gaps and bring some regularity to D’s motley set of meta-programming features. refraction’s entry point is the refract function template. It takes a function and an “anchor” string, and returns an immutable Function object that captures all the aspects of a function. Function objects can be used at compile-time. It is, actually, their raison d’être. Function has a mixture property that returns a declaration for the original function. For example: Matrix times(virtual!Matrix a, double b); pragma(msg, refract!(times, "times").mixture); // @system ReturnType!(times) times(Parameters!(times) _0); Why does refract need the anchor string? Can’t the string "times" be inferred from the function by means of __traits(identifier...)? Yes, it can, but in real applications we don’t want to use this. The whole point of the library is to be used in templates, where the function is typically passed to refract via an alias. In general, the function’s name has no meaning in the template’s scope—or if, by chance, the name exists, it does not name the function. All the meta-expressions used to dissect the function must work in terms of the local symbol that identifies the alias. Consider: module matrix; Matrix times(virtual!Matrix a, double b); Method!times timesMethod; // openmethods creates a Method object for each // declared method module openmethods; struct Method(alias fun) { enum returnTypeMixture = refract!(fun, "fun").returnType; pragma(msg, returnTypeMixture); // ReturnType!(fun) mixin("alias R = ", returnTypeMixture, ";"); // ok pragma(msg, R.stringof); // Matrix } There is no times and no Matrix in module openmethods. Even if they existed, they could not be the times function and the Matrix class from module matrix, as this would require a circular dependency between the two modules, something that D forbids by default. However, there is a fun symbol, and it aliases to the function; thus, the return type can be expressed as ReturnType!(fun). All aspects of the function are available piecemeal. For example: @nogc void scale(ref virtual!Matrix m, lazy double by); pragma(msg, refract!(scale, "scale").parameters[0].storageClasses); // ["ref"] Function also has methods that return a new Function object, with an alteration to one of the aspects. They can be used to create a variation of a function. For example: pragma(msg, refract!(scale, "scale") .withName("dispatcher") .withBody(q{{ resolve(_0[0])(_0); }}) .mixture ); @nogc @system ReturnType!(scale) dispatcher(ref Parameters!(scale)[0] _0, lazy Parameters!(scale)[1] _1) { resolve(_0[0])(_0); } This is the reason behind the name “refraction”: the module creates a blueprint of a function, performs some alterations on it, and returns a string—called a mixture—which, when passed to mixin, will create a new function. openmethods needs to change the type of the first parameter while preserving storage classes. With bolts.experimental.refraction, this becomes easy: original = refract!(scale, "scale"); pragma(msg, original .withName("dispatcher") .withParameters( [original.parameters[0].withType( "RemoveVirtual!(%s)".format(original.parameters[0].type)), original.parameters[1], ]) .withBody(q{{ return resolve(_0)(%s); }}.format(original.argumentMixture)) ); This time, the generated code splits the parameter pack into individual components: @nogc @system ReturnType!(scale) dispatcher( ref RemoveVirtual!(Parameters!(scale)[0]) _0, Parameters!(scale)[1..2] _1) { return resolve(_0)(_0); } Note how the first and second parameters are handled differently. The first parameter is cracked open because we need to replace the type. That forces us to access the first Parameters value via indexing, and that loses the storage classes, UDAs, etc. So they need to be re-applied explicitly. On the other hand, the second parameter does not have this problem. It is not edited; thus, the Parameters slice trick can be used. The lazy is indeed there, but it is inside the parameter conglomerate. Conclusion Initially, D looked almost as good as Lisp for generating functions. As we tried to gain finer control of the generated function, our code started to look a lot more like C macros; in fact, in some respects, it was even worse: we had to put an entire function definition in a string mixin just to set its name. This is due to the fact that D is not as “regular” a language as Lisp. Some of the people helming the evolution of D are working on changing this, and it is my hope that an improved D will emerge in the not-too-distant future. In the meantime, the experimental refraction module from the bolts meta-programming library offers a saner, easier way of generating functions without compromising on the idiosyncrasies that come with them. It allows you to pretend that functions can be disassembled and reassembled at will, while hiding all the gory details of the string mixins that are necessarily involved in that task. Jean-Louis Leroy is not French, but Belgian. He got his first taste of programming from a HP-25 calculator. His first real programming language was Forth, where CTFE is pervasive. Later he programmed (a little) in Lisp and Smalltalk, and (a lot) in C, C++, and Perl. He now works for Bloomberg LP in New York. His interests include object-relational mapping, open multi-methods, DSLs, and language extensions in general. The ABC’s of Templates in D D was not supposed to have templates. Several months before Walter Bright released the first alpha version of the DMD compiler in December 2001, he completed a draft language specification in which he wrote: Templates. A great idea in theory, but in practice leads to numbingly complex code to implement trivial things like a “next” pointer in a singly linked list. The (freely available) HOPL IV paper, Origins of the D Programming Language, expands on this: [Walter] initially objected to templates on the grounds that they added disproportionate complexity to the front end, they could lead to overly complex user code, and, as he had heard many C++ users complain, they had a syntax that was difficult to understand. He would later be convinced to change his mind. It did take some convincing. As activity grew in the (then singular) D newsgroup, requests that templates be added to the language became more frequent. Eventually, Walter started mulling over how to approach templates in a way that was less confusing for both the programmer and the compiler. Then he had an epiphany: the difference between template parameters and function parameters is one of compile time vs. run time. From this perspective, there’s no need to introduce a special template syntax (like the C++ style <T>) when there’s already a syntax for parameter lists in the form of (T). So a template declaration in D could look like this: template foo(T, U) { // template members here } From there, the basic features fell into place and were expanded and enhanced over time. In this article, we’re going to lay the foundation for future articles by introducing the most basic concepts and terminology of D’s template implementation. If you’ve never used templates before in any language, this may be confusing. That’s not unexpected. Even though many find D’s templates easier to understand than other implementations, the concept itself can still be confusing. You’ll find links at the end to some tutorial resources to help build a better understanding. Template declarations Inside a template declaration, one can nest any other valid D declaration: template foo(T, U) { int x; T y; struct Bar { U thing; } void doSomething(T t, U u) { ... } } In the above example, the parameters T and U are template type parameters, meaning they are generic substitutes for concrete types, which might be built-in types like int, float, or any type the programmer might implement with a class or struct declaration. By declaring a template, it’s possible to, for example, write a single implementation of a function like doSomething that can accept multiple types for the same parameters. The compiler will generate as many copies of the function as it needs to accomodate the concrete types used in each unique template instantiation. Other kinds of parameters are supported: value parameters, alias parameters, sequence (or variadic) parameters, and this parameters, all of which we’ll explore in future blog posts. One name to rule them all In practice, it’s not very common to implement templates with multiple members. By far, the most common form of template declaration is the single-member eponymous template. Consider the following: template max(T) { T max(T a, T b) { ... } } An eponymous template can have multiple members that share the template name, but when there is only one, D provides us with an alternate template declaration syntax. In this example, we can opt for a normal function declaration that has the template parameter list in front of the function parameter list: T max(T)(T a, T b) { ... } The same holds for eponymous templates that declare an aggregate type: // Instead of the longhand template declaration... /* template MyStruct(T, U) { struct MyStruct { T t; U u; } } */ // ...just declare a struct with a type parameter list struct MyStruct(T, U) { T t; U u; } Eponymous templates also offer a shortcut for instantiation, as we’ll see in the next section. Instantiating templates In relation to templates, the term instantiate means essentially the same as it does in relation to classes and structs: create an instance of something. A template instance is, essentially, a concrete implementation of the template in which the template parameters have been replaced by the arguments provided in the instantiation. For a template function, that means a new copy of the function is generated, just as if the programmer had written it. For a type declaration, a new copy of the declaration is generated, just as if the programmer had written it. We’ll see an example, but first we need to see the syntax. Explicit instantiations An explicit instantiation is a template instance created by the programmer using the template instantiation syntax. To easily disambiguate template instantiations from function calls, D requires the template instantiation operator, !, to be present in explicit instantiations. If the template has multiple members, they can be accessed in the same manner that members of aggregates are accessed: using dot notation. import std; template Temp(T, U) { T x; struct Pair { T t; U u; } } void main() { Temp!(int, float).x = 10; Temp!(int, float).Pair p; p.t = 4; p.u = 3.2; writeln(Temp!(int, float).x); writeln(p); } Run it online at run.dlang.io There is one template instantiation in this example: Temp!(int, float). Although it appears three times, it refers to the same instance of Temp, one in which T == int and U == float. The compiler will generate declarations of x and the Pair type as if the programmer had written the following: int x; struct Pair { int t; float u; } However, we can’t just refer to x and Pair by themselves. We might have other instantiations of the same template, like Temp!(double, long), or Temp(MyStruct, short). To avoid conflict, the template members must be accessed through a namespace unique to each instantiation. In that regard, Temp!(int, float) is like a class or struct with static members; just as you would access a static x member variable in a struct Foo using the struct name, Foo.x, you access a template member using the template name, Temp!(int, float).x. There is only ever one instance of the variable x for the instantiation Temp!(int float), so no matter where you use it in a code base, you will always be reading and writing the same x. Hence, the first line of main isn’t declaring and initializing the variable x, but is assigning to the already declared variable. Temp!(int, float).Pair is a struct type, so that after the declaration Temp!(int, float).Pair p, we can refer to p by itself. Unlike x, p is not a member of the template. The type Pair is a member, so we can’t refer to it without the prefix. Aliased instantiations It’s possible to simplify the syntax by using an alias declaration for the instantiation: import std; template Temp(T, U) { T x; struct Pair { T t; U u; } } alias TempIF = Temp!(int, float); void main() { TempIF.x = 10; TempIF.Pair p = TempIF.Pair(4, 3.2); writeln(TempIF.x); writeln(p); } Run it online at run.dlang.io Since we no longer need to type the template argument list, using a struct literal to initialize p, in this case TempIF.Pair(3, 3.2), looks cleaner than it would with the template arguments. So I opted to use that here rather than first declare p and then initialize its members. We can trim it down still more using D’s auto attribute, but whether this is cleaner is a matter of preference: auto p = TempIF.Pair(4, 3.2); Run it online at run.dlang.io Instantiating eponymous templates Not only do eponymous templates have a shorthand declaration syntax, they also allow for a shorthand instantiation syntax. Let’s take the x out of Temp and rename the template to Pair. We’re left with a Pair template that provides a declaration struct Pair. Then we can take advantage of both the shorthand declaration and instantiation syntaxes: import std; struct Pair(T, U) { T t; U u; } // We can instantiate Pair without the dot operator, but still use // the alias to avoid writing the argument list every time alias PairIF = Pair!(int, float); void main() { PairIF p = PairIF(4, 3.2); writeln(p); } Run it online at run.dlang.io The shorthand instantiation syntax means we don’t have to use the dot operator to access the Pair type. Even shorter syntax When a template instantiation is passed only one argument, and the argument’s symbol is a single token (e.g., int as opposed to int[] or int*), the parentheses can be dropped from the template argument list. Take the standard library template function std.conv.to as an example: void main() { import std.stdio : writeln; import std.conv : to; writeln(to!(int)("42")); } Run it online at run.dlang.io std.conv.to is an eponymous template, so we can use the shortened instantiation syntax. But the fact that we’ve instantiated it as an argument to the writeln function means we’ve got three pairs of parentheses in close proximity. That sort of thing can impair readability if it pops up too often. We could move it out and store it in a variable if we really care about it, but since we’ve only got one template argument, this is a good place to drop the parentheses from the template argument list. writeln(to!int("42")); Whether that looks better is another case where it’s up to preference, but it’s fairly idiomatic these days to drop the parentheses for a single template argument no matter where the instantiation appears. Not done with the short stuff yet std.conv.to is an interesting example because it’s an eponymous template with multiple members that share the template name. That means that it must be declared using the longform syntax (as you can see in the source code), but we can still instantiate it without the dot notation. It’s also interesting because, even though it accepts two template arguments, it is generally only instantiated with one. This is because the second template argument can be deduced by the compiler based on the function argument. For a somewhat simpler example, take a look at std.utf.toUTF8: void main() { import std.stdio : writeln; import std.utf : toUTF8; string s1 = toUTF8("This was a UTF-16 string."w); string s2 = toUTF8("This was a UTF-32 string."d); writeln(s1); writeln(s2); } Run it online at run.dlang.io Unlike std.conv.to, toUTF8 takes exactly one parameter. The signature of the declaration looks like this: string toUTF8(S)(S s) But in the example, we aren’t passing a type as a template argument. Just as the compiler was able to deduce to’s second argument, it’s able to deduce toUTF8’s sole argument. toUTF8 is an eponymous function template with a template parameter S and a function parameter s of type S. There are two things we can say about this: 1) the return type is independent of the template parameter and 2) the template parameter is the type of the function parameter. Because of 2), the compiler has all the information it needs from the function call itself and has no need for the template argument in the instantiation. Take the first call to the toUTF8 function in the declaration of s1. In long form, it would be toUTF8!(wstring)("blah"w). The w at the end of the string literal indicates it is of type wstring, with UTF-16 encoding, as opposed to string, with UTF-8 encoding (the default for string literals). In this situation, having to specify !(wstring) in the template instantiation is completely redundant. The compiler already knows that the argument to the function is a wstring. It doesn’t need the programmer to tell it that. So we can drop the template instantiation operator and the template argument list, leaving what looks like a simple function call. The compiler knows that toUTF8 is a template, knows that the template is declared with one type parameter, and knows that the type should be wstring. Similarly, the d suffix on the literal in the initialization of s2 indicates a UTF-32 encoded dstring, and the compiler knows all it needs to know for that instantiation. So in this case also, we drop the template argument and make the instantiation appear as a function call. It does seem silly to convert a wstring or dstring literal to a string when we could just drop the w and d prefixes and have string literals that we can directly assign to s1 and s2. Contrived examples and all that. But the syntax the examples are demonstrating really shines when we work with variables. wstring ws = "This is a UTF-16 string"w; string s = ws.toUTF8; writeln(s); Run it online at run.dlang.io Take a close look at the initialization of s. This combines the shorthand template instantiation syntax with Uniform Function Call Syntax (UFCS) and D’s shorthand function call syntax. We’ve already seen the template syntax in this post. As for the other two: • UFCS allows using dot notation on the first argument of any function call so that it appears as if a member function is being called. By itself, it doesn’t offer much benefit aside from, some believe, readability. Generally, it’s a matter of preference. But this feature can seriously simplify the implementation of generic templates that operate on aggregate types and built-in types. • The parentheses in a function call can be dropped when the function takes no arguments, so that foo() becomes foo. In this case, the function takes one argument, but we’ve taken it out of the argument list using UFCS, so the argument list is now empty and the parentheses can be dropped. (The compiler will lower this to a normal function call, toUTF8(ws)—it’s purely programmer convenience.) When and whether to do this in the general case is a matter of preference. The big win, again, is in the simplification of template implementations: a template can be implemented to accept a type T with a member variable foo or a member function foo, or a free function foo for which the first argument is of type T. All of this shorthand syntax is employed to great effect in D’s range API, which allows chained function calls on types that are completely hidden from the public-facing API (aka Voldemort types). More to come In future articles, we’ll explore the different kinds of template parameters, introduce template constraints, see inside a template instantiation, and take a look at some of the ways people combine templates with D’s powerful compile-time features in the real world. In the meantime, here are some template tutorial resources to keep you busy: • Ali Çehreli’s Programming in D is an excellent introduction to D in general, suitable even for those with little programming experience. The two chapters on templates (the first called ‘Templates’ and the second ‘More Templates’) provide a great introduction. (The online version of the book is free, but if you find it useful, please consider throwing Ali some love by buying the ebook or print version linked at the top of the TOC.) • More experienced programmers may find Phillipe Sigaud’s ‘D Template Tutorial’ a good read. It’s almost a decade old, but still relevant (and still free!). This tutorial goes beyond the basics into some of the more advanced template features. It includes a look at D’s compile-time features, provides a number of examples, and sports an appendix detailing the is expression (a key component of template constraints). It can also serve as a great reference when reading open source D code until you learn your way around D templates. There are other resources, though other than my book ‘Learning D’ (this is a referral link that will benefit the D Language Foundation), I’m not aware of any that provide as much detail as the above. (And my book is currently not freely available). Eventually, we’ll be able to add this blog series to the list. Thanks to Stefan Koch for reviewing this article. A Pattern for Head-mutable Structures When Andrei Alexandrescu introduced ranges to the D programming language, the gap between built-in and user-defined types (UDTs) narrowed, enabling new abstractions and greater composability. Even today, though, UDTs are still second-class citizens in D. One example of this is support for head mutability—the ability to manipulate a reference without changing the referenced value(s). This document details a pattern that will further narrow the UDT gap by introducing functions for defining and working with head-mutable user-defined types. Introduction D is neither Kernel nor Scheme—it has first-class and second-class citizens. Among its first-class citizens are arrays and pointers. One of the benefits these types enjoy is implicit conversion to head-mutable. For instance, const(T[]) is implicitly convertible to const(T)[]. Partly to address this difference, D has many ways to define how one type may convert to or behave like another – alias this, constructors, opDispatch, opCast, and, of course, subclassing. The way pointers and dynamic arrays decay into their head-mutable variants is different from the semantics of any of these features, so we would need to define a new type of conversion if we were to mimic this behavior. Changing the compiler and the language to permit yet another way of converting one type into another is not desirable: it makes the job harder for compiler writers, makes an already complex language even harder to learn, and any implicit conversion can make code harder to read and maintain. If we can define conversions to head-mutable data structures without introducing compiler or language changes, this will also make the feature available to users sooner, since such a mechanism would not necessarily require changes in the standard library, and users could gradually implement it in their own code and benefit from the code in the standard library catching up at a later point. Unqual The tool used today to get a head-mutable version of a type is std.traits.Unqual. In some cases, this is the right tool—it strips away one layer of const, immutable, inout, and shared. For some types though, it either does not give a head-mutable result, or it gives a head-mutable result with mutable indirections: struct S(T) { T[] arr; } With Unqual, this code fails to compile: void foo(T)(T a) { Unqual!T b = a; // cannot implicitly convert immutable(S!int) to S!int } unittest { immutable s = S!int([1,2,3]); foo(s); } A programmer who sees that message hopefully finds a different way to achieve the same goal. However, the error message says that the conversion failed, indicating that a conversion is possible, perhaps even without issue. An inexperienced programmer, or one who knows that doing so is safe right now, could use a cast to shut the compiler up: void bar(T)(T a) { Unqual!T b = cast(Unqual!T)a; b.arr[0] = 4; } unittest { immutable s = S!int([1,2,3]); bar(s); assert(s.arr[0] == 1); // Fails, since bar() changed it. } If, instead of S!int, the programmer had used int[], the first example would have compiled, and the cast in the second example would have never seen the light of day. However, since S!int is a user-defined type, we are forced to write a templated function that either fails to compile for some types it really should support or gives undesirable behavior at run time. headMutable() Clearly, we should be able to do better than Unqual, and in fact we can. D has template this parameters which pick up on the dynamic type of the this reference, and with that, its const or immutable status: struct S { void foo(this T)() { import std.stdio : writeln; writeln(T.stringof); } } unittest { S s1; const S s2; s1.foo(); // Prints "S". s2.foo(); // Prints "const(S)". } This way, the type has the necessary knowledge of which type qualifiers a head-mutable version needs. We can now define a method that uses this information to create the correct head-mutable type: struct S(T) { T[] arr; auto headMutable(this This)() const { import std.traits : CopyTypeQualifiers; return S!(CopyTypeQualifiers!(This, T))(arr); } } unittest { const a = S!int([1,2,3]); auto b = a.headMutable(); assert(is(typeof(b) == S!(const int))); // The correct part of the type is now const. assert(a.arr is b.arr); // It's the same array, no copying has taken place. b.arr[0] = 3; // Correctly fails to compile: cannot modify const expression. } Thanks to the magic of Uniform Function Call Syntax, we can also define headMutable() for built-in types: auto headMutable(T)(T value) { import std.traits; import std.typecons : rebindable; static if (isPointer!T) { // T is a pointer and decays naturally. return value; } else static if (isDynamicArray!T) { // T is a dynamic array and decays naturally. return value; } else static if (!hasAliasing!(Unqual!T)) { // T is a POD datatype - either a built-in type, or a struct with only POD members. return cast(Unqual!T)value; } else static if (is(T == class)) { // Classes are reference types, so only the reference may be made head-mutable. return rebindable(value); } else static if (isAssociativeArray!T) { // AAs are reference types, so only the reference may be made head-mutable. return rebindable(value); } else { static assert(false, "Type "~T.stringof~" cannot be made head-mutable."); } } unittest { const(int*[3]) a = [null, null, null]; auto b = a.headMutable(); assert(is(typeof(b) == const(int)*[3])); } Now, whenever we need a head-mutable variable to point to tail-const data, we can simply call headMutable() on the value we need to store. Unlike the ham-fisted approach of casting to Unqual!T, which may throw away important type information and also silences any error messages that may inform you of the foolishness of your actions, attempting to call headMutable() on a type that doesn’t support it will give an error message explaining what you tried to do and why it didn’t work (“Type T cannot be made head-mutable.”). The only thing missing now is a way to get the head-mutable type. Since headMutable() returns a value of that type, and is defined for all types we can convert to head-mutable, that’s a template one-liner: import std.traits : ReturnType; alias HeadMutable(T) = ReturnType!((T t) => t.headMutable()); Where Unqual returns a type with potentially the wrong semantics and only gives an error once you try assigning to it, HeadMutable disallows creating the type in the first place. The programmer will have to deal with that before casting or otherwise coercing a value into the variable. Since HeadMutable uses headMutable() to figure out the type, it also gives the same informative error message when it fails. Lastly, since one common use case requires us to preserve the tail-const or tail-immutable properties of a type, it is beneficial to define a template that converts to head-mutable while propagating const or immutable using std.traits.CopyTypeQualifiers: import std.traits : CopyTypeQualifiers; alias HeadMutable(T, ConstSource) = HeadMutable!(CopyTypeQualifiers!(ConstSource, T)); This way, immutable(MyStruct!int) can become MyStruct!(immutable int), while the const version would propagate constness instead of immutability. Example Code Since the pattern for range functions in Phobos is to have a constructor function (e.g. map) that forwards its arguments to a range type (e.g. MapResult), the code changes required to use headMutable() are rather limited. Likewise, user code should generally not need to change at all in order to use headMutable(). To give an impression of the code changes needed, I have implemented map and equal: import std.range; // Note that we check not if R is a range, but if HeadMutable!R is auto map(alias Fn, R)(R range) if (isInputRange!(HeadMutable!R)) { // Using HeadMutable!R and range.headMutable() here. // This is basically the extent to which code that uses head-mutable data types will need to change. return MapResult!(Fn, HeadMutable!R)(range.headMutable()); } struct MapResult(alias Fn, R) { R range; this(R _range) { range = _range; } void popFront() { range.popFront(); } @property auto ref front() { return Fn(range.front); } @property bool empty() { return range.empty; } static if (isBidirectionalRange!R) { @property auto ref back() { return Fn(range.back); } void popBack() { range.popBack(); } } static if (hasLength!R) { @property auto length() { return range.length; } alias opDollar = length; } static if (isRandomAccessRange!R) { auto ref opIndex(size_t idx) { return Fn(range[idx]); } } static if (isForwardRange!R) { @property auto save() { return MapResult(range.save); } } static if (hasSlicing!R) { auto opSlice(size_t from, size_t to) { return MapResult(range[from..to]); } } // All the above is as you would normally write it. // We also need to implement headMutable(). // Generally, headMutable() will look very much like this - instantiate the same // type template that defines typeof(this), use HeadMutable!(T, ConstSource) to make // the right parts const or immutable, and call headMutable() on fields as we pass // them to the head-mutable type. auto headMutable(this This)() const { alias HeadMutableMapResult = MapResult!(Fn, HeadMutable!(R, This)); return HeadMutableMapResult(range.headMutable()); } } auto equal(R1, R2)(R1 r1, R2 r2) if (isInputRange!(HeadMutable!R1) && isInputRange!(HeadMutable!R2)) { // Need to get head-mutable version of the parameters to iterate over them. auto _r1 = r1.headMutable(); auto _r2 = r2.headMutable(); while (!_r1.empty && !_r2.empty) { if (_r1.front != _r2.front) return false; _r1.popFront(); _r2.popFront(); } return _r1.empty && _r2.empty; } unittest { // User code does not use headMutable at all: const arr = [1,2,3]; const squares = arr.map!(a => a*a); const squaresPlusTwo = squares.map!(a => a+2); assert(equal(squaresPlusTwo, [3, 6, 11])); } (Note that these implementations are simplified slightly from Phobos code to better showcase the use of headMutable) The unittest block shows a use case where the current Phobos map would fail—it is perfectly possible to create a const MapResult, but there is no way of iterating over it. Note that only two functions are impacted by the addition of headMutable(): map tests if HeadMutable!R is an input range and converts its arguments to head-mutable when passing them to MapResult, and MapResult needs to implement headMutable(). The rest of the code is exactly as you would otherwise write it. The implementation of equal() shows a situation where implicit conversions would be beneficial. For const(int[]) the call to headMutable() is superfluous—it is implicitly converted to const(int)[] when passed to the function. For user-defined types however, this is not the case, so the call is necessary in the general case. While I have chosen to implement a range here, ranges are merely the most common example of a place where headmutable would be useful; the idea has merits beyond ranges. Another type in the standard library that would benefit from headmutable is RefCounted!T: const(RefCounted!(T)) should convert to RefCounted!(const(T)). Why not Tail-Const? In previous discussions of this problem, the solution has been described as tail-const, and a function tailConst() has been proposed. While this idea might at first seem the most intuitive solution, it has some problems, which together make headMutable() far superior. The main problem with tailConst() is that it does not play well with D’s existing const system. It needs to be called on a mutable value, and there is no way to convert a const(Foo!T) to Foo!(const(T)). It thus requires that the programmer explicitly call tailConst() on any value that is to be passed to a function expecting a non-mutable value and, abstain from using const or immutable to convey the same information. This creates a separate world of tail-constness and plays havoc with generic code, which consequently has no way to guarantee that it won’t mutate its arguments. Secondly, the onus is placed on library users to call tailConst() whenever they pass an argument anywhere, causing an inversion of responsibility: the user has to tell the library that it is not allowed to edit the data instead of the other way around. In the best case, this merely causes unnecessary verbiage. In other cases, the omission of const will lead to mutation of data expected to be immutable. A minor quibble in comparison is that the tail-const solution also requires the existence of tailImmutable to cover the cases where the values are immutable. Issues The ideas outlined in this document concern only conversion to head-mutable. A related issue is conversion to tail const, e.g. from RefCounted!T or RefCounted!(immutable T) to RefCounted!(const T), a conversion that, again, is implicit for arrays and pointers in D today. One issue that may be serious is the fact that headMutable often cannot be @safe and may, in fact, need to rely on undefined behavior in some places. For instance, RefCounted!T contains a pointer to the actual ref count. For immutable(RefCounted!T), headMutable() would need to cast away immutable, which is undefined behavior per the spec. The Compiler Solution It is logical to think that, as with built-in types, headMutable() could be elided in its entirety, and the compiler could handle the conversions for us. In many cases, this would be possible, and in fact the compiler already does so for POD types like struct S { int n; }—a const or immutable S may be assigned to a mutable variable of type S. This breaks down, however, when the type includes some level of mutable indirection. For templated types it would be possible to wiggle the template parameters to see if the resulting type compiles and has fields with the same offsets and similar types, but even such an intelligent solution breaks down in the presence of D’s Turing-complete template system, and some cases will always need to be handled by the implementer of a type. It is also a virtue that the logic behind such an implementation be understandable to the average D programmer. The best case result of that not being true is that the forums would be inundated with a flood of posts about why types don’t convert the way users expect them to. For these reasons, headMutable() will be necessary even with compiler support. But what would that support look like? Implicit casting to head-mutable happens in the language today in two situations: • Assignment to head-mutable variables: const(int)[] a = create!(const(int[]))(); (all POD types, pointers and arrays) • Function calls: fun(create!(const(int[]))(); (only pointers and arrays) The first is covered by existing language features (alias headMutable this; fits the bill perfectly). The second is not but is equivalent to calling .headMutable whenever a const or immutable value is passed to a function that does not explicitly expect a const or immutable argument. This would change the behavior of existing code, in that templated functions would prefer a.headMutable over a, but would greatly improve the experience of working with const types that do define headMutable(). If headMutable is correctly implemented, the different choice of template instantiations should not cause any actual breakage. Future Work While this document proposes to implement the described feature without any changes to the compiler or language, it would be possible for the compiler in the future to recognize headMutable() and call it whenever a type that defines that method is passed to a function that doesn’t explicitly take exactly that type, or upon assignment to a variable that matches headMutable()’s return value. This behavior would mirror the current behavior of pointers and arrays. Conclusion It is possible to create a framework for defining head-mutable types in D today without compiler or language changes. It requires a little more code in the methods that use head-mutable types but offers a solution to a problem that has bothered the D community for a long time. While this document deals mostly with ranges, other types will also benefit from this pattern: smart pointers and mutable graphs with immutable nodes are but two possible examples. Definitions Head-mutable A type is head-mutable if some or all of its members without indirections are mutable. Note that a head-mutable datatype may also have const or immutable members without indirections; the requirement is merely that some subset of its members may be mutated. A head-mutable datatype may be tail-const, tail-immutable or tail-mutable—head-mutable only refers to its non-indirected members. Examples of head-mutable types include const(int)[], int*, string, and Rebindable!MyClass. Types without indirections (like int, float and struct S { int n; }) are trivially head-mutable. Tail-const A type is tail-const if some of its members with indirections have the const type qualifier. A tail-const type may be head-mutable or head-const. Examples of tail-const types are const(int)*, const(int[]), const(immutable(int)[])* and string. Source The source code for HeadMutable and headMutable is available here. A Look at Chapel, D, and Julia Using Kernel Matrix Calculations Introduction It seems each time you turn around there is a new programming language aimed at solving some specific problem set. Increased proliferation of programming languages and data are deeply connected in a fundamental way, and increasing demand for “data science” computing is a related phenomenon. In the field of scientific computing, Chapel, D, and Julia are highly relevant programming languages. They arise from different needs and are aimed at different problem sets: Chapel focuses on data parallelism on single multi-core machines and large clusters; D was initially developed as a more productive and safer alternative to C++; Julia was developed for technical and scientific computing and aimed at getting the best of both worlds—the high performance and safety of static programming languages and the flexibility of dynamic programming languages. However, they all emphasize performance as a feature. In this article, we look at how their performance varies over kernel matrix calculations and present approaches to performance optimization and other usability features of the languages. Kernel matrix calculations form the basis of kernel methods in machine learning applications. They scale rather poorly—O(m n^2), where n is the number of items and m is the number of elements in each item. In our exercsie, m will be constant and we will be looking at execution time in each implementation as n increases. Here m = 784 and n = 1k, 5k, 10k, 20k, 30k, each calculation is run three times and an average is taken. We disallow any use of BLAS and only allow use of packages or modules from the standard library of each language, though in the case of D the benchmark is compared with calculations using Mir, a multidimensional array package, to make sure that my matrix implementation reflects the true performance of D. The details for the calculation of the kernel matrix and kernel functions are given here. While preparing the code for this article, the Chapel, D, and Julia communities were very helpful and patient with all inquiries, so they are acknowledged here. In terms of bias, going in I was much more familiar with D and Julia than I was with Chapel. However, getting the best performance from each language required a lot of interaction with each programming community, and I have done my best to be aware of my biases and correct for them where necessary. Language Benchmarks for Kernel Matrix Calculation The above chart (generated using R’s ggplot2 using a script) shows the performance benchmark time taken against the number of items n for Chapel, D, and Julia, for nine kernels. D performs best in five of the nine kernels, Julia performs best in two of the nine kernels, and in two of the kernels (Dot and Gaussian) the picture is mixed. Chapel was the slowest for all of the kernel functions examined. It is worth noting that the mathematics functions used in D were pulled from C’s math API made available in D through its core.stdc.math module because the mathematical functions in D’s standard library std.math can be quite slow. The math functions used are given here. By way of comparison, consider the mathdemo.d script comparing the imported C log function D’s log function from std.math: $ ldc2 -O --boundscheck=off --ffast-math --mcpu=native --boundscheck=off mathdemo.d && ./mathdemo
Time taken for c log: 0.324789 seconds.
Time taken for d log: 2.30737 seconds.

The Matrix object used in the D benchmark was implemented specifically because the use of modules outside standard language libraries was disallowed. To make sure that this implementation is competitive, i.e., it does not unfairly represent D’s performance, it is compared to Mir’s ndslice library written in D. The chart below shows matrix implementation times minus ndslice times; negative means that ndslice is slower, indicating that the implementation used here does not negatively represent D’s performance.

Environment

The code was run on a computer with an Ubuntu 20.04 OS, 32 GB memory, and an Intel® Core™ i9–8950HK CPU @ 2.90GHz with 6 cores and 12 threads.

$julia --version julia version 1.4.1 $ dmd --version
DMD64 D Compiler v2.090.1
ldc2 --version
LDC - the LLVM D compiler (1.18.0):
based on DMD v2.088.1 and LLVM 9.0.0
\$ chpl --version
chpl version 1.22.0

Compilation

Chapel:

chpl script.chpl kernelmatrix.chpl --fast && ./script

D:

ldc2 script.d kernelmatrix.d arrays.d -O5 --boundscheck=off --ffast-math -mcpu=native && ./script

Julia (no compilation required but can be run from the command line):

julia script.jl

Implementations

Efforts were made to avoid non-standard libraries while implementing these kernel functions. The reasons for this are:

• To make it easy for the reader after installing the language to copy and run the code. Having to install external libraries can be a bit of a “faff”.
• Packages outside standard libraries can go extinct, so avoiding external libraries keeps the article and code relevant.
• It’s completely transparent and shows how each language works.

Chapel

Chapel uses a forall loop to parallelize over threads. Also, C pointers to each item are used rather than the default array notation, and guided iteration over indices is used:

proc calculateKernelMatrix(K, data: [?D] ?T)
{
var n = D.dim(0).last;
var p = D.dim(1).last;
var E: domain(2) = {D.dim(0), D.dim(0)};
var mat: [E] T;
var rowPointers: [1..n] c_ptr(T) =
forall i in 1..n do c_ptrTo(data[i, 1]);

forall j in guided(1..n by -1) {
for i in j..n {
mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p);
mat[j, i] = mat[i, j];
}
}
return mat;
}

Chapel code was the most difficult to optimize for performance and required the highest number of code changes.

D

D uses a taskPool of threads from its std.parallel package to parallelize code. The D code underwent the fewest number of changes for performance optimization—a lot of the performance benefits came from the specific compiler used and the flags selected (discussed later). My implementation of a Matrix allows columns to be selected by reference via refColumnSelect.

auto calculateKernelMatrix(alias K, T)(K!(T) kernel, Matrix!(T) data)
{
long n = data.ncol;
auto mat = Matrix!(T)(n, n);

{
auto arrj = data.refColumnSelect(j).array;
foreach(long i; j..n)
{
mat[i, j] = kernel(data.refColumnSelect(i).array, arrj);
mat[j, i] = mat[i, j];
}
}
return mat;
}

Julia

The Julia code uses the @threads macro for parallelising the code and @views macro for referencing arrays. One confusing thing about Julia’s arrays is their reference status. Sometimes, as in this case, arrays will behave like value objects and they have to be referenced by using the @views macro, otherwise they generate copies. At other times they behave like reference objects, for example, when passing them into a function. It can be a little tricky dealing with this because you don’t always know what set of operations will generate a copy, but where this occurs @views provides a good solution.

The Symmetric type saves the small bit of extra work needed for allocating to both sides of the matrix.

function calculateKernelMatrix(Kernel::K, data::Array{T}) where {K <: AbstractKernel,T <: AbstractFloat}
n = size(data)[2]
mat = zeros(T, n, n)
@views for i in j:n
mat[i,j] = kernel(Kernel, data[:, i], data[:, j])
end
end
return Symmetric(mat, :L)
end

The @bounds and @simd macros in the kernel functions were used to turn bounds checking off and apply SIMD optimization to the calculations:

struct DotProduct <: AbstractKernel end
@inline function kernel(K::DotProduct, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N}
ret = zero(T)
m = length(x)
@inbounds @simd for k in 1:m
ret += x[k] * y[k]
end
return ret
end

These optimizations are quite visible but very easy to apply.

Memory Usage

The total time for each benchmark and the total memory used was captured using the /usr/bin/time -v command. The output for each of the languages is given below.

Chapel took the longest total time but consumed the least amount of memory (nearly 6GB RAM peak memory):

Command being timed: "./script"
User time (seconds): 113190.32
System time (seconds): 6.57
Percent of CPU this job got: 1196%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:37:39
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 5761116
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 1439306
Voluntary context switches: 653
Involuntary context switches: 1374820
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

D consumed the highest amount of memory (around 20GB RAM peak memory) but took less total time than Chapel to execute:

Command being timed: "./script"
User time (seconds): 106065.71
System time (seconds): 58.56
Percent of CPU this job got: 1191%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:28:29
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 20578840
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 18249033
Voluntary context switches: 3833
Involuntary context switches: 1782832
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Julia consumed a moderate amount of memory (around 7.5 GB peak memory) but ran the quickest—probably because its random number generator is the fastest:

Command being timed: "julia script.jl"
User time (seconds): 49794.85
System time (seconds): 30.58
Percent of CPU this job got: 726%
Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 7496184
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 794
Minor (reclaiming a frame) page faults: 38019472
Voluntary context switches: 2629
Involuntary context switches: 523063
Swaps: 0
File system inputs: 368360
File system outputs: 8
Socket messages sent: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

Performance optimization

The process of performance optimization in all three languages was very different, and all three communities were very helpful in the process. But there were some common themes.

• Static dispatching of kernel functions instead of using polymorphism. This means that when passing the kernel function, use parametric (static compile time) polymorphism rather than runtime (dynamic) polymorphism where dispatch with virtual functions carries a performance penalty.
• Using views/references rather than copying data over multiple threads makes a big difference.
• Parallelising the calculations makes a huge difference.
• Knowing if your array is row/column major and using that in your calculation makes a huge difference.
• Bounds checks and compiler optimizations make a tremendous difference, especially in Chapel and D.
• Enabling SIMD in D and Julia made a contribution to the performance. In D this was done using the -mcpu=native flag, and in Julia this was done using the @simd macro.

In terms of language-specific issues, getting to performant code in Chapel was the most challenging, and the Chapel code changed the most from easy-to-read array operations to using pointers and guided iterations. But on the compiler side it was relatively easy to add --fast and get a large performance boost.

The D code changed very little, and most of the performance was gained by the choice of compiler and its optimization flags. D’s LDC compiler is rich in terms of options for performance optimization. It has 8 -O optimization levels, but some are repetitions of others. For instance, -O, -O3, and -O5 are identical, and there are myriad other flags that affect performance in various ways. In this case the flags used were -O5 --boundscheck=off –ffast-math, representing aggressive compiler optimizations, bounds checking, and LLVM’s fast-math, and -mcpu=native to enable CPU vectorization instructions.

In Julia the macro changes discussed previously markedly improved the performance, but they were not too intrusive. I tried changing the optimization -O level, but this did not improve performance.

Quality of life

This section examines the relative pros and cons around the convenience and ease of use of each language. People underestimate the effort it takes to use a language day-to-day; the support and infrastructure required is significant, so it is worth comparing various facets of each language. Readers seeking to avoid the TLDR should scroll to the end of this section for the table comparing the language features discussed here. Every effort has been made to be as objective as possible, but comparing programming languages is difficult, bias prone, and contentious, so read this section with that in mind. Some elements looked at, such as arrays, are from the “data science”/technical/scientific computing point of view, and others are more general.

Interactivity

Programmers want a fast code/compile/result loop during development to quickly observe results and outputs in order to make progress or necessary changes. Julia’s interpreter is hands down the best for this and offers a smooth and feature-rich development experience, and D comes a close second. This code/compile/result loop in compilers can be slow even when compiling small code volumes. D has three compilers, the standard DMD compiler, the LLVM-based LDC compiler, and the GCC-based GDC. In this development process, the DMD and LDC compilers were used. DMD has very fast compilation times which is great for development. The LDC compiler is great at creating fast code. Chapel’s compiler is very slow in comparison. To give an example, running Linux’s time command on DMD vs Chapel’s compiler for the kernel matrix code with no optimizations gives us for D:

real	0m0.545s
user	0m0.447s
sys	0m0.101s

Compared with Chapel:

real	0m5.980s
user	0m5.787s
sys	0m0.206s

That’s a large actual and psychological difference, it can make programmers reluctant to check their work and delay the development loop if they have to wait for outputs, especially when source code increases in volume and compilation times become significant.

It is worth mentioning, however, that when developing packages in Julia, compilation times can be very long, and users have noticed that when they load some packages ,compilation times can stretch. So the experience of the development loop in Julia could vary, but in this specific case the process was seamless.

Documentation and examples

One way of comparing documentation in the different languages is to compare them all with Python’s official documentation, which is the gold standard for programming languages. It combines examples with formal definitions and tutorials in a seamless and user-friendly way. Since many programmers are familiar with the Python documentation, this approach gives an idea of how they compare.

Julia’s documentation is the closest to Python’s documentation quality and gives the user a very smooth, detailed, and relatively painless transition into the language. It also has a rich ecosystem of blogs, and topics on many aspects of the language are easy to come by. D’s official documentation is not as good and can be challenging and frustrating, however there is a very good free book “Programming in D” which is a great introduction to the language, but no single book can cover a programming language and there are not many sources for advanced topics. Chapel’s documentation is quite good for getting things done, though examples vary in presence and quality. Often, the programmer needs a lot of knowledge to look in the right place. A good topic for comparison is file I/O libraries in Chapel, D, and Julia. Chapel’s I/O library has too few examples but is relatively clear and straightforward; D’s I/O is kind of spread across a few modules, and documentation is more difficult to follow; Julia’s I/O documentation has lots of examples and is clear and easy to follow.

Perhaps one factor affecting Chapel’s adoption is lack of example—since its arrays have a non-standard interface, the user has to work hard to become familiar with them. Whereas even though D’s documentation may not be as good in places, the language has many similarities to C and C++, so it gets away with more sparse documentation.

Multi-dimensional Array support

“Arrays” here does not refer to native C and C++ style arrays available in D, but mathematical arrays. Julia and Chapel ship with array support and D does not, but it has the Mir library which has multidimensional arrays (ndslice). In the implementation of kernel matrix, I wrote my own matrix object in D, which is not difficult if you understand the principle, but it’s not something a user wants to do. However, D has a linear algebra library called Lubeck which has impressive performance characteristics and interfaces with all the usual BLAS implementations. Julia’s arrays are by far the easiest and most familiar. Chapel’s arrays are more difficult to get started with than Julia’s but are designed to be run on single-core, multi-core, and computer clusters using the same or very similar code, which is a good unique selling point.

Language power

Since Julia is a dynamic programming language, some might say, “well Julia is a dynamic language which is far more permissive than static programming languages, therefore the debate is over”, but it’s more complicated than that. There is power in static type systems. Julia has a type system similar in nature to type systems from static languages, so you can write code as if you were using a static language, but you can do things reserved only for dynamic languages. It has a highly developed generic and meta-programming syntax and powerful macros. It also has a highly flexible object system and multiple dispatch. This mix of features is what makes Julia the most powerful language of the three.

D was intended to be a replacement for C++ and takes very much after C++ (and also borrows from Java), but makes template programming and compile-time evaluation much more user-friendly than in C++. It is a single dispatch language (though multi-methods are available in a package). Instead of macros, D has string and template “mixins” which serve a similar purpose.

Chapel has generic programming support and nascent support for single dispatch OOP, no macro support, and is not yet as mature as D or Julia in these terms.

Concurrency & Parallel Programming

Nowadays, new languages tout support for concurrency and its popular subset, parallelism, but the details vary a lot between languages. Parallelism is more relevant in this example and all three languages deliver. Writing parallel for loops is straightforward in all three languages.

Chapel’s concurrency model has much more emphasis on data parallelism but has tools for task parallelism and ships with support for cluster-based concurrency.

Julia has good support for both concurrency and parallelism.

D has industry strength support for parallelism and concurrency, though its support for threading is much less well documented with examples.

Standard Library

How good is the standard library of all three languages in general? What range of tasks do they allow users to easily tend to? It’s a tough question because library quality and documentation factor in. All three languages have very good standard libraries. D has the most comprehensive standard library, but Julia is a great second, then Chapel, but things are never that simple. For example, a user seeking to write binary I/O may find Julia the easiest to start with; it has the most straightforward, clear interface and documentation, followed by Chapel, and then D. Though in my implementation of an IDX file reader, D’s I/O was the fastest, but then Julia code was easy to write for cases unavailable in the other two languages.

Package Managers & Package Ecosystems

In terms of documentation, usage, and features, D’s Dub package manager is the most comprehensive. D also has a rich package ecosystem in the Dub website, Julia’s package manager runs tightly integrated with GitHub and is a good package system with good documentation. Chapel has a package manager but does not have a highly developed package ecosystem.

C Integration

C interop is easy in all three languages; Chapel has good documentation but is not as well popularised as the others. D’s documentation is better and Julia’s documentation is the most comprehensive. Oddly enough though, none of the languages’ documentation show the commands required to compile your own C code and integrate it with the language, which is an oversight especially when it comes to novices. It is, however, easy to search for and find examples for the compilation process in D and Julia.

Community

All three languages have convenient places where users can ask questions. For Chapel, the easiest place is Gitter, for Julia it’s Discourse (though there is a Julia Gitter), and for D it’s the official website forum. The Julia community is the most active, followed by D, and then Chapel. I’ve found that you’ll get good responses from all three communities, but you’ll probably get quicker answers from the D and Julia communities.

Chapel D Julia
Compilation/Interactivty Slow Fast Best
Documentation & Examples Detailed Patchy Best
Multi-dimensional Arrays Yes Native Only
(library support)
Yes
Language Power Good Great Best
Concurrency & Parallelism Great Great Good
Standard Library Good Great Great
Package Manager & Ecosystem Nascent Best Great
C Integration Great Great Great
Community Small Vibrant Largest

Table for quality of life features in Chapel, D & Julia

Summary

If you are a novice programmer writing numerical algorithms and doing calculations based in scientific computing and want a fast language that’s easy to use, Julia is your best bet. If you are an experienced programmer working in the same space, Julia is still a great option. If you specifically want a more conventional, “industrial strength”, statically compiled, high-performance language with all the “bells and whistles”, but want something more productive, safer, and less painful than C++, then D is your best bet. You can write “anything” in D and get great performance from its compilers. If you need to get array calculations happening on clusters, then Chapel is probably the easiest place to go.

In terms of raw performance on this task, D was the winner, clearly performing better in 5 out of the 9 kernels benchmarked. This exercise reveals that Julia’s label as a high-performance language is more than just hype—it has held it’s own against highly competitive languages. It was harder than expected to get competitive performance from Chapel—it took a lot of investigation from the Chapel team to come up with the current solution. However, as the Chapel language matures we could see further improvement.

Lomuto’s Comeback

The Continental Club in Austin, Texas, USA
Sunday, January 5, 1987

“Thank you for your kind invitation, Mr. Lomuto. I will soon return to England so this is quite timely.”

“And thanks for agreeing to meeting me, Mister… Sir… Charles… A.R… Hoare. It’s a great honor. I don’t even know how to address you. Were you knighted?”

“Call me Tony, and if it’s not too much imposition please allow me to call you Nico.”

On the surface, a banal scene—two men enjoying a whiskey. However, a closer look revealed a number of intriguing details. For starters, a tension you could cut with a knife.

Dressed in a perfectly tailored four-piece suit worn with the nonchalance only an Englishman could pull off, Tony Hoare was as British as a cup of tea. His resigned grimaces as he was sipping from his glass spoke volumes about his opinion of Bourbon versus Scotch. On the other side of the small table, Nico Lomuto couldn’t have been more different: a casually dressed coder enjoying his whiskey with Coca-Cola (a matter so outrageous that Tony had decided early on to studiously pretend not to notice, as he would when confronted with ripe body odor or an offensive tattoo), in a sort of relaxed awe at the sight of the Computer Science giant he had just met.

“Listen, Tony,” Nico said as the chit chat petered off, “about that partitioning algorithm. I never meant to publish or—”

“Oh? Yes, yes, the partitioning algorithm.” Tony’s eyebrows rose with feigned surprise, as if it had escaped his mind that every paper and book on quicksort in the past five years mentioned their names together. It was obviously the one thing connecting the two men and the motivation of the meeting, but Tony, the perfect gentleman, could talk about the weather for hours with a pink elephant in the room if his conversation partner didn’t bring it up.

“Yeah, that partitioning algorithm that keeps on getting mentioned together with yours,” Nico continued. “I’m not much of an algorithms theorist. I’m working on Ada, and this entire thing about my partition scheme is a distraction. The bothersome part about it”—Nico was speaking in the forthcoming tone of a man with nothing to hide—”is that it’s not even a better algorithm. My partitioning scheme will always do the same number of comparisons and at least as many swaps as yours. In the worst case, mine does n additional swaps—n! I can’t understand why they keep on mentioning the blessed thing. It’s out of my hands now. I can’t tell them what algorithms to teach and publish. It’s like bubblesort. Whenever anyone mentions quicksort, there’s some chowderhead—or should I say bubblehead—in the audience going, yes, I also heard of the bubblesort algorithm. Makes my blood curdle.”

Nico sighed. Tony nodded. Mutual values. Rapport filled the air in between as suddenly, quietly, and pleasantly as the smell of cookies out of the oven. A few seconds went by. Jack and Coke sip. On the other side of the table, Bourbon sip, resigned grimace.

Tony spoke with the carefully chosen words of a scientist who wants to leave no hypothesis unexplored. “I understand, Nico. Yet please consider the following. Your algorithm is simple and regular, moves in only one direction, and does at most one swap per step. That may be appropriate for some future machines that…”

“No matter the machine, more swaps can’t be better than fewer swaps. It’s common sense,” Nico said, peremptorily.

“I would not be so sure. Computers do not have common sense. Computers are surprising. It stands to reason they’ll continue to be. Well, how about we enjoy this evening. Nothing like a good conversation in a quiet club.”

“Yeah. Cheers. This is a fun place. I hear they’ll have live country music soon.”

“Charming.” Somewhat to his own surprise, Tony mustered a polite smile.

Chestnut Hill, Massachusetts, USA
Present Day

I’ve carried an unconfessed addiction to the sorting problem for many years. Wasn’t that difficult to hide—to a good extent, an obsessive inclination to studying sorting is a socially tolerated déformation professionnelle; no doubt many a programmer has spent a few late nights trying yet another sorting idea that’s going to be so much better than the others. So nobody raised an eyebrow when I wrote about sorting all the way back in 2002 (ever heard about “fit pivot?” Of course you didn’t). There was no intervention organized when I wrote D’s std.sort, which turned out to be sometimes quadratic (and has been thankfully fixed since). No scorn even when I wrote an academic paper on the selection problem (sort’s cousin) as an unaffiliated outsider, which even the conference organizers said was quite a trick. And no public outrage when I spoke about sorting at CppCon 2019. Coders understand.

So, I manage. You know what they say—one day at a time. Yet I did feel a tinge of excitement when I saw the title of a recent paper: “Branch Mispredictions Don’t Affect Mergesort.” Such an intriguing title. To start with, are branch mispredictions expected to affect mergesort? I don’t have much of an idea, mainly because everybody and their cat is using quicksort, not mergesort, so the latter hasn’t really been at the center of my focus. But hey, I don’t even need to worry about it because the title resolutely asserts that that problem I didn’t know I was supposed to worry about, I don’t need to worry about after all. So in a way the title cancels itself out. Yet I did read the paper (and recommend you do the same) and among many interesting insights, there was one that caught my attention: Lomuto’s partitioning scheme was discussed as a serious contender (against the universally-used Hoare partition) from an efficiency perspective. Efficiency!

It turns out modern computing architecture does, sometimes, violate common sense.

To Partition, Perchance to Sort

Let’s first recap the two partitioning schemes. Given an array and a pivot element, to partition the array means to arrange elements of the array such that all elements smaller than or equal to the pivot are on the left, and elements greater than or equal to the pivot are on the right. The final position of the pivot would be at the border. (If there are several equivalent pivot values that final pivot position may vary, with important practical consequences; for this discussion, however, we can assume that all array values are distinct.)

Lomuto’s partitioning scheme walks the array left to right maintaining a “read” position and a “write” position, both initially at 0. For each element read, if the value seen by the “read head” is greater than the pivot, it gets skipped (with the read head moving to the next position). Otherwise, the value at the read head is swapped with that at the write head, and both heads advance by one position. When the read head is done, the position of the write head defines the partition. Refer to the nice animation below (from Wikipedia user Mastremo, used unmodified under the CC-BY-SA 3.0 license).

The problem with Lomuto’s partition is that it may do unnecessary swaps. Consider the extreme case of an array with only the leftmost element greater than the pivot. That element will be awkwardly moved to the right one position per iteration step, in a manner not unlike, um, bubblesort.

Hoare’s partitioning scheme elegantly solves that issue by iterating concomitantly from both ends of the array with two “read/write heads”. They skip elements that are already appropriately placed (less than the pivot on the left, greater than the pivot on the right), and swap only one smaller element from the left with one greater element from the right. When the two heads meet, the array is partitioned around the meet point. The extreme case described above is handled with a single swap. Most contemporary implementations of quicksort use Hoare partition, for obvious reasons: it does as many comparisons as the Lomuto partition and fewer swaps.

Given that Hoare partition clearly does less work than Lomuto partition, the question would be why ever teach or use the latter at all. Alexander Stepanov, the creator of the STL, authored a great discussion about partitioning and makes a genericity argument: Lomuto partition only needs forward iterators, whereas Hoare partition requires bidirectional iterators. That’s a valuable insight, albeit of limited practical utility: yes, you could use Lomuto’s partition on singly-linked lists, but most of the time you partition for quicksort’s sake, and you don’t want to quicksort singly-linked lists; mergesort would be the algorithm of choice.

Yet a very practical—and very surprising—argument does exist, and is the punchline of this article: implemented in a branch-free manner, Lomuto partition is a lot faster than Hoare partition on random data. Given that quicksort spends most of its time partitioning, it follows that we are looking at a hefty improvement of quicksort (yes, I am talking about industrial strength implementations for C++ and D) by replacing its partitioning algorithm with one that literally does more work.

Time to Spin Some Code

To see how the cookie crumbles, let’s take a look at a careful implementation of Hoare partition. To eliminate all extraneous details, the code in this article is written for long as the element type and uses raw pointers. It compiles and runs the same with a C++ or D compiler. This article will carry along implementations of all routines in both languages because much research literature measures algorithm performance using C++’s std::sort as an important baseline.

/**
Partition using the minimum of the first and last element as pivot.
Returns: a pointer to the final position of the pivot.
*/
long* hoare_partition(long* first, long* last) {
assert(first <= last);
if (last - first < 2)
return first; // nothing interesting to do
--last;
if (*first > *last)
swap(*first, *last);
auto pivot_pos = first;
auto pivot = *pivot_pos;
for (;;) {
++first;
auto f = *first;
while (f < pivot)
f = *++first;
auto l = *last;
while (pivot < l)
l = *--last;
if (first >= last)
break;
*first = l;
*last = f;
--last;
}
--first;
swap(*first, *pivot_pos);
return first;
}


(You may find the choice of pivot a bit odd, but not to worry: usually it’s a more sophisticated scheme—such as median-of-3—but what’s important to the core loop is that the pivot is not the largest element of the array. That allows the core loop to omit a number of limit conditions without running off array bounds.)

There are a lot of good things to say about the efficiency of this implementation (which you’re likely to find, with minor details changed, in implementations of the C++ or D standard library). You could tell the code above was written by people who live trim lives. People who keep their nails clean, show up when they say they’ll show up, and call Mom regularly. They do a wushu routine every morning and don’t let computer cycles go to waste. That code has no slack in it. The generated Intel assembly is remarkably tight and virtually identical for C++ and D. It only varies across backends, with LLVM at a slight code size advantage (see clang and ldc) over gcc (see g++ and gdc).

The initial implementation of Lomuto’s partition shown below works well for exposition, but is sloppy from an efficiency perspective:

/**
Choose the pivot as the first element, then partition.
Returns: a pointer to the final position of the pivot.
*/
long* lomuto_partition_naive(long* first, long* last) {
assert(first <= last);
if (last - first < 2)
return first; // nothing interesting to do
auto pivot_pos = first;
auto pivot = *first;
++first;
++first;
}
}
--first;
swap(*first, *pivot_pos);
return first;
}

For starters, the code above will do a lot of silly no-op swaps (array element with itself) if a bunch of elements on the left of the array are greater than the pivot. All that time first==write, so swapping *first with *write is unnecessary and wasteful. Let’s fix that with a pre-processing loop that skips the uninteresting initial portion:

/**
Partition using the minimum of the first and last element as pivot.
Returns: a pointer to the final position of the pivot.
*/
long* lomuto_partition(long* first, long* last) {
assert(first <= last);
if (last - first < 2)
return first; // nothing interesting to do
--last;
if (*first > *last)
swap(*first, *last);
auto pivot_pos = first;
auto pivot = *first;
// Prelude: position first (the write head) on the first element
// larger than the pivot.
do {
++first;
} while (*first < pivot);
assert(first <= last);
// Main course.
if (x < pivot) {
*first = x;
++first;
}
}
// Put the pivot where it belongs.
assert(*first >= pivot);
--first;
*pivot_pos = *first;
*first = pivot;
return first;
}

The function now chooses the pivot as the smallest of first and last element, just like hoare_partition. I also made another small change—instead of using the swap routine, let’s use explicit assignments. The optimizer takes care of that automatically (enregistering plus register allocation for the win), but expressing it in source helps us see the relatively expensive array reads and array writes. Now for the interesting part. Let’s focus on the core loop:

for (auto read = first + 1; read < last; ++read) {
if (x < pivot) {
*first = x;
++first;
}
}


Let’s think statistics. There are two conditionals in this loop: read < last and x < pivot. How predictable are they? Well, the first one is eminently predictable—you can reliably predict it will always be true, and you’ll only be wrong once no matter how large the array is. Compiler writers and hardware designers know this, and design the fastest path under the assumption loops will continue. (Gift idea for your Intel engineer friend: a doormat that reads “The Backward Branch Is Always Taken.”) The CPU will speculatively start executing the next iteration of the loop even before having decided whether the loop should continue. That work will be thrown away only once, at the end of the loop. That’s the magic of speculative execution.

Things are quite a bit less pleasant with the second test, x < pivot. If you assume random data and a randomly-chosen pivot, it could go either way with equal probability. That means speculative execution is not effective at all, which is very bad for efficiency. How bad? In a deeply pipelined architecture (as all are today), failed speculation means the work done by several pipeline stages needs to be thrown away, which in turn propagates a bubble of uselessness through the pipeline (think air bubbles in a garden hose). If these bubbles occur too frequently, the loop produces results at only a fraction of the attainable bandwidth. As the measurements section will show, that one wasted speculation takes away about 30% of the potential speed.

How to improve on this problem? Here’s an idea: instead of making decisions that control the flow of execution, we write the code in a straight-line manner and we incorporate the decisions as integers that guide the data flow by means of carefully chosen array indexing. Be prepared—this will force us to do silly things. For example, instead of doing two conditional writes per iteration, we’ll do exactly two writes per iteration no matter what. If the writes were not needed, we’ll overwrite words in memory with their own value. (Did I mention “silly things”?) To prepare the code for all that, let’s rewrite it as follows:

for (auto read = first + 1; read < last; ++read) {
if (x < pivot) {
*first = x;
first += 1;
} else {
*first = *first;
first += 0;
}
}


Now the two branches of the loop are almost identical save for the data. The code is still correct (albeit odd) because on the else branch it needlessly writes *read over itself and *first over itself. How do we now unify the two branches? Doing so in an efficient manner takes a bit of pondering and experimentation. Conditionally incrementing first is easy because we can always write first += x < pivot. Piece of cake. The two memory writes are more difficult, but the basic idea is to take the difference between pointers and use indexing. Here’s the code. Take a minute to think it over:

for (; read < last; ++read) {
auto smaller = -int(x < pivot);
auto delta = smaller & (read - first);
first[delta] = *first;
first -= smaller;
}


To paraphrase a famous Latin aphorism, explanatio longa, codex brevis est. Short is the code, long is the ‘splanation. The initialization of smaller with -int(x < pivot) looks odd but has a good reason: smaller can serve as both an integral (0 or -1) used with the usual arithmetic and also as a mask that is 0 or 0xFFFFFFFF (i.e. bits set all to 0 or all to 1) used with bitwise operations. We will use that mask to allow or obliterate another integral in the next line that computes delta. If x < pivotsmaller is all ones and delta gets initialized to read - first. Subsequently, delta is used in first[delta] and read[-delta], which really are syntactic sugar for *(first + delta) and *(read - delta), respectively. If we substitute delta in those expressions, we obtain *(first + (read - first)) and *(read - (read - first)), respectively.

The last line, first -= smaller, is trivial: if x < pivot, subtract -1 from first, which is the same as incrementing first. Otherwise, subtract 0 from first, effectively leaving first alone. Nicely done.

With x < pivot substituted to 1, the calculation done in the body of the loop becomes:

auto x = *read;
int smaller = -1;
auto delta = -1 & (read - first);
*(first + (read - first)) = *first;
first -= -1;


Kind of magically the two pointer expressions simplify down to *read and *first, so the two assignments effect a swap (recall that x had been just initialized with *read). Exactly what we did in the true branch of the test in the initial version!

If x < pivot is false, delta gets initialized to zero and the loop body works as follows:

auto x = *read;
int smaller = 0;
auto delta = 0 & (read - first);
*(first + 0) = *first;
first -= 0;


This time things are simpler: *first gets written over itself, *read also gets written over itself, and first is left alone. The code has no effect whatsoever, which is exactly what we wanted.

Let’s now take a look at the entire function:

long* lomuto_partition_branchfree(long* first, long* last) {
assert(first <= last);
if (last - first < 2)
return first; // nothing interesting to do
--last;
if (*first > *last)
swap(*first, *last);
auto pivot_pos = first;
auto pivot = *first;
do {
++first;
assert(first <= last);
} while (*first < pivot);
auto smaller = -int(x < pivot);
auto delta = smaller & (read - first);
first[delta] = *first;
first -= smaller;
}
assert(*first >= pivot);
--first;
*pivot_pos = *first;
*first = pivot;
return first;
}

A beaut, isn’t she? Even more beautiful is the generated code—take a look at clang/ldc and g++/gdc. Again, there is a bit of variation across backends.

Experiments and Results

All code is available at https://github.com/andralex/lomuto.

To draw a fair comparison between the two partitioning schemes, I’ve put together a quicksort implementation. This is because most often a partition would be used during quicksort. For the sake of simplification, the test implementation omits a few details present in industrial quicksort implementations, which need to worry about a variety of data shapes (partially presorted ascending or descending, with local patterns, with many duplicates, etc). Library implementations choose the pivot carefully from a sample of usually 3-9 elements, possibly with randomization, and have means to detect and avoid pathological inputs, most often by using Introsort.

In our benchmark, for simplicity, we only test against random data, and the choice of pivot is simply the minimum of first and last element. This is without loss of generality; better pivot choices and adding introspection are done the same way regardless of the partitioning method. Here, the focus is to compare the performance of Hoare vs. Lomuto vs. branch-free Lomuto.

The charts below plot the time taken by one sorting operation depending on the input size. The machine used is an Intel i7-4790 at 3600 MHz with a 256KB/1MB/8MB cache hierarchy running Ubuntu 20.04. All builds were for maximum speed (-O3, no assertions, no boundcheck for the D language). The input is a pseudorandom permutation of longs with the same seed for all languages and platforms. To eliminate noise, the minimum is taken across several epochs.

The results for the D language are shown below, including the standard library’s std.sort as a baseline.

Chart by Visualizer
Chart by Visualizer

The results for C++ are shown in the plots below. Again the standard library implementation std::sort is included as a baseline.

Chart by Visualizer
Chart by Visualizer

One important measurement is the CPU utilization efficiency, shown by Intel VTune as “the micropipe”, a diagram illustrating inefficiencies in resource utilization. VTune’s reports are very detailed but the micropipe gives a quick notion of where the work goes. To interpret a micropipe, think of it as a funnel. The narrower the exit (on the right), the slower the actual execution as a fraction of potential speed.

The micropipes shown below correspond to the Hoare partition, Lomuto partition (in the traditional implementation), and branch-free Lomuto partition. The first two throw away about 30% of all work as bad speculation. In contrast, the Lomuto branchless partition wastes no work on speculation, which allows it a better efficiency in spite of more memory writes.

Intel VTune pipe efficiency diagram for the Hoare partition. A large percentage of work is wasted on failed speculation.

Intel VTune pipe efficiency diagram for the traditional “branchy” Lomuto partition, featuring about as much failed speculation as the Hoare partition.

Intel VTune pipe efficiency diagram for the Lomuto branch-free partition. Virtually no work is wasted on failed speculation, leading to a much better efficiency.

Discussion

The four versions (two languages times two backends) exhibit slight variations due to differences in standard library implementations and backend versions. It is not surprising that minute variations in generated code are liable to create measurable differences in execution speed.

As expected, the “branchy” Lomuto partition compares poorly with Hoare partition, especially for large input sizes. Both are within the same league as the standard library implementation of the sort routine. Sorting using the branchless Lomuto partition, however, is the clear winner by a large margin regardless of platform, backend, and input size.

It has become increasingly clear during the past few years that algorithm analysis—and proposals for improvements derived from it—cannot be done solely with pen and paper using stylized computer architectures with simplistic cost models. The efficiency of sorting is determined by a lot more than counting the comparisons and swaps—at least, it seems, the predictability of comparisons must be taken into account. In the future, I am hopeful that better models of computation will allow researchers to rein in the complexity. For the time being, it seems, algorithm optimization remains hopelessly experimental.

For sorting in particular, Lomuto is definitely back and should be considered by industrial implementations of quicksort on architectures with speculative execution.

Acknowledgments

Many thanks are due to Amr Elmasry, Jyrki Katajainen, and Max Stenmark for an inspirational paper. I haven’t yet been able to engineer a mergesort implementation (the main result of their paper) that beats the best quicksort described here, but I’m working on it. (Sorry, Sorters Anonymous… I’m still off the wagon.) I’d like to thank to Michael Parker and the commentators at the end of this post for fixing many of my non-native-speaker-isms. (Why do they say “pretend not to notice” and “pretend to not notice”? I never remember the right one.) Of course, most of the credit is due to Nico Lomuto, who defined an algorithm that hasn’t just stood the test of time—it passed it.

Interfacing D with C: Arrays and Functions (Arrays Part 2)

This post is part of an ongoing series on working with both D and C in the same project. The previous post explored the differences in array declaration and initialization. This post takes the next step: declaring and calling C functions that take arrays as parameters.

Arrays and C function declarations

Using C libraries in D is extremely easy. Most of the time, things work exactly as one would expect, but as we saw in the previous article there can be subtle differences. When working with C functions that expect arrays, it’s necessary to fully understand these differences.

The most straightforward and common way of declaring a C function that accepts an array as a parameter is to to use a pointer in the parameter list. For example, this hypothetical C function:

void f0(int *arr);

In C, any array of int can be passed to this function no matter how it was declared. Given int a[], int b[3], or int *c, the function calls f0(a), f0(b), and f0(c) are all the same: a pointer to the first element of each array is passed to the function. Or using the lingo of C programmers, arrays decay to pointers

Typically, in a function like f0, the implementer will expect the array to have been terminated with a marker appropriate to the context. For example, strings in C are arrays of char that are terminated with the \0 character (we’ll look at D strings vs. C strings in a future post). This is necessary because, without that character, the implementation of f0 has no way to know which element in the array is the last one. Sometimes, a function is simply documented to expect a certain length, either in comments or in the function name, e.g., a vector3f_add(float *vec) will expect that vec points to exactly 3 elements. Another option is to require the length of the array as a separate argument:

void f1(int *arr, size_t len);

None of these approaches is foolproof. If f0 receives an array with no end marker or which is shorter than documented, or if f1 receives an array with an actual length shorter than len, then the door is open for memory corruption. D arrays take this possibility into account, making it much easier to avoid such problems. But again, even D’s safety features aren’t 100% foolproof when calling C functions from D.

There are other, less common, ways array parameters may be declared in C:

void f2(int arr[]);
void f3(int arr[9]);
void f4(int arr[static 9]);

Although these parameters are declared using C’s array syntax, they boil down to the exact same function signature as f0 because of the aforementioned pointer decay. The [9] in f3 triggers no special enforcement by the compiler; arr is still effectively a pointer to int with unknown length. The [9] serves as documentation of what the function expects, and the implementation cannot rely on the array having nine elements.

The only potential difference is in f4. The static added to the declaration tells the compiler that the function must take an array of, in this case, at least nine elements. It could have more than nine, but it can’t have fewer. That also rules out null pointers. The problem is, this isn’t necessarily enforced. Depending on which C compiler you use, if you shortchange the function and send it less than nine elements you might see warnings if they are enabled, but the compiler might not complain at all. (I haven’t tested current compilers for this article to see if any are actually reporting errors for this, or which ones provide warnings.)

The behavior of C compilers doesn’t matter from the D side. All we need be concerned with is declaring these functions appropriately so that we can call them from D such that there are no crashes or unexpected results. Because they are all effectively the same, we could declare them all in D like so:

extern(C):
void f0(int* arr);
void f1(int* arr, size_t len);
void f2(int* arr);
void f3(int* arr);
void f4(int* arr);

But just because we can do a thing doesn’t mean we should. Consider these alternative declarations of f2, f3, and f4:

extern(C):
void f2(int[] arr);
void f3(int[9] arr);
void f4(int[9] arr);

Are there any consequences of taking this approach? The answer is yes, but that doesn’t mean we should default to int* in each case. To understand why, we need first to explore the innards of D arrays.

The anatomy of a D array

The previous article showed that D makes a distinction between dynamic and static arrays:

int[] a0;
int[9] a1;

a0 is a dynamic array and a1 is a static array. Both have the properties .ptr and .length. Both may be indexed using the same syntax. But there are some key differences between them.

Dynamic arrays

Dynamic arrays are usually allocated on the heap (though that isn’t a requirement). In the above case, no memory for a0 has been allocated. It would need to be initialized with memory allocated via new or malloc, or some other allocator, or with an array literal. Because a0 is uninitialized, a0.ptr is null and a0.length is 0.

A dynamic array in D is an aggregate type that contains the two properties as members. Something like this:

struct DynamicArray {
size_t length;
size_t ptr;
}

In other words, a dynamic array is essentially a reference type, with the pointer/length pair serving as a handle that refers to the elements in the memory address contained in the ptr member. Every built-in D type has a .sizeof property, so if we take a0.sizeof, we’ll find it to be 8 on 32-bit systems, where size_t is a 4-byte uint, and 16 on 64-bit systems, where size_t is an 8-byte ulong. In short, it’s the size of the handle and not the cumulative size of the array elements.

Static arrays

Static arrays are generally allocated on the stack. In the declaration of a1, stack space is allocated for nine int values, all of which are initialized to int.init (which is 0) by default. Because a1 is initialized, a1.ptr points to the allocated space and a1.length is 9. Although these two properties are the same as those of the dynamic array, the implementation details differ.

A static array is a value type, with the value being all of its elements. So given the declaration of a1 above, its nine int elements indicate that a1.sizeof is 9 * int.sizeof, or 36. The .length property is a compile-time constant that never changes, and the .ptr property, though not readable at compile time, is also a constant that never changes (it’s not even an lvalue, which means it’s impossible to make it point somewhere else).

These implementation details are why we must pay attention when we cut and paste C array declarations into D source modules.

Passing D arrays to C

Let’s go back to the declaration of f2 in C and give it an implementation:

void f2(int arr[]) {
for(int i=0; i<3; ++i)
printf("%d\n", arr[i]);
}

A naïve declaration in D:

extern(C) void f2(int[]);

void main() {
int[] a = [10, 20, 30];
f2(a);
}

I say naïve because this is never the right answer. Compiling f2.c with df2.d on Windows (cl /c f2.c in the “x64 Native Tools” command prompt for Visual Studio, followed by dmd -m64 df2.d f2.obj), then running df2.exe, shows me the following output:

3
0
1970470928

There is no compiler error because the declaration of f2 is pefectly valid D. The extern(C) indicates that this function uses the cdecl calling convention. Calling conventions affect the way arguments are passed to functions and how the function’s symbol is mangled. In this case, the symbol will be either _f2 or f2 (other calling conventions, like stdcallextern(Windows) in D—have different mangling schemes). The declaration still has to be valid D. (In fact, any D function can be marked as extern(C), something which is necessary when creating a D library that will be called from other languages.)

There is also no linker error. DMD is calling out to the system linker (in this case, Microsoft’s link.exe), the same linker used by the system’s C and C++ compilers. That means the linker has no special knowledge about D functions. All it knows is that there is a call to a symbol, f2 or _f2, that needs to be linked with the implementation. Since the type and number of parameters are not mangled into the symbol name, the linker will happily link with any matching symbol it finds (which, by the way, is the same thing it would do if a C program tried to call a C function which was declared with an incorrect parameter list).

The C function is expecting a single pointer as an argument, but it’s instead receiving two values: the array length followed by the array pointer.

The moral of this story is that any C function with array parameters declared using array syntax, like int[], should be declared to accept pointers in D. Change the D source to the following and recompile using the same command line as before (there’s no need to recompile the C file):

extern(C) void f2(int*);

void main() {
int[] a = [10, 20, 30];
f2(a.ptr);
}

Note the use of a.ptr. It’s an error to try to pass a D array argument where a pointer is expected (with one very special exception, string literals, which I’ll cover in the next article in this series), so the array’s .ptr property must be used instead.

The story for f3 and f4 is similar:

void f3(int arr[9]);
void f4(int arr[static 9]);

Remember, int[9] in D is a static array, not a dynamic array. The following do not match the C declarations:

void f3(int[9]);
void f4(int[9]);

Try it yourself. The C implementation:

void f3(int arr[9]) {
for(int i=0; i<9; ++i)
printf("%d\n", arr[i]);
}

And the D implementation:

extern(C) void f3(int[9]);

void main() {
int[9] a = [10, 20, 30, 40, 50, 60, 70, 80, 90];
f3(a);
}

This is likely to crash, depending on the system. Rather than passing a pointer to the array, this code is instead passing all nine array elements by value! Now consider a C library that does something like this:

typedef float[16] mat4f;
void do_stuff(mat4f mat);

Generally, when writing D bindings to C libraries, it’s a good idea to keep the same interface as the C library. But if the above is translated like the following in D:

alias mat4f = float[16];
extern(C) void do_stuff(mat4f);

The sixteen floats will be passed to do_stuff every time it’s called. The same for all functions that take a mat4f parameter. One solution is just to do the same as in the int[] case and declare the function to take a pointer. However, that’s no better than C, as it allows the function to be called with an array that has fewer elements than expected. We can’t do anything about that in the int[] case, but that will usually be accompanied by a length parameter on the C side anyway. C functions that take typedef’d types like mat4f usually don’t have a length parameter and rely on the caller to get it right.

In D, we can do better:

void do_stuff(ref mat4f);

Not only does this match the API implementor’s intent, the compiler will guarantee that any arrays passed to do_stuff are static float arrays with 16 elements. Since a ref parameter is just a pointer under the hood, all is as it should be on the C side.

With that, we can rewrite the f3 example:

extern(C) void f3(ref int[9]);

void main() {
int[9] a = [10, 20, 30, 40, 50, 60, 70, 80, 90];
f3(a);
}`

Conclusion

Most of the time, when interfacing with C from D, the C API declarations and any example code can be copied verbatim in D. But most of the time is not all of the time, so care must be taken to account for those exceptional cases. As we saw in the previous article, carelessness when declaring array variables can usually be caught by the compiler. As this article shows, the same is not the case for C function declarations. Interfacing D with C requires the same care as when writing C code.

In the next article in this series, we’ll look at mixing D strings and C strings in the same program and some of the pitfalls that may arise. In the meantime, Steven Schveighoffer’s excellent article, “D Slices”, is a great place to start for more details about D arrays.

Thanks to Walter Bright and Átila Neves for their valuable feedback on this article.