Category Archives: Guest Posts

D Summer School v3

The third edition of the D Summer School, held at University POLITEHNICA of Bucharest, took place from the 5th to the 25th of July. It was three weeks of boot camping bachelor students into the basics of D across eight sessions of hands-on workshops and a hackathon. We will describe our experience in organizing the program, teaching the students, and trying to integrate them into the D community.

Recap from past editions

For the first two editions we had the following process:

  • we started advertising the summer school two months before the event;
  • we selected students from among the applicants;
  • the students had to complete a project during the summer school;
  • we helped students improve their projects during the hackathon;
  • we collected feedback at the end.

For more information, we recommend our previous article on the first edition.

DSSv3

Pre-event actions

In contrast to previous years, this time we started marketing the event very early in February, five months before the start date. We used the most influential vectors we could to promote it: the most popular professors. We nagged them ceaselessly to promote DSSv3 during their courses. The results were spectacular: we received 86 student applications (as opposed to 25 in the previous years). Since this was an online version of the summer school, we decided to drop the selection process and open the door to everyone. This had the added benefit that we didn’t have to conduct interviews with everyone and a wider range of students had a chance to be introduced to D.

To cope with the larger number of participants, we had to grow our team. Former Symmetry Autumn of Code participants Robert Aron and Adela Vais, and Symmetry employee Alexandru Militaru, agreed to join us. As such, we were able to diversify our teaching materials and raise the overall quality of the presentations.

The teaching materials were mostly the same as in the previous years; we simply reshuffled the order to have an incremental level of complexity and added a lab, “WebApp Tutorial using Vibe.d”. Besides this, we also changed the project competition. During the previous editions of DSS we found that students were not very engaged with the project assignment, so this time we came up with something different. We created a Dlang Bug Fix Competition: the top three students who had the largest number of merged PRs that solved a Bugzilla issue would win Raspberry Pi 400 kits (you may have noticed the “DSSv3” labeled PRs on our three main repositories). You may think that contributing to a programming language is a scary task, however, the truth is there are dozens of approachable, easy-to-fix issues that give instant gratification to the contributor.

With all the planning into place, we were now ready to start DSSv3.

The teaching sessions

A teaching session is comprised of an hour-long lecture and two hours of hands-on exercises. This year, we abandoned the slides in favor of tutorial-like examples. Since everything was online, we simply shared our screen and highlighted the different aspects of D in a practical way by directly compiling and running examples. This made the lessons more interactive as students enthusiastically asked “what happens if…?” questions, and we could easily demonstrate the results.

For the exercises, we followed a team-play system where students were grouped in teams of four, and they worked together to solve their tasks. This made it easier for us to organize everything on the Teams platform (we would enter rooms of four students instead of talking students individually) and it encouraged them to help each other.

From among the 86 applicants, we had an average of 35 students attending the sessions, with a record high of 56 (“Introduction to D”) and a low of 25 (“C\C++ Interoperability and Tooling”). It seems that from the first lecture to the last, almost half of the students abandon the course. This may seem like a grim figure, but note that we had students from all types of backgrounds, some of them in their first year at university. Since we are teaching subjects like memory management and multithreading, it’s only natural that some of them will be lost along the way. Regardless, the lowest number of attendees was higher than the highest number from previous years.

The hackathon was attended by eight people. Again, a low figure, but that was not a surprise. Keep in mind that the majority of the participants had never made an open-source contribution. We expected that only the best students would manage to contribute. One other factor that may have influenced the low number was our uninspired decision to organize the hackathon on a Sunday; several people noted in our feedback form that they would have participated had they not had other plans. The result of the hackathon:

  • 9 PRs submitted to Phobos: 5 were merged, 2 were closed but led to closing the associated Bugzilla entries, and 2 were rejected
  • 1 PR submitted to DRuntime that was merged
  • 4 PRs submitted to DMD: 2 were merged, 2 were rejected

We had hoped that students would submit PRs before the hackathon, but we were wrong. It seems that students should be assisted when making their first PR.

The winners of the hackathon were:

  1. danandrei279 with 3 PRs merged
  2. vladchicos with 2 PRs merged
  3. lucica28 with 1 PR merged

Feedback

We asked the students to fill out a feedback form, and we received 15 responses. It is highly possible that the results are biased since the feedback form was available at the end of the hackathon; by that time some students had already dropped out. Although it would have been great to understand their perspective, we still had valuable feedback on what went well and what could be improved.

From aggregating the results, we have the following conclusions.

  • The difficulty of DSS is perceived as being high. Those who are well prepared love it, but those who don’t have too much programming experience are lost along the way.
  • The introductory courses are much more popular than advanced ones such as “C\C++ interop” and “Multithreading in D”.
  • The hackathon was appreciated by hardcore programmers (a small percentage of the total number of attendees), but the rest were intimidated by it.
  • Students appreciated the relaxed interactive atmosphere of the sessions, with some commenting: “The general feel of the summer school was a chill evening hanging out with your friends.”

Overall, DSSv3 generated enthusiasm among the programming geeks, but we still have some work to do to make it attractive to a less savvy audience.

Future plans

Now that our team has grown, we plan to do a bigger restructuring of the course. Given the high drop-out rate, we would like to make the course welcoming for any type of CS student regardless of background or experience. To that end, we are considering creating two tracks for the course: one at a beginner level, and one for more advanced students. That way we can accommodate any type of audience.

Another aspect to think about is the hackathon. We still haven’t found the most appealing project that would motivate students to commit to it. Experience has shown that creating a project from scratch in a language you’ve just learned doesn’t really work (or we haven’t found the adequate project) and contributing to the D language may be intimidating. We are still searching for better solutions, so if you have any ideas, please contact us.

Also, right now, UPB is going through a major redesign of its curriculum. Proposals for new courses are being accepted, and we will forward this course as our choice. There’s a long wait for acceptance, but we’re keeping our fingers crossed.

Conclusions

Overall, we are happy with this year’s edition. We managed to expand our team, grow our reach, and motivate some students to contribute to the language. Even though we still have some work to do to engage less passionate students, we think we are on the right track.

See you next time at DSSv4!

Driving with D

Here is what comes to mind when I think of D: fast, expressive, easy, and… driving? That’s right, I drive with D.

Enter my venerable Holden VZ Ute daily driver. From the factory, it came with a rubbish four-speed automatic gearbox. During 18 months of ownership, I destroyed four gearboxes. I could not afford a new vehicle at the time (I’m a 20-year-old Australian computer science student at Monash University), so I had to get creative. I purchased a rock-solid, bulletproof, six-speed automatic gearbox from another car. But that’s where the solutions ended. To make it work, I had to build my own circuit board, computer system, and firmware to control the solenoids, hydraulics, and clutches inside the gearbox, handle user input, perform shifting decisions, and interface to my car by pretending to be the four-speed automatic.

I’m quite proud of my solution. It can perform a shift in 250 milliseconds, which is great for racing. It has a steep first gear, giving it a swift takeoff. It has given some more powerful cars a run for their money. It’s got flappy paddles, diagnostic data on the screen, and the ability to go ahead and change the way it works whenever I want.

Here’s a very old video of the system working. It’s not representative of the current system—that ghastly blue screen is gone, the speedo works, and shifting has improved.

The computer is split into two parts: the user interface board, which drives an OLED display and uses an STM32F042, and the mainboard, which handles everything else, utilizing an STM32F407. The two cooperate over a CAN bus (Controller Area Network). All the firmware to handle this is written in D.

I picked D (as -betterC) because of its ingenious Uniform Function Call Syntax (UFCS), design by contract, metaprogramming, ease of interfacing with C, unit testing, portability, shared, @safe, and codebase maintenance features. Another bonus is the helpful, welcoming community. It has genuinely been a joy discussing D on the forums, and with the founders and community leaders.

The Advantages of D

Uniform Function Call Syntax (UFCS)
This has made my code significantly clearer. My code can accurately follow the flow of data without polluting my stack with single-use variables, nesting many function calls, or other sorts of clutter.
Here is an example of how you could potentially use it in an ECU (Engine Control Unit):

immutable injectorTime = airStoich(100.kpa, 25.degCelsius)
    .airMass
    .fuelMass((14.7f).afr)
    .fuelMol
    .calculateInjectorWidth;

This is equivalent to:

immutable injectorTime =
    calculateInjectorWidth(
        fuelMol(
            fuelMass(
                airMass(
                    airStoich(kpa(100), degCelsius(25))
                ),
                afr(14.7)
            )
        )
    ); // brackets have been expanded for reading clarity

Please note: the values in this example are hardcoded to simplify the code and demonstrate how UFCS can give a unit of measurement to a value.

Both representations are valid D code; you can use either.

With UFCS, there’s no need to read the code backward or count your brackets, no need to use a gazillion single-use variables. Function calls mirror the flow of data. It’s concise.

Design by Contract
D’s contract programming is quite similar to Ada’s. Functions can be marked with preconditions, designated by in, and postconditions, designated by out. Should a contract fail, an assertion is thrown.

// This demonstrates D’s contract programming for a function.
// It uses the short-hand expression based syntax.
int iHaveAContract(void* ptr)
in(ptr !is null) // this is a precondition, if ptr is null, an assertion is raised
in(ptr !is null, "ptr is null :(") // this is a precondition, if ptr is null, an assertion with the error message "ptr is null :(" is raised
out(result; result > 0) // this is a post condition, it captures the return value as result and tests it
out(result; result > 0, "result was too low") // this is a post condition, it captures the return value as result, tests it, and if it fails, raise an assertion with the message "result was too low"
{
    // normal function code here
}

If you want to do something a bit more complex in your contracts, an alternative syntax is available:

int iHaveAContract(void* ptr)
in {
    assert(ptr !is null); // this is equivalent to in(ptr !is null) from above.
    MyStruct* ms = cast(MyStruct*)ptr; // we can introduce variables local to this contract
    assert(ms.blah == 2, "MyStruct.blah must be 2!"); // test ms.blah, if it fails, raise an assert with an error message
}
out(result) { // captures the return value as result
    int squareIt = result * result;
    assert(squareIt == 4);
}
do { // this designates the function body
    // normal function code here
}

void main(string[] args)
{
    auto i = iHaveAContract(null); // this will violate the contract
}

Structs and classes can include contracts, called invariants, that sanity check the state of an instance for its whole lifetime. Invariants are checked after the constructor is run, before the destructor is run, and before and after public function calls.

struct MyStruct
{
    int blah;

    invariant {
        assert(blah == 2, "blah must always be 2 for some reason!");
    }
    // shorthand:
    invariant(blah == 2, "blah must be 2 for some reason!");

    // The invariant is checked at function entry and exit. If value is anything other than 2, the invariant will fail when the function exits
    void setBlah(int value)
    {
        blah = value;
    }
}

void main(string[] args)
{
    MyStruct s; // invariant is run after construction
    s.setBlah(3); // invariant contract will be violated.
}

Metaprogramming
The Don’t Repeat Yourself (DRY) principle is often touted by programmers. D’s metaprogramming is an incredible tool to achieve that goal. I use it in my CAN bus implementation. For example:

struct CANPacket(ushort ID) {
    enum id = ID;
    ubyte[8] data;
}
alias HeartbeatPacket = CANPacket!10;
alias BeepHornPacket = CANPacket!140;

I’ve got specific aliased types as HeartbeatPacket and BeepHornPacket, but I haven’t needed to repeat any code. They all follow the same underlying structure, so if I modify CANPacket, every alias is also updated.

Maybe you want a more descriptive CAN packet? Mixin templates can help with that!

struct GenericCanPacket
{
    ushort id;
    ubyte[8] data; // 8 bytes to store the CAN packet payload
}

struct HeartbeatPacket
{
    ubyte deviceID; // first byte is the device ID
    ubyte statusID; // second byte is the status
}

To translate a GenericCanPacket to the descriptive HeartbeatPacket, we can use mixin templates.

mixin template CanPacketHelperFunctions(ushort ID)
{
    enum id = ID;

    // typeof(this) means that the return type of readFromGeneric will be the struct this template is instantiated in.
    static typeof(this) readFromGeneric(const ref GenericCanPacket p)
    in(p.id == ID, "Generic packet cannot be converted!")
    {
        // do some cast
    }
}

struct HeartbeatPacket
{
    // the stuff declared in CanPacketHelperFunctions is pretty much copy-pasted (not literally) into here
    mixin CanPacketHelperFunctions!10;

    ubyte deviceID; // first byte is the device ID
    ubyte statusID; // second byte is the status
}

void main()
{
    GenericCanPacket generic;
    generic.id = 10;
    generic.data = [ 2 /* deviceID */, 3 /* statusID */ ];

    HeartbeatPacket heartbeat = HeartbeatPacket.readFromGeneric(generic);
    writeln(HeartbeatPacket.id); // the packet ID is 10
    writeln(heartbeat.deviceID); // 2
    writeln(heratbeat.statusID); // 3
}

The mixin template CanPacketHelperFunctions can be used over and over for all sorts of packet representations, and since it is only declared once, the implementation remains consistent across all types that use it.

Interfacing to C
I frequently must communicate with my microcontroller’s HAL and RTOS; D’s C interface made that a breeze. Just add an extern(C) and it’s good to go.

extern(C) c_setPwm(int solenoid, void* userData); // declaration
c_setPwm(4, null); // usage, pretty easy!

Unit Testing
D’s built-in unit testing has saved me from blowing my foot off a few times. I can run all my unit tests on Windows to guarantee logical correctness, and then build a final target for my microcontroller. Here is an example:

struct MyStruct
{
    int x;
    int squareIt() { return x * x; }
}

unittest
{
    MyStruct s;
    s.x = 9;
    assert(s.squareIt == 9 * 9); // If for some reason the implementation breaks, then the unit test fails
}

Deprecation
Codebases can be ever-changing, and sometimes certain functionality may no longer be considered good practice but must be retained for legacy reasons. D provides a way to explicitly mark this in code. Any use of such deprecated code will trigger a deprecation warning from the compiler. Example:

deprecated struct Example
{
    // ...
}

// This deprecation includes a reason
deprecated("This is the reason for deprecation..") struct ExampleWithMessage
{
    // ...
}

void main()
{
    // This will generate the following compiler warning:
    // "Deprecation: struct `Example` is deprecated"
    Example e;

    // This will generate the following compiler warning:
    // "Deprecation: struct `ExampleWithMessage` is deprecated - This is the reason for deprecation.."
    ExampleWithMessage ewm;
}

Of course, deprecated can be applied to all sorts of things, not just structs.

Portability
Following on from above, D supports a surprisingly large number of targets via GDC and LDC. If it weren’t for D’s portability, I would have had to write my project in C++ (ugh). I use LDC, and cross-compiling can be performed by simply adjusting my command line arguments.

Shared
shared is D’s way of guarding against multi-threaded access of code. It’s not perfect, but I use it as-is, and I think it works well. I do have multiple threads in my codebase, and they need to synchronize data. I mark certain variables as shared, which means I must take special care accessing that data. It works with system locks and mutexes. While locked, I can cast shared away and use it like a normal variable. This is handy with structs and classes.

shared int sensorValue;
sensorValue = 4; // using it like a single-thread variable, error
atomicStore(sensorValue, 4); // works with atomics

SafeD
@safe exists to prohibit sketchy memory activities and enforce best behavior. I haven’t had to fight @safe much yet because I don’t do anything wicked with my memory, but it is comfortable knowing that if I am going to make a mistake, the compiler can assist me in stopping it.

Mental Friction
Adam D. Ruppe puts it succinctly: D has low mental friction. The flexibility and expressiveness of the language make it easy to translate one’s thoughts into written code and maintain productivity. I don’t have to fight D much. This is my personal opinion, but I feel like D is the language in which I’m most productive.

Final Thoughts

D is the perfect fit for this sort of project— I think it’s going to have a bright future ahead in the embedded world. I’m going to continue using D for my projects. I’ve got another D-powered automotive project in the works which I hope to show off in the future. Even if D isn’t yet suitable for your project, keep an eye on it. D has been making enormous strides in the past few years, especially in regards to memory safety.

The examples shown in this article are purely meant to demonstrate how D’s features can be used in the real world. Do not take them as gospel as to how you should program.

Function Generation in D: The Good, the Bad, the Ugly, and the Bolt

Introduction

Digital Mars D logo

A while ago, Andrei Alexandrescu started a thread in the D Programming Language forums, titled “Perfect forwarding”, about a challenge which came up during the July 2020 beerconf:

Write an idiomatic template forward that takes an alias fun and defines (generates) one overload for each overload of fun.

Several people proposed solutions. In the course of the discussion, it arose that there is sometimes a need to alter a function’s properties, like adding, removing, or hanging user-defined attributes or parameter storage classes. This was exactly the problem I struggled with while trying to support all the bells and whistles of D functions in my openmethods library.

Eventually, I created a module that helps with this problem and contributed it to the bolts meta-programming library. But before we get to that, let’s first take a closer look at function generation in D.

The Good

I speculate that many a programmer who is a moderately advanced beginner in D would quickly come up with a mostly correct solution to the “Perfect forwarding” challenge, especially those with a C++ background who have an interest in performing all sorts of magic tricks by means of template meta-programming. The solution will probably look like this:

template forward(alias fun)
{
  import std.traits: Parameters;
  static foreach (
    ovl; __traits(getOverloads, __traits(parent, fun), __traits(identifier, fun))) {
    auto forward(Parameters!ovl args)
    {
      return ovl(args);
    }
  }
}

...

int plus(int a, int b) { return a + b; }
string plus(string a, string b) { return a ~ b; }

assert(forward!plus(1, 2) == 3);        // pass
assert(forward!plus("a", "b") == "ab"); // pass

This solution is not perfect, as we shall see, but it is not far off either. It covers many cases, including some that a beginner may not even be aware of. For example, forward handles the following function without dropping function attributes or parameter storage classes:

class Matrix { ... }

Matrix times(scope const Matrix a, scope const Matrix b) pure @safe
{
  return ...;
}

pragma(msg, typeof(times));
// pure @safe Matrix(scope const(Matrix) a, scope const(Matrix) b)

pragma(msg, typeof(forward!times));
// pure @safe Matrix(scope const(Matrix) _param_0, scope const(Matrix) _param_1)

It even handles user-defined attributes (UDAs) on parameters:

struct testParameter;

void testPlus(@testParameter int a, @testParameter int b);

pragma(msg, typeof(testPlus));
// void(@(testParameter) int a, @(testParameter) int b)

pragma(msg, typeof(forward!testPlus));
// void(@(testParameter) int a, @(testParameter) int b)

Speaking of UDAs, that’s one of the issues with the solution above: it doesn’t carry function UDAs. It also doesn’t work with functions that return a reference. Both issues are easy to fix:

template forward(alias fun)
{
  import std.traits: Parameters;
  static foreach (ovl; __traits(getOverloads, __traits(parent, fun), __traits(identifier, fun)))
  {
    @(__traits(getAttributes, fun)) // copy function UDAs
    auto ref forward(Parameters!ovl args)
    {
      return ovl(args);
    }
  }
}

This solution is still not 100% correct though. If the forwardee is @trusted, the forwarder will be @safe:

@trusted void useSysCall() { ... }

pragma(msg, typeof(&useSysCall));         // void function() @trusted
pragma(msg, typeof(&forward!useSysCall)); // void function() @safe

This happens because the body of the forwarder consists of a single statement calling the useSysCall function. Since calling a trusted function is safe, the forwarder is automatically deemed safe by the compiler.

The Bad

However, Andrei’s challenge was not exactly what we discussed in the previous section. It came with a bit of pseudocode that suggested the template should not be eponymous. In other words, I believe that the exact task was to write a template that would be used like this: forward!fun.fun(...). Here is the pseudocode:

// the instantiation of forward!myfun would be (stylized):

template forward!myfun
{
    void myfun(int a, ref double b, out string c)
    {
        return myfun(a, b, c);
    }
    int myfun(in string a, inout double b)
    {
        return myfun(a, b);
    }
}

Though this looks like a small difference, if we want to implement exactly this, a complication arises. In the eponymous forward, we did not need to create a new identifier; we simply used the template name as the function name. Thus, the function name was fixed. Now we need to create a function with a name that depends on the forwardee’s name. And the only way to do this is with a string mixin.

The first time I had to do this, I tried the following:

template forward(alias fun)
{
  import std.format : format;
  import std.traits: Parameters;
  enum name = __traits(identifier, fun);
  static foreach (ovl; __traits(getOverloads, __traits(parent, fun), name)) {
    @(__traits(getAttributes, fun))
    auto ref mixin(name)(Parameters!ovl args)
    {
      return ovl(args);
    }
  }
}

This doesn’t work because a string mixin can only be used to create expressions or statements. Therefore, the solution is to simply expand the mixin to encompass the entire function definition. The token-quote operator q{} is very handy for this:

template forward(alias fun)
{
  import std.format : format;
  import std.traits: Parameters;
  enum name = __traits(identifier, fun);
  static foreach (ovl; __traits(getOverloads, __traits(parent, fun), name)) {
    mixin(q{
        @(__traits(getAttributes, fun))
          auto ref %s(Parameters!ovl args)
        {
          return ovl(args);
        }
      }.format(name));
  }
}

Though string mixins are powerful, they are essentially C macros. For many D programmers, resorting to a string mixin can feel like a defeat.

Let us now move on to a similar, yet significantly more difficult, challenge:

Write a class template that mocks an interface.

For example:

interface JsonSerializable
{
  string asJson() const;
}

void main()
{
  auto mock = new Mock!JsonSerializable();
}

Extrapolating the techniques acquired during the previous challenge, a beginner would probably try this first:

class Mock(alias Interface) : Interface
{
  import std.format : format;
  import std.traits: Parameters;
  static foreach (member; __traits(allMembers, Interface)) {
    static foreach (fun; __traits(getOverloads, Interface, member)) {
      mixin(q{
          @(__traits(getAttributes, fun))
          auto ref %s(Parameters!fun args)
          {
            // record call
            static if (!is(ReturnType!fun == void)) {
              return ReturnType!fun.init;
            }
          }
        }.format(member));
    }
  }
}

Alas, this fails to compile, throwing errors like:

Error: function `challenge.Mock!(JsonSerializable).Mock.asJson` return type
inference is not supported if may override base class function

In other words, auto cannot be used here. We have to fall back to explicitly specifying the return type:

class Mock(alias Interface) : Interface
{
  import std.format : format;
  import std.traits: Parameters, ReturnType;
  static foreach (member; __traits(allMembers, Interface)) {
    static foreach (fun; __traits(getOverloads, Interface, member)) {
      mixin(q{
          @(__traits(getAttributes, fun))
          ReturnType!fun %s(Parameters!fun args)
          {
            // record call
            static if (!is(ReturnType!fun == void)) {
              return ReturnType!fun.init;
            }
          }
        }.format(member));
    }
  }
}

This will not handle ref functions though. What about adding a ref in front of the return type, like we did in the first challenge?

// as before
          ref ReturnType!fun %s(Parameters!fun args) ...

This will fail with all the functions in the interface that do not return a reference.

The reason why everything worked almost magically in the first challenge is that we called the wrapped function inside the template. It enabled the compiler to deduce almost all of the characteristics of the original function and copy them to the forwarder function. But we have no model to copy from here. The compiler will copy some of the aspects of the function (pure, @safe, etc.) to match those of the overriden function, but not some others (ref, const, and the other modifiers).

Then, there is the issue of the function modifiers: const, immutable, shared, and static. These are yet another category of function “aspects”.

At this point, there is no other option than to analyze some of the function attributes by means of traits, and convert them to a string to be injected in the string mixin:

      mixin(q{
          @(__traits(getAttributes, fun))
          %sReturnType!fun %s(Parameters!fun args)
          {
            // record call
            static if (!is(ReturnType!fun == void)) {
              return ReturnType!fun.init;
            }
          }
        }.format(
            (functionAttributes!fun & FunctionAttribute.const_ ? "const " : "")
          ~ (functionAttributes!fun & FunctionAttribute.ref_ ? "ref " : "")
          ~ ...,
          member));
    }

If you look at the implementation of std.typecons.wrap, you will see that part of the code deals with synthesizing bits of a string mixin for the storage classes and modifiers.

The Ugly

So far, we have looked at the function storage classes, modifiers, and UDAs, but we have merely passed the parameter list as a single, monolithic block. However, sometimes we need to perform adjustments to the parameter list of the generated function. This may seem far-fetched, but it does happen. I encountered this problem in my openmethods library. During the “Perfect forwarding” discussion, it appeared that I was not the only one who wanted to do this.

I won’t delve into the details of openmethods here (see an older blog post for an overview of the module); for the purpose of this article, it suffices to say that, given a function declaration like this one:

Matrix times(virtual!Matrix a, double b);

openmethods generates this function:

Matrix dispatcher(Matrix a, double b)
{
  return resolve(a)(a, b);
}

The virtual template is a marker; it indicates which parameters should be taken into account (i.e., passed to resolve) when picking the appropriate specialization of times. Note that only a is passed to the resolve function—that is because the first parameter uses the virtual! marker and the second does not.

Bear in mind, though, that dispatcher is not allowed to use the type of the parameters directly. Inside the openmethods module, there is no Matrix type. Thus, when openmethods is handed a function declaration, it needs to synthesize a dispatcher function that refers to the declaration’s parameter types exclusively via the declaration. In other words, it needs to use the ReturnType and Parameters templates from std.traits to extract the types involved in the declaration – just like we did in the examples above.

Let’s put aside function attributes and UDAs – we already discussed those in the previous section. The obvious solution then seems to be:

ReturnType!times dispatcher(
  RemoveVirtual!(Parameters!times[0]) a, Parameters!times[1] b)
{
  return resolve(a)(a, b);
}

pragma(msg, typeof(&dispatcher)); // Matrix function(Matrix, double)

where RemoveVirtual is a simple template that peels off the virtual! marker from the type.

Does this preserve parameter storage classes and UDAs? Unfortunately, it does not:

@nogc void scale(ref virtual!Matrix m, lazy double by);

@nogc ReturnType!scale dispatcher(RemoveVirtual!(Parameters!scale[0]) a, Parameters!scale[1] b)
{
  return resolve(a)(a, b);
}

pragma(msg, typeof(&dispatcher)); // void function(Matrix a, double b)

We lost the ref on the first parameter and the lazy on the second. What happened to them?

The culprit is Parameters. This template is a wrapper around an obscure feature of the is operator used in conjunction with the __parameters type specialization. And it is quite a strange beast. We used it above to copy the parameter list of a function, as a whole, to another function, and it worked perfectly. The problem is what happens when you try to process the parameters separately. Let’s look at a few examples:

pragma(msg, Parameters!scale.stringof); // (ref virtual!(Matrix), lazy double)
pragma(msg, Parameters!scale[0].stringof); // virtual!(Matrix)
pragma(msg, Parameters!scale[1].stringof); // double

We see that accessing a parameter individually returns the type… and discards everything else!

There is actually a way to extract everything about a single parameter: use a slice instead of an element of the paramneter pack (yes, this is getting strange):

pragma(msg, Parameters!scale[0..1].stringof); // (ref virtual!(Matrix))
pragma(msg, Parameters!scale[1..2].stringof); // (lazy double)

So this gives us a solution for handling the second parameter of scale:

ReturnType!scale dispatcher(???, Parameters!scale[1..2]) { ... }

But what can we put in place of ???. RemoveVirtual!(Parameters!scale[0..1]) would not work. RemoveVirtual expects a type, and Parameters!scale[1..2] is not a type—it is a sort of conglomerate that contains a type, and perhaps storage classes, type constructors, and UDAs.

At this point, we have no other choice but to construct a string mixin once again. Something like this:

mixin(q{
    %s ReturnType!(scale) dispatcher(
      %s RemoveVirtual!(Parameters!(scale)[1]) a,
      Parameters!(scale)[1..2] b)
    {
        resolve(a)(a, b);
    }
  }.format(
    functionAttributes!scale & FunctionAttribute.nogc ? "@nogc " : ""
    /* also handle other function attributes */,
    __traits(getParameterStorageClasses, scale, 0)));

pragma(msg, typeof(dispatcher)); // @nogc void(ref double a, lazy double)

This is not quite sufficient though, because it still doesn’t take care of parameter UDAs.

To Boltly Refract…

openmethods once contained kilometers of mixin code like the above. Such heavy use of string mixins was too ugly and messy, so much so that the project began to feel less like fun and more like work. So I decided to sweep all the ugliness under a neat interface, once and for all. The result was a “refraction” module, which I later carved out of openmethods and donated to Ali Akhtarzada’s excellent bolts library. bolts attempts to fill in the gaps and bring some regularity to D’s motley set of meta-programming features.

refraction’s entry point is the refract function template. It takes a function and an “anchor” string, and returns an immutable Function object that captures all the aspects of a function. Function objects can be used at compile-time. It is, actually, their raison d’être.

Function has a mixture property that returns a declaration for the original function. For example:

Matrix times(virtual!Matrix a, double b);
pragma(msg, refract!(times, "times").mixture);
// @system ReturnType!(times) times(Parameters!(times) _0);

Why does refract need the anchor string? Can’t the string "times" be inferred from the function by means of __traits(identifier...)? Yes, it can, but in real applications we don’t want to use this. The whole point of the library is to be used in templates, where the function is typically passed to refract via an alias. In general, the function’s name has no meaning in the template’s scope—or if, by chance, the name exists, it does not name the function. All the meta-expressions used to dissect the function must work in terms of the local symbol that identifies the alias.

Consider:

module matrix;

Matrix times(virtual!Matrix a, double b);

Method!times timesMethod; // openmethods creates a `Method` object for each
                          // declared method

module openmethods;

struct Method(alias fun)
{
    enum returnTypeMixture = refract!(fun, "fun").returnType;
    pragma(msg, returnTypeMixture);              // ReturnType!(fun)
    mixin("alias R = ", returnTypeMixture, ";"); // ok
    pragma(msg, R.stringof);                     // Matrix
}

There is no times and no Matrix in module openmethods. Even if they existed, they could not be the times function and the Matrix class from module matrix, as this would require a circular dependency between the two modules, something that D forbids by default. However, there is a fun symbol, and it aliases to the function; thus, the return type can be expressed as ReturnType!(fun).

All aspects of the function are available piecemeal. For example:

@nogc void scale(ref virtual!Matrix m, lazy double by);
pragma(msg, refract!(scale, "scale").parameters[0].storageClasses); // ["ref"]

Function also has methods that return a new Function object, with an alteration to one of the aspects. They can be used to create a variation of a function. For example:

pragma(msg,
  refract!(scale, "scale")
  .withName("dispatcher")
  .withBody(q{{ resolve(_0[0])(_0); }})
  .mixture
);

@nogc @system ReturnType!(scale) dispatcher(ref Parameters!(scale)[0] _0, lazy Parameters!(scale)[1] _1)
{
  resolve(_0[0])(_0);
}

This is the reason behind the name “refraction”: the module creates a blueprint of a function, performs some alterations on it, and returns a string—called a mixture—which, when passed to mixin, will create a new function.

openmethods needs to change the type of the first parameter while preserving storage classes. With bolts.experimental.refraction, this becomes easy:

original = refract!(scale, "scale");

pragma(msg,
  original
  .withName("dispatcher")
  .withParameters(
    [original.parameters[0].withType(
        "RemoveVirtual!(%s)".format(original.parameters[0].type)),
     original.parameters[1],
    ])
  .withBody(q{{
      return resolve(_0)(%s);
   }}.format(original.argumentMixture))
);

This time, the generated code splits the parameter pack into individual components:

@nogc @system ReturnType!(scale) dispatcher(
  ref RemoveVirtual!(Parameters!(scale)[0]) _0, Parameters!(scale)[1..2] _1)
{
  return resolve(_0)(_0);
}

Note how the first and second parameters are handled differently. The first parameter is cracked open because we need to replace the type. That forces us to access the first Parameters value via indexing, and that loses the storage classes, UDAs, etc. So they need to be re-applied explicitly.

On the other hand, the second parameter does not have this problem. It is not edited; thus, the Parameters slice trick can be used. The lazy is indeed there, but it is inside the parameter conglomerate.

Conclusion

Initially, D looked almost as good as Lisp for generating functions. As we tried to gain finer control of the generated function, our code started to look a lot more like C macros; in fact, in some respects, it was even worse: we had to put an entire function definition in a string mixin just to set its name.

This is due to the fact that D is not as “regular” a language as Lisp. Some of the people helming the evolution of D are working on changing this, and it is my hope that an improved D will emerge in the not-too-distant future.

In the meantime, the experimental refraction module from the bolts meta-programming library offers a saner, easier way of generating functions without compromising on the idiosyncrasies that come with them. It allows you to pretend that functions can be disassembled and reassembled at will, while hiding all the gory details of the string mixins that are necessarily involved in that task.


Jean-Louis Leroy is not French, but Belgian. He got his first taste of programming from a HP-25 calculator. His first real programming language was Forth, where CTFE is pervasive. Later he programmed (a little) in Lisp and Smalltalk, and (a lot) in C, C++, and Perl. He now works for Bloomberg LP in New York. His interests include object-relational mapping, open multi-methods, DSLs, and language extensions in general.

A Pattern for Head-mutable Structures

When Andrei Alexandrescu introduced ranges to the D programming language, the gap between built-in and user-defined types (UDTs) narrowed, enabling new abstractions and greater composability. Even today, though, UDTs are still second-class citizens in D. One example of this is support for head mutability—the ability to manipulate a reference without changing the referenced value(s). This document details a pattern that will further narrow the UDT gap by introducing functions for defining and working with head-mutable user-defined types.

Introduction

D is neither Kernel nor Scheme—it has first-class and second-class citizens. Among its first-class citizens are arrays and pointers. One of the benefits these types enjoy is implicit conversion to head-mutable. For instance, const(T[]) is implicitly convertible to const(T)[]. Partly to address this difference, D has many ways to define how one type may convert to or behave like another – alias this, constructors, opDispatch, opCast, and, of course, subclassing. The way pointers and dynamic arrays decay into their head-mutable variants is different from the semantics of any of these features, so we would need to define a new type of conversion if we were to mimic this behavior.

Changing the compiler and the language to permit yet another way of converting one type into another is not desirable: it makes the job harder for compiler writers, makes an already complex language even harder to learn, and any implicit conversion can make code harder to read and maintain. If we can define conversions to head-mutable data structures without introducing compiler or language changes, this will also make the feature available to users sooner, since such a mechanism would not necessarily require changes in the standard library, and users could gradually implement it in their own code and benefit from the code in the standard library catching up at a later point.

Unqual

The tool used today to get a head-mutable version of a type is std.traits.Unqual. In some cases, this is the right tool—it strips away one layer of const, immutable, inout, and shared. For some types though, it either does not give a head-mutable result, or it gives a head-mutable result with mutable indirections:

struct S(T) {
    T[] arr;
}

With Unqual, this code fails to compile:

void foo(T)(T a) {
    Unqual!T b = a; // cannot implicitly convert immutable(S!int) to S!int
}

unittest {
    immutable s = S!int([1,2,3]);
    foo(s);
}

A programmer who sees that message hopefully finds a different way to achieve the same goal. However, the error message says that the conversion failed, indicating that a conversion is possible, perhaps even without issue. An inexperienced programmer, or one who knows that doing so is safe right now, could use a cast to shut the compiler up:

void bar(T)(T a) {
    Unqual!T b = cast(Unqual!T)a;
    b.arr[0] = 4;
}

unittest {
    immutable s = S!int([1,2,3]);
    bar(s);
    assert(s.arr[0] == 1); // Fails, since bar() changed it.
}

If, instead of S!int, the programmer had used int[], the first example would have compiled, and the cast in the second example would have never seen the light of day. However, since S!int is a user-defined type, we are forced to write a templated function that either fails to compile for some types it really should support or gives undesirable behavior at run time.

headMutable()

Clearly, we should be able to do better than Unqual, and in fact we can. D has template this parameters which pick up on the dynamic type of the this reference, and with that, its const or immutable status:

struct S {
    void foo(this T)() {
        import std.stdio : writeln;
        writeln(T.stringof);
    }
}
unittest {
    S s1;
    const S s2;
    s1.foo(); // Prints "S".
    s2.foo(); // Prints "const(S)".
}

This way, the type has the necessary knowledge of which type qualifiers a head-mutable version needs. We can now define a method that uses this information to create the correct head-mutable type:

struct S(T) {
    T[] arr;
    auto headMutable(this This)() const {
        import std.traits : CopyTypeQualifiers;
        return S!(CopyTypeQualifiers!(This, T))(arr);
    }
}
unittest {
    const a = S!int([1,2,3]);
    auto b = a.headMutable();
    assert(is(typeof(b) == S!(const int))); // The correct part of the type is now const.
    assert(a.arr is b.arr); // It's the same array, no copying has taken place.
    b.arr[0] = 3; // Correctly fails to compile: cannot modify const expression.
}

Thanks to the magic of Uniform Function Call Syntax, we can also define headMutable() for built-in types:

auto headMutable(T)(T value) {
    import std.traits;
    import std.typecons : rebindable;
    static if (isPointer!T) {
        // T is a pointer and decays naturally.
        return value;
    } else static if (isDynamicArray!T) {
        // T is a dynamic array and decays naturally.
        return value;
    } else static if (!hasAliasing!(Unqual!T)) {
        // T is a POD datatype - either a built-in type, or a struct with only POD members.
        return cast(Unqual!T)value;
    } else static if (is(T == class)) {
        // Classes are reference types, so only the reference may be made head-mutable.
        return rebindable(value);
    } else static if (isAssociativeArray!T) {
        // AAs are reference types, so only the reference may be made head-mutable.
        return rebindable(value);
    } else {
        static assert(false, "Type "~T.stringof~" cannot be made head-mutable.");
    }
}
unittest {
    const(int*[3]) a = [null, null, null];
    auto b = a.headMutable();
    assert(is(typeof(b) == const(int)*[3]));
}

Now, whenever we need a head-mutable variable to point to tail-const data, we can simply call headMutable() on the value we need to store. Unlike the ham-fisted approach of casting to Unqual!T, which may throw away important type information and also silences any error messages that may inform you of the foolishness of your actions, attempting to call headMutable() on a type that doesn’t support it will give an error message explaining what you tried to do and why it didn’t work (“Type T cannot be made head-mutable.”). The only thing missing now is a way to get the head-mutable type. Since headMutable() returns a value of that type, and is defined for all types we can convert to head-mutable, that’s a template one-liner:

import std.traits : ReturnType;
alias HeadMutable(T) = ReturnType!((T t) => t.headMutable());

Where Unqual returns a type with potentially the wrong semantics and only gives an error once you try assigning to it, HeadMutable disallows creating the type in the first place. The programmer will have to deal with that before casting or otherwise coercing a value into the variable. Since HeadMutable uses headMutable() to figure out the type, it also gives the same informative error message when it fails.

Lastly, since one common use case requires us to preserve the tail-const or tail-immutable properties of a type, it is beneficial to define a template that converts to head-mutable while propagating const or immutable using std.traits.CopyTypeQualifiers:

import std.traits : CopyTypeQualifiers;
alias HeadMutable(T, ConstSource) = HeadMutable!(CopyTypeQualifiers!(ConstSource, T));

This way, immutable(MyStruct!int) can become MyStruct!(immutable int), while the const version would propagate constness instead of immutability.

Example Code

Since the pattern for range functions in Phobos is to have a constructor function (e.g. map) that forwards its arguments to a range type (e.g. MapResult), the code changes required to use headMutable() are rather limited. Likewise, user code should generally not need to change at all in order to use headMutable(). To give an impression of the code changes needed, I have implemented map and equal:

import std.range;

// Note that we check not if R is a range, but if HeadMutable!R is
auto map(alias Fn, R)(R range) if (isInputRange!(HeadMutable!R)) {
    // Using HeadMutable!R and range.headMutable() here.
    // This is basically the extent to which code that uses head-mutable data types will need to change.
    return MapResult!(Fn, HeadMutable!R)(range.headMutable());
}

struct MapResult(alias Fn, R) {
    R range;
    
    this(R _range) {
        range = _range;
    }
    
    void popFront() {
        range.popFront();
    }
    
    @property
    auto ref front() {
        return Fn(range.front);
    }
    
    @property
    bool empty() {
        return range.empty;
    }
    
    static if (isBidirectionalRange!R) {
        @property
        auto ref back() {
            return Fn(range.back);
        }

        void popBack() {
            range.popBack();
        }
    }

    static if (hasLength!R) {
        @property
        auto length() {
            return range.length;
        }
        alias opDollar = length;
    }

    static if (isRandomAccessRange!R) {
        auto ref opIndex(size_t idx) {
            return Fn(range[idx]);
        }
    }

    static if (isForwardRange!R) {
        @property
        auto save() {
            return MapResult(range.save);
        }
    }
    
    static if (hasSlicing!R) {
        auto opSlice(size_t from, size_t to) {
            return MapResult(range[from..to]);
        }
    }
    
    // All the above is as you would normally write it.
    // We also need to implement headMutable().
    // Generally, headMutable() will look very much like this - instantiate the same
    // type template that defines typeof(this), use HeadMutable!(T, ConstSource) to make
    // the right parts const or immutable, and call headMutable() on fields as we pass
    // them to the head-mutable type.
    auto headMutable(this This)() const {
        alias HeadMutableMapResult = MapResult!(Fn, HeadMutable!(R, This));
        return HeadMutableMapResult(range.headMutable());
    }
}

auto equal(R1, R2)(R1 r1, R2 r2) if (isInputRange!(HeadMutable!R1) && isInputRange!(HeadMutable!R2)) {
    // Need to get head-mutable version of the parameters to iterate over them.
    auto _r1 = r1.headMutable();
    auto _r2 = r2.headMutable();
    while (!_r1.empty && !_r2.empty) {
        if (_r1.front != _r2.front) return false;
        _r1.popFront();
        _r2.popFront();
    }
    return _r1.empty && _r2.empty;
}

unittest {
    // User code does not use headMutable at all:
    const arr = [1,2,3];
    const squares = arr.map!(a => a*a);
    const squaresPlusTwo = squares.map!(a => a+2);
    assert(equal(squaresPlusTwo, [3, 6, 11]));
}

(Note that these implementations are simplified slightly from Phobos code to better showcase the use of headMutable)

The unittest block shows a use case where the current Phobos map would fail—it is perfectly possible to create a const MapResult, but there is no way of iterating over it. Note that only two functions are impacted by the addition of headMutable(): map tests if HeadMutable!R is an input range and converts its arguments to head-mutable when passing them to MapResult, and MapResult needs to implement headMutable(). The rest of the code is exactly as you would otherwise write it.

The implementation of equal() shows a situation where implicit conversions would be beneficial. For const(int[]) the call to headMutable() is superfluous—it is implicitly converted to const(int)[] when passed to the function. For user-defined types however, this is not the case, so the call is necessary in the general case.

While I have chosen to implement a range here, ranges are merely the most common example of a place where headmutable would be useful; the idea has merits beyond ranges. Another type in the standard library that would benefit from headmutable is RefCounted!T: const(RefCounted!(T)) should convert to RefCounted!(const(T)).

Why not Tail-Const?

In previous discussions of this problem, the solution has been described as tail-const, and a function tailConst() has been proposed. While this idea might at first seem the most intuitive solution, it has some problems, which together make headMutable() far superior.

The main problem with tailConst() is that it does not play well with D’s existing const system. It needs to be called on a mutable value, and there is no way to convert a const(Foo!T) to Foo!(const(T)). It thus requires that the programmer explicitly call tailConst() on any value that is to be passed to a function expecting a non-mutable value and, abstain from using const or immutable to convey the same information. This creates a separate world of tail-constness and plays havoc with generic code, which consequently has no way to guarantee that it won’t mutate its arguments.

Secondly, the onus is placed on library users to call tailConst() whenever they pass an argument anywhere, causing an inversion of responsibility: the user has to tell the library that it is not allowed to edit the data instead of the other way around. In the best case, this merely causes unnecessary verbiage. In other cases, the omission of const will lead to mutation of data expected to be immutable.

A minor quibble in comparison is that the tail-const solution also requires the existence of tailImmutable to cover the cases where the values are immutable.

Issues

The ideas outlined in this document concern only conversion to head-mutable. A related issue is conversion to tail const, e.g. from RefCounted!T or RefCounted!(immutable T) to RefCounted!(const T), a conversion that, again, is implicit for arrays and pointers in D today.

One issue that may be serious is the fact that headMutable often cannot be @safe and may, in fact, need to rely on undefined behavior in some places. For instance, RefCounted!T contains a pointer to the actual ref count. For immutable(RefCounted!T), headMutable() would need to cast away immutable, which is undefined behavior per the spec.

The Compiler Solution

It is logical to think that, as with built-in types, headMutable() could be elided in its entirety, and the compiler could handle the conversions for us. In many cases, this would be possible, and in fact the compiler already does so for POD types like struct S { int n; }—a const or immutable S may be assigned to a mutable variable of type S. This breaks down, however, when the type includes some level of mutable indirection. For templated types it would be possible to wiggle the template parameters to see if the resulting type compiles and has fields with the same offsets and similar types, but even such an intelligent solution breaks down in the presence of D’s Turing-complete template system, and some cases will always need to be handled by the implementer of a type.

It is also a virtue that the logic behind such an implementation be understandable to the average D programmer. The best case result of that not being true is that the forums would be inundated with a flood of posts about why types don’t convert the way users expect them to.

For these reasons, headMutable() will be necessary even with compiler support. But what would that support look like? Implicit casting to head-mutable happens in the language today in two situations:

  • Assignment to head-mutable variables: const(int)[] a = create!(const(int[]))(); (all POD types, pointers and arrays)
  • Function calls: fun(create!(const(int[]))(); (only pointers and arrays)

The first is covered by existing language features (alias headMutable this; fits the bill perfectly). The second is not but is equivalent to calling .headMutable whenever a const or immutable value is passed to a function that does not explicitly expect a const or immutable argument. This would change the behavior of existing code, in that templated functions would prefer a.headMutable over a, but would greatly improve the experience of working with const types that do define headMutable(). If headMutable is correctly implemented, the different choice of template instantiations should not cause any actual breakage.

Future Work

While this document proposes to implement the described feature without any changes to the compiler or language, it would be possible for the compiler in the future to recognize headMutable() and call it whenever a type that defines that method is passed to a function that doesn’t explicitly take exactly that type, or upon assignment to a variable that matches headMutable()’s return value. This behavior would mirror the current behavior of pointers and arrays.

Conclusion

It is possible to create a framework for defining head-mutable types in D today without compiler or language changes. It requires a little more code in the methods that use head-mutable types but offers a solution to a problem that has bothered the D community for a long time.

While this document deals mostly with ranges, other types will also benefit from this pattern: smart pointers and mutable graphs with immutable nodes are but two possible examples.

Definitions

Head-mutable

A type is head-mutable if some or all of its members without indirections are mutable. Note that a head-mutable datatype may also have const or immutable members without indirections; the requirement is merely that some subset of its members may be mutated. A head-mutable datatype may be tail-const, tail-immutable or tail-mutable—head-mutable only refers to its non-indirected members. Examples of head-mutable types include const(int)[], int*, string, and Rebindable!MyClass. Types without indirections (like int, float and struct S { int n; }) are trivially head-mutable.

Tail-const

A type is tail-const if some of its members with indirections have the const type qualifier. A tail-const type may be head-mutable or head-const. Examples of tail-const types are const(int)*, const(int[]), const(immutable(int)[])* and string.

Source

The source code for HeadMutable and headMutable is available here.

A Look at Chapel, D, and Julia Using Kernel Matrix Calculations

Introduction

It seems each time you turn around there is a new programming language aimed at solving some specific problem set. Increased proliferation of programming languages and data are deeply connected in a fundamental way, and increasing demand for “data science” computing is a related phenomenon. In the field of scientific computing, Chapel, D, and Julia are highly relevant programming languages. They arise from different needs and are aimed at different problem sets: Chapel focuses on data parallelism on single multi-core machines and large clusters; D was initially developed as a more productive and safer alternative to C++; Julia was developed for technical and scientific computing and aimed at getting the best of both worlds—the high performance and safety of static programming languages and the flexibility of dynamic programming languages. However, they all emphasize performance as a feature. In this article, we look at how their performance varies over kernel matrix calculations and present approaches to performance optimization and other usability features of the languages.

Kernel matrix calculations form the basis of kernel methods in machine learning applications. They scale rather poorly—O(m n^2), where n is the number of items and m is the number of elements in each item. In our exercsie, m will be constant and we will be looking at execution time in each implementation as n increases. Here m = 784 and n = 1k, 5k, 10k, 20k, 30k, each calculation is run three times and an average is taken. We disallow any use of BLAS and only allow use of packages or modules from the standard library of each language, though in the case of D the benchmark is compared with calculations using Mir, a multidimensional array package, to make sure that my matrix implementation reflects the true performance of D. The details for the calculation of the kernel matrix and kernel functions are given here.

While preparing the code for this article, the Chapel, D, and Julia communities were very helpful and patient with all inquiries, so they are acknowledged here.

In terms of bias, going in I was much more familiar with D and Julia than I was with Chapel. However, getting the best performance from each language required a lot of interaction with each programming community, and I have done my best to be aware of my biases and correct for them where necessary.

Language Benchmarks for Kernel Matrix Calculation

The above chart (generated using R’s ggplot2 using a script) shows the performance benchmark time taken against the number of items n for Chapel, D, and Julia, for nine kernels. D performs best in five of the nine kernels, Julia performs best in two of the nine kernels, and in two of the kernels (Dot and Gaussian) the picture is mixed. Chapel was the slowest for all of the kernel functions examined.

It is worth noting that the mathematics functions used in D were pulled from C’s math API made available in D through its core.stdc.math module because the mathematical functions in D’s standard library std.math can be quite slow. The math functions used are given here. By way of comparison, consider the mathdemo.d script comparing the imported C log function D’s log function from std.math:

$ ldc2 -O --boundscheck=off --ffast-math --mcpu=native --boundscheck=off mathdemo.d && ./mathdemo
Time taken for c log: 0.324789 seconds.
Time taken for d log: 2.30737 seconds.

The Matrix object used in the D benchmark was implemented specifically because the use of modules outside standard language libraries was disallowed. To make sure that this implementation is competitive, i.e., it does not unfairly represent D’s performance, it is compared to Mir’s ndslice library written in D. The chart below shows matrix implementation times minus ndslice times; negative means that ndslice is slower, indicating that the implementation used here does not negatively represent D’s performance.

Environment

The code was run on a computer with an Ubuntu 20.04 OS, 32 GB memory, and an Intel® Core™ i9–8950HK CPU @ 2.90GHz with 6 cores and 12 threads.

$ julia --version
julia version 1.4.1
$ dmd --version
DMD64 D Compiler v2.090.1
ldc2 --version
LDC - the LLVM D compiler (1.18.0):
  based on DMD v2.088.1 and LLVM 9.0.0
$ chpl --version
chpl version 1.22.0

Compilation

Chapel:

chpl script.chpl kernelmatrix.chpl --fast && ./script

D:

ldc2 script.d kernelmatrix.d arrays.d -O5 --boundscheck=off --ffast-math -mcpu=native && ./script

Julia (no compilation required but can be run from the command line):

julia script.jl

Implementations

Efforts were made to avoid non-standard libraries while implementing these kernel functions. The reasons for this are:

  • To make it easy for the reader after installing the language to copy and run the code. Having to install external libraries can be a bit of a “faff”.
  • Packages outside standard libraries can go extinct, so avoiding external libraries keeps the article and code relevant.
  • It’s completely transparent and shows how each language works.

Chapel

Chapel uses a forall loop to parallelize over threads. Also, C pointers to each item are used rather than the default array notation, and guided iteration over indices is used:

proc calculateKernelMatrix(K, data: [?D] ?T)
{
  var n = D.dim(0).last;
  var p = D.dim(1).last;
  var E: domain(2) = {D.dim(0), D.dim(0)};
  var mat: [E] T;
  var rowPointers: [1..n] c_ptr(T) =
    forall i in 1..n do c_ptrTo(data[i, 1]);

  forall j in guided(1..n by -1) {
    for i in j..n {
      mat[i, j] = K.kernel(rowPointers[i], rowPointers[j], p);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}

Chapel code was the most difficult to optimize for performance and required the highest number of code changes.

D

D uses a taskPool of threads from its std.parallel package to parallelize code. The D code underwent the fewest number of changes for performance optimization—a lot of the performance benefits came from the specific compiler used and the flags selected (discussed later). My implementation of a Matrix allows columns to be selected by reference via refColumnSelect.

auto calculateKernelMatrix(alias K, T)(K!(T) kernel, Matrix!(T) data)
{
  long n = data.ncol;
  auto mat = Matrix!(T)(n, n);

  foreach(j; taskPool.parallel(iota(n)))
  {
    auto arrj = data.refColumnSelect(j).array;
    foreach(long i; j..n)
    {
      mat[i, j] = kernel(data.refColumnSelect(i).array, arrj);
      mat[j, i] = mat[i, j];
    }
  }
  return mat;
}

Julia

The Julia code uses the @threads macro for parallelising the code and @views macro for referencing arrays. One confusing thing about Julia’s arrays is their reference status. Sometimes, as in this case, arrays will behave like value objects and they have to be referenced by using the @views macro, otherwise they generate copies. At other times they behave like reference objects, for example, when passing them into a function. It can be a little tricky dealing with this because you don’t always know what set of operations will generate a copy, but where this occurs @views provides a good solution.

The Symmetric type saves the small bit of extra work needed for allocating to both sides of the matrix.

function calculateKernelMatrix(Kernel::K, data::Array{T}) where {K <: AbstractKernel,T <: AbstractFloat}
  n = size(data)[2]
  mat = zeros(T, n, n)
  @threads for j in 1:n
      @views for i in j:n
          mat[i,j] = kernel(Kernel, data[:, i], data[:, j])
      end
  end
  return Symmetric(mat, :L)
end

The @bounds and @simd macros in the kernel functions were used to turn bounds checking off and apply SIMD optimization to the calculations:

struct DotProduct <: AbstractKernel end
@inline function kernel(K::DotProduct, x::AbstractArray{T, N}, y::AbstractArray{T, N}) where {T,N}
  ret = zero(T)
  m = length(x)
  @inbounds @simd for k in 1:m
      ret += x[k] * y[k]
  end
  return ret
end

These optimizations are quite visible but very easy to apply.

Memory Usage

The total time for each benchmark and the total memory used was captured using the /usr/bin/time -v command. The output for each of the languages is given below.

Chapel took the longest total time but consumed the least amount of memory (nearly 6GB RAM peak memory):

Command being timed: "./script"
	User time (seconds): 113190.32
	System time (seconds): 6.57
	Percent of CPU this job got: 1196%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:37:39
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 5761116
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 1439306
	Voluntary context switches: 653
	Involuntary context switches: 1374820
	Swaps: 0
	File system inputs: 0
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

D consumed the highest amount of memory (around 20GB RAM peak memory) but took less total time than Chapel to execute:

Command being timed: "./script"
	User time (seconds): 106065.71
	System time (seconds): 58.56
	Percent of CPU this job got: 1191%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 2:28:29
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 20578840
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 0
	Minor (reclaiming a frame) page faults: 18249033
	Voluntary context switches: 3833
	Involuntary context switches: 1782832
	Swaps: 0
	File system inputs: 0
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Julia consumed a moderate amount of memory (around 7.5 GB peak memory) but ran the quickest—probably because its random number generator is the fastest:

Command being timed: "julia script.jl"
	User time (seconds): 49794.85
	System time (seconds): 30.58
	Percent of CPU this job got: 726%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 1:54:18
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 7496184
	Average resident set size (kbytes): 0
	Major (requiring I/O) page faults: 794
	Minor (reclaiming a frame) page faults: 38019472
	Voluntary context switches: 2629
	Involuntary context switches: 523063
	Swaps: 0
	File system inputs: 368360
	File system outputs: 8
	Socket messages sent: 0
	Socket messages received: 0
	Signals delivered: 0
	Page size (bytes): 4096
	Exit status: 0

Performance optimization

The process of performance optimization in all three languages was very different, and all three communities were very helpful in the process. But there were some common themes.

  • Static dispatching of kernel functions instead of using polymorphism. This means that when passing the kernel function, use parametric (static compile time) polymorphism rather than runtime (dynamic) polymorphism where dispatch with virtual functions carries a performance penalty.
  • Using views/references rather than copying data over multiple threads makes a big difference.
  • Parallelising the calculations makes a huge difference.
  • Knowing if your array is row/column major and using that in your calculation makes a huge difference.
  • Bounds checks and compiler optimizations make a tremendous difference, especially in Chapel and D.
  • Enabling SIMD in D and Julia made a contribution to the performance. In D this was done using the -mcpu=native flag, and in Julia this was done using the @simd macro.

In terms of language-specific issues, getting to performant code in Chapel was the most challenging, and the Chapel code changed the most from easy-to-read array operations to using pointers and guided iterations. But on the compiler side it was relatively easy to add --fast and get a large performance boost.

The D code changed very little, and most of the performance was gained by the choice of compiler and its optimization flags. D’s LDC compiler is rich in terms of options for performance optimization. It has 8 -O optimization levels, but some are repetitions of others. For instance, -O, -O3, and -O5 are identical, and there are myriad other flags that affect performance in various ways. In this case the flags used were -O5 --boundscheck=off –ffast-math, representing aggressive compiler optimizations, bounds checking, and LLVM’s fast-math, and -mcpu=native to enable CPU vectorization instructions.

In Julia the macro changes discussed previously markedly improved the performance, but they were not too intrusive. I tried changing the optimization -O level, but this did not improve performance.

Quality of life

This section examines the relative pros and cons around the convenience and ease of use of each language. People underestimate the effort it takes to use a language day-to-day; the support and infrastructure required is significant, so it is worth comparing various facets of each language. Readers seeking to avoid the TLDR should scroll to the end of this section for the table comparing the language features discussed here. Every effort has been made to be as objective as possible, but comparing programming languages is difficult, bias prone, and contentious, so read this section with that in mind. Some elements looked at, such as arrays, are from the “data science”/technical/scientific computing point of view, and others are more general.

Interactivity

Programmers want a fast code/compile/result loop during development to quickly observe results and outputs in order to make progress or necessary changes. Julia’s interpreter is hands down the best for this and offers a smooth and feature-rich development experience, and D comes a close second. This code/compile/result loop in compilers can be slow even when compiling small code volumes. D has three compilers, the standard DMD compiler, the LLVM-based LDC compiler, and the GCC-based GDC. In this development process, the DMD and LDC compilers were used. DMD has very fast compilation times which is great for development. The LDC compiler is great at creating fast code. Chapel’s compiler is very slow in comparison. To give an example, running Linux’s time command on DMD vs Chapel’s compiler for the kernel matrix code with no optimizations gives us for D:

real	0m0.545s
user	0m0.447s
sys	0m0.101s

Compared with Chapel:

real	0m5.980s
user	0m5.787s
sys	0m0.206s

That’s a large actual and psychological difference, it can make programmers reluctant to check their work and delay the development loop if they have to wait for outputs, especially when source code increases in volume and compilation times become significant.

It is worth mentioning, however, that when developing packages in Julia, compilation times can be very long, and users have noticed that when they load some packages ,compilation times can stretch. So the experience of the development loop in Julia could vary, but in this specific case the process was seamless.

Documentation and examples

One way of comparing documentation in the different languages is to compare them all with Python’s official documentation, which is the gold standard for programming languages. It combines examples with formal definitions and tutorials in a seamless and user-friendly way. Since many programmers are familiar with the Python documentation, this approach gives an idea of how they compare.

Julia’s documentation is the closest to Python’s documentation quality and gives the user a very smooth, detailed, and relatively painless transition into the language. It also has a rich ecosystem of blogs, and topics on many aspects of the language are easy to come by. D’s official documentation is not as good and can be challenging and frustrating, however there is a very good free book “Programming in D” which is a great introduction to the language, but no single book can cover a programming language and there are not many sources for advanced topics. Chapel’s documentation is quite good for getting things done, though examples vary in presence and quality. Often, the programmer needs a lot of knowledge to look in the right place. A good topic for comparison is file I/O libraries in Chapel, D, and Julia. Chapel’s I/O library has too few examples but is relatively clear and straightforward; D’s I/O is kind of spread across a few modules, and documentation is more difficult to follow; Julia’s I/O documentation has lots of examples and is clear and easy to follow.

Perhaps one factor affecting Chapel’s adoption is lack of example—since its arrays have a non-standard interface, the user has to work hard to become familiar with them. Whereas even though D’s documentation may not be as good in places, the language has many similarities to C and C++, so it gets away with more sparse documentation.

Multi-dimensional Array support

“Arrays” here does not refer to native C and C++ style arrays available in D, but mathematical arrays. Julia and Chapel ship with array support and D does not, but it has the Mir library which has multidimensional arrays (ndslice). In the implementation of kernel matrix, I wrote my own matrix object in D, which is not difficult if you understand the principle, but it’s not something a user wants to do. However, D has a linear algebra library called Lubeck which has impressive performance characteristics and interfaces with all the usual BLAS implementations. Julia’s arrays are by far the easiest and most familiar. Chapel’s arrays are more difficult to get started with than Julia’s but are designed to be run on single-core, multi-core, and computer clusters using the same or very similar code, which is a good unique selling point.

Language power

Since Julia is a dynamic programming language, some might say, “well Julia is a dynamic language which is far more permissive than static programming languages, therefore the debate is over”, but it’s more complicated than that. There is power in static type systems. Julia has a type system similar in nature to type systems from static languages, so you can write code as if you were using a static language, but you can do things reserved only for dynamic languages. It has a highly developed generic and meta-programming syntax and powerful macros. It also has a highly flexible object system and multiple dispatch. This mix of features is what makes Julia the most powerful language of the three.

D was intended to be a replacement for C++ and takes very much after C++ (and also borrows from Java), but makes template programming and compile-time evaluation much more user-friendly than in C++. It is a single dispatch language (though multi-methods are available in a package). Instead of macros, D has string and template “mixins” which serve a similar purpose.

Chapel has generic programming support and nascent support for single dispatch OOP, no macro support, and is not yet as mature as D or Julia in these terms.

Concurrency & Parallel Programming

Nowadays, new languages tout support for concurrency and its popular subset, parallelism, but the details vary a lot between languages. Parallelism is more relevant in this example and all three languages deliver. Writing parallel for loops is straightforward in all three languages.

Chapel’s concurrency model has much more emphasis on data parallelism but has tools for task parallelism and ships with support for cluster-based concurrency.

Julia has good support for both concurrency and parallelism.

D has industry strength support for parallelism and concurrency, though its support for threading is much less well documented with examples.

Standard Library

How good is the standard library of all three languages in general? What range of tasks do they allow users to easily tend to? It’s a tough question because library quality and documentation factor in. All three languages have very good standard libraries. D has the most comprehensive standard library, but Julia is a great second, then Chapel, but things are never that simple. For example, a user seeking to write binary I/O may find Julia the easiest to start with; it has the most straightforward, clear interface and documentation, followed by Chapel, and then D. Though in my implementation of an IDX file reader, D’s I/O was the fastest, but then Julia code was easy to write for cases unavailable in the other two languages.

Package Managers & Package Ecosystems

In terms of documentation, usage, and features, D’s Dub package manager is the most comprehensive. D also has a rich package ecosystem in the Dub website, Julia’s package manager runs tightly integrated with GitHub and is a good package system with good documentation. Chapel has a package manager but does not have a highly developed package ecosystem.

C Integration

C interop is easy in all three languages; Chapel has good documentation but is not as well popularised as the others. D’s documentation is better and Julia’s documentation is the most comprehensive. Oddly enough though, none of the languages’ documentation show the commands required to compile your own C code and integrate it with the language, which is an oversight especially when it comes to novices. It is, however, easy to search for and find examples for the compilation process in D and Julia.

Community

All three languages have convenient places where users can ask questions. For Chapel, the easiest place is Gitter, for Julia it’s Discourse (though there is a Julia Gitter), and for D it’s the official website forum. The Julia community is the most active, followed by D, and then Chapel. I’ve found that you’ll get good responses from all three communities, but you’ll probably get quicker answers from the D and Julia communities.

Chapel D Julia
Compilation/Interactivty Slow Fast Best
Documentation & Examples Detailed Patchy Best
Multi-dimensional Arrays Yes Native Only
(library support)
Yes
Language Power Good Great Best
Concurrency & Parallelism Great Great Good
Standard Library Good Great Great
Package Manager & Ecosystem Nascent Best Great
C Integration Great Great Great
Community Small Vibrant Largest

Table for quality of life features in Chapel, D & Julia

Summary

If you are a novice programmer writing numerical algorithms and doing calculations based in scientific computing and want a fast language that’s easy to use, Julia is your best bet. If you are an experienced programmer working in the same space, Julia is still a great option. If you specifically want a more conventional, “industrial strength”, statically compiled, high-performance language with all the “bells and whistles”, but want something more productive, safer, and less painful than C++, then D is your best bet. You can write “anything” in D and get great performance from its compilers. If you need to get array calculations happening on clusters, then Chapel is probably the easiest place to go.

In terms of raw performance on this task, D was the winner, clearly performing better in 5 out of the 9 kernels benchmarked. This exercise reveals that Julia’s label as a high-performance language is more than just hype—it has held it’s own against highly competitive languages. It was harder than expected to get competitive performance from Chapel—it took a lot of investigation from the Chapel team to come up with the current solution. However, as the Chapel language matures we could see further improvement.

DustMite: The General-Purpose Data Reduction Tool

If you’ve been around for a while, or are a particularly adventurous developer who enjoys mixing language features in interesting ways, you may have run into one compiler bug or two:

Implementation bugs are inevitably a part of using cutting-edge programming languages. Should you run into one, the steps to proceed are generally as follows:

  1. Reduce the failing program to a minimal, self-contained example.
  2. Add a description of what happens and what you expect to happen.
  3. Post it on the bug tracker.

Nine years ago, an observation was made that when filing and fixing compiler bugs, a disproportionate amount of time was spent on the first step. When your program stops compiling “out of the blue”, or when the bug stops reproducing after the code is taken out of its context, manually paring down a large codebase by repeatedly cutting out code and checking if the bug still occurs becomes a tedious and repetitive task.

Fortunately, tedious and repetitive tasks are what computers are good for; they just have to be tricked into doing them, usually by writing a program. Enter DustMite.


The first version.

The basic operation is simple. The tool takes as inputs:

  • a data set to reduce (such as, a directory containing D source code which exhibits a particular compiler bug)
  • an oracle (or, more mundanely, a test script), which itself:
  • takes as input a variation of the data set, and
  • produces a yes-or-no answer on whether the input still satisfies the sought property (such as reproducing the particular compiler bug).

DustMite’s output is some local minimum variation of the data set, which it reaches by consecutively trying to remove parts of the data set and saving the results which the oracle approves. In the compiler bug example, this means removing bits of code which are not required to reproduce the problem at hand.

DustMite wouldn’t be very efficient if it attempted to remove things line-by-line or character-by-character. In order to maximize the chance of finding good reductions, the input is parsed into a tree according to the syntax of the input files.

Each tree node consists of a “head” (string), children (list of node pointers), and “tail” (string). Technically, it is redundant to have both “head” and “tail”, but they make representing some constructs and performing reductions much simpler, such as paren/bracket pairs.


Nodes are arranged into a binary tree as an optimization.

Additionally, nodes may have a list of dependencies. The dependency relationship essentially means “if this node is removed, these nodes should be removed too”. These constraints are not representable using just the tree structure described above, and are used to allow reducing things such as lists where trailing item delimiters are not allowed, or removing a function parameter and corresponding arguments from the entire code base at once.

In the case of D source code, declarations, statements, and subexpressions get their own tree nodes, so that they can be removed in one go if unneeded. The parser DustMite uses for D source code is intentionally very simple because it needs to handle potentially invalid D code, and you don’t want your bug reduction tool to also crash on top of the compiler.


How DustMite sees a simple D program.

An algorithm decides the order in which nodes are queued for potential deletion; DustMite implements several (it calls them “strategies”). Fundamentally, a strategy’s interface is (statei, resulti) ⇒ (statei+1, reductioni+1), i.e., which reduction is chosen next depends on the previous reduction and its result. The default “inbreadth” strategy visits nodes in ascending depth order (row by row) and starts over from the top as long as it finds new reductions.

DustMite today supports quite a few more options:


The current version.

Probably, the most interesting of these is the -j switch—one reason being that DustMite’s task is inherently not parallelizable. Which reduction is chosen next, and the tree version to which that reduction is applied, depends on the previous reduction’s result.

DustMite works around this by putting unused CPU cores to work on lookahead: using a highly sophisticated predictor, it guesses what the result of the current reduction will be, and based on that assumption, calculates the next reduction. If the guess was right, great! We get to use that result. Otherwise, the work is wasted. Implementing this meant that strategies now needed to have copyable state, and so had to be refactored from an imperative style to a state machine.

Unfortunately, although the highly expected feature was implemented four years ago, the initial implementation was rather underwhelming. DustMite still did too much work in the main thread and wasted too much CPU time on rescanning the data set tree on every reduction. The problem was so bad that, at high core counts, lookahead mode was even slower than single-threaded mode.

I have recently set out to resolve these inadequacies. The following obstacles stood in the way:

Problem 1: Hashing was too slow. Because the oracle’s services (i.e., running the test script) are usually expensive, DustMite keeps track of a cache of previously attempted reductions and their outcome. This helps because not all tree transformations result in a change of output, and some strategies will retry reductions in successive iterations. A hash of the tree is used as the cache key; however, calculating it requires walking the entire tree every time, which is slow for large inputs.

Would it be possible to make the hash calculation incremental? One approach would be Merkle trees (each node’s hash is the hash of its children’s hashes), however that is suboptimal in the case of e.g., empty leaf nodes. CS erudite Ivan Kazmenko blesses us with an answer: polynomial hashes! By representing strings as polynomials, it is possible to use modulo arithmetic to calculate an incremental fixed-size hash and cache subtree hashes per node.



Each node holds its cumulative hash and length.

The number theory staggered me at first, so I recruited the assistance of feep from #d. After we went through a few draft implementations, I could begin working on the final version. The first improvement was replacing the naive exponentiation algorithm with exponentiation by squaring (D CTFE allowed precomputing a table at compile-time and a faster calculation than the classical method). Next, there was the matter of the modulo.

Initially, we used integer overflow for modulo arithmetic (i.e. q=264), however Ivan cautioned against using powers of two as the modulo, as this makes the algorithm susceptible to Thue-Morse strings. Not long ago I was experimenting with using long multiplication/division CPU instructions (where multiplying one machine word by another yields the result in two machine words with a high and low part, and vice-versa for division). D allows generating assembler code specific to the types that the function template is instantiated with, though in DustMite we only use the unsigned 64-bit variant (on x86 we fall back to using integer overflow).

With the hashing algorithm implemented, all that remained was to mark dirty nodes (they or their children had their content edited) and incrementally recalculate their hashes as needed. Dependencies posed a small obstacle: at the time, they were implemented as simply an array of pointers to the dependency node within the tree. As such, we didn’t know how to get to their parents (to mark them dirty as well), however this was easily overcome by adding a “parent” pointer to each node.

Well, or so I thought, until I got to work on the next problem.

Problem 2: Copying the tree. At the time, the current version of the tree representing the data set was part of the global state. Because of this, applying a reduction was implemented twice:

This was clumsy, but faster and less complicated than making a copy of the entire tree just to change one part of it to test a reduction. However, doing so was a requirement for proper lookahead, otherwise we would be unable to test reductions based on results where past tests predicted a positive outcome, or do nearly anything in a separate thread.

One issue was the tree “internal pointers”—making a copy would require updating all pointers within the tree to point to the new copies in the new tree. This was easy for children/parent pointers (since we can reliably visit every such pointer exactly once), but not quite for dependencies: because they were also implemented as simple pointers to nodes, we would have to keep track of a map of which node was copied where in order to update the dependency pointers.

One way to solve this would be to change the representation of node references from pointers to indices into a node array; this way, copying the tree would be as simple as a .dup. However, large inputs meant many nodes, and I wanted to see if it was possible to avoid iterating over every node in the tree (i.e. O(n)) for every reduction.

Was it possible? It would mean that we would copy only the modified nodes and their parents, leaving the rest of the tree in-place, and only reusing it as the copies’ children. This goal conflicted with the existence of “parent” pointers, because a parent would have to point towards either the old or new root, so to resolve this ambiguity every node would have to be copied. As a result, the way we handled dependencies needed to be rethought.


Editing trees with “copy on write” involves copying just the edited nodes (🔴), and their parents.

With internal pointers out, the next best thing to array indices for referencing a node was a series of instructions for how to reach the node from the tree root: an address. The representation of these addresses that I chose was a bit string represented as a linked list, where each list node holds the child index at that depth, starting from the deep end. Such a representation can be arranged in a tree where the root-side ends are shared, mimicking the structure of the tree containing the nodes for the reduced data, and thus allowing us to reuse memory and minimize allocations.


Nodes cannot hold their own address (as that would make them unmovable),
which is why they need to be stored outside of the main tree.

For addresses to work, the object they point at needs to remain the same, which means that we can no longer simply remove children from tree nodes—an address going through the second child would become invalid if the first child was removed. Rewriting all affected addresses for every tree edit is, of course, impractical, which leads us to the introduction of tombstones—dead nodes that only serve to preserve the index of the children that follow it. Because one of the possible reduction types involves moving subtrees around the tree, we now also have “redirects” (which are just tombstones with a “see here” address attached).

With the above changes in place, we can finally move forward with fixing and optimizing lookahead, as well as implementing incremental rehashing in a way that’s compatible with the above! The mutable global “current” tree variable is gone, save now simply takes a tree root as an argument, and applyReduction is now:

/// Apply a reduction to this tree, and return the resulting tree.
/// The original tree remains unchanged.
/// Copies only modified parts of the tree, and whatever references them.
Entity applyReduction(Entity origRoot, ref Reduction r)

With the biggest hurdle behind us, and a few more rounds of applying Walter Bright’s secret weapon, the performance metrics started to look more like what they should:


Going deeper would likely involve using OS-specific I/O APIs or rewriting D’s GC.

A mere 3.5x speed-up from a 32-fold increase in computational power may seem underwhelming. Here are some reasons for this:

  • With a 50/50 predictor, predictions form a complete binary tree, so doubling the number of parallel jobs gives you +1x more speed. That’s roughly log₂(jobs)-1, or 4 for 32 jobs – not far off!

  • The results will depend on the reduction being performed, so YMMV. For a certain artificial test case, one optimization (not pictured above) yielded a 500x speed-up!

  • DustMite does not try to keep all CPU cores busy all the time. If a prediction turns out false, all lookahead jobs based on it become wasted work, so DustMite only starts new lookahead tasks when a reduction’s outcome is resolved. Perhaps ideally DustMite would keep starting new jobs but kill them as soon as it discovers they’re based on a misprediction. As there is no cross-platform process group management in Phobos, the D standard library, this is something I left for future versions.

  • Some work is still done in the main thread, because moving it to a worker thread actually makes things slower due to the global GC lock.

There still remains one last place where DustMite iterates over every tree node per reduction: saving the tree to disk (so that it could be read by the test script). This seems unavoidable at first, but could actually be avoided by caching each node’s full text contents within the node itself.

I opted to leave this one out. With the other related improvements, such as using lockingBinaryWriter and aggregating writes of contiguous strings as one I/O operation, the increase in memory usage was much more dramatic than the decrease in execution time, even when optimized to just one allocation per reduction (polynomial hashing gives us every node’s total length for free). But, for a brief instant, DustMite processed reductions in sub-O(n) time.

One more addition is worth mentioning: Andrej Mitrovic suggested a switch which would replace removed text with whitespace, which would allow searching for exact line numbers in the test script. At the time, its addition posed significant challenges, as there needed to be some way to keep removed nodes in the tree but exclude them from future removal attempts. With the new tree representation, this became much easier, and also allowed creating the following animation:

In conclusion, I’d like to bring up that DustMite is good at more than just reducing compiler test cases. The wiki lists some ideas:

  • Finding the source of ambiguous or misleading compiler error messages (e.g., errors with the file/line information pointing only inside the standard library).

  • Alternative (much slower, but also much more thorough) method of verifying unit test code coverage. Just because a line of code is executed, that doesn’t mean it’s necessary; DustMite can be made to remove all code that does not affect the execution of your unit tests.

  • Similarly, if you have complete test coverage, it can be used for reducing the source tree to a minimal tree which includes support for only enabled unittests. This can be used to create a version of a program or library with a test-defined subset of features.

  • The --obfuscate mode can obfuscate your code’s identifiers. It can be used for preparing submission of proprietary code to bug trackers.

  • The --fuzz mode (a new feature) can help find bugs in compilers and tools by creating random programs (using fragments of other programs as input).

But DustMite is not limited to D programs (or any kind of programs) as input. With the --split option, we can tell DustMite how to parse and reduce other kinds of files. DustMite successfully handled the following scenarios:

  • reducing C++ programs (the D parser supports some C++-only syntax too);

  • reducing Python programs (using the indent split mode);

  • reducing a large commit to a minimal diff (using the diff split mode);

  • reducing a commit list, when git bisect is insufficient due to the problem being introduced across more than any single commit;

  • reducing a large data set to a minimal one, resulting in the same code coverage, with the purpose of creating a test suite;

  • and many more which I do not remember.

Today, some version of DustMite is readily available in major distributions (usually as part of some D-related package), so I’m happy having a favorite tool one apt-get / pacman -S away when I’m not at my PC.

Discovering a problem which can be elegantly reduced away by DustMite is always exciting for me, and I’m hoping you will find it useful too.

Tracing D Applications

At one time or another during application development you need to make a decision: does your application work like it should and, if not, what is wrong with it? There are different techniques to help you decide, some of which are logging, tracing, and profiling. How are they different? One way to look at it is like this:

  • when you know exactly the events you are interested in to make the decision, you use logging
  • when you don’t know exactly the events you need to make a decision and you are forced to collect as many events as you can, you use tracing
  • when you need to collect some events and analyze them to derive new information, you use profiling

In this article, we focus on tracing.

When developing an application, you can use tracing to monitor its characteristics at run time to, for example, estimate its performance or memory consumption. There are several options to do so, and some of them are:

  • means provided by the programming language (for example, using D’s writef, a.k.a. printf debugging)
  • debuggers (using scripts or remote tracing)
  • OS-specific tracing frameworks (linux {k|u}probes and usdt probes, linux kernel event, performance events in windows etc)

In this article, the following contrived D example is used to help illustrate all three cases. We’ll be focusing on Linux. All example code in this article can be found in the GitHub repository at https://github.com/drug007/tracing_post.

foreach(counter; 0..total_cycles)
{
    // randomly generate one of three kinds of event
    Event event = cast(Event) uniform(0, 3);

    // "perform" the job and benchmark its CPU execution time
    switch (event)
    {
        case Event.One:

            doSomeWork;

        break;
        case Event.Two:

            doSomeWork;

        break;
        case Event.Three:

            doSomeWork;

        break;
        default:
            assert(0);
    }
}

doSomeWork simulates a CPU-intensive job by using DRuntime’s Thread.sleep method. This is a very common pattern where an application runs in cycles and, on every iteration, performs a job depending on the application state. Here we can see that the application has three code paths (CaseOne, CaseTwo, and CaseThree). We need to trace the application at run time and collect information about its timings.

The writef-Based Approach

Using writef/ln from Phobos, D’s standard library, to trace the application is naive, but can be very useful nevertheless. The code from tracing_writef.d:

    case Event.One:
            auto sw = StopWatch(AutoStart.no);
            sw.start();

            doSomeWork;

            sw.stop();
            writefln("%d:\tEvent %s took: %s", counter, event, sw.peek);
        break;

With this trivial approach, StopWatch from the standard library is used to measure the execution time of the code block of interest. Compile and run the application with the command dub tracing_writef.d and look at its output:

Running ./example-writef
0:      Event One took:   584 ms, 53 μs, and 5 hnsecs
1:      Event One took:   922 ms, 72 μs, and 6 hnsecs
2:      Event Two took:   1 sec, 191 ms, 73 μs, and 8 hnsecs
3:      Event Two took:   974 ms, 73 μs, and 7 hnsecs
...

There is a price for this—we need to compile tracing code into our binary, we need to manually implement the collection of tracing output, disable it when we need to, and so on—and this means the size of the binary increases. To summarize:

Pros

  • all the might of Phobos is available to employ (except when in BetterC mode)
  • tracing output can be displayed in a human readable format (look at the nice output of Duration above; thanks to Jonathan M. Davis for the std.datetime package)
  • source code is portable
  • easy to use
  • no third-party tools required

Cons

  • the application must be rebuilt and restarted in order to make any changes, which is inappropriate for some applications (such as servers)
  • no low-level access to the application state
  • noise in the code due to the addition of tracing code
  • can be unusable due to a lot of debug output
  • overhead can be large
  • can be hard to use in production

This approach is very suitable in the early stages of development and less useful in the final product. Although, if the tracing logic is fixed and well defined, this approach can be used in production-ready applications/libraries. For example, this approach was suggest by Stefan Koch for tracing the DMD frontend to profile performance and memory consumption.

Debugger-Based Approach

The debugger, in this case GDB, is a more advanced means to trace applications. There is no need to modify the application to change the tracing methodology, making it very useful in production. Instead of compiling tracing logic into the application, breakpoints are set. When the debugger stops execution on a breakpoint, the developer can use the large arsenal of GDB functionality to inspect the internal state of the inferior (which, in GDB terms, usually refers to the process being debugged). It is not possible in this case to use Phobos directly, but helpers are available and, moreover, you have access to registers and the stack—options which are unavailable in the case of writef debugging.

Let’s take a look the code from tracing_gdb.d for the first event:

    case Case.One:

        doSomeWork;

    break;

As you can see, now there is no tracing code and it is much cleaner. The tracing logic is placed in a separate file called trace.gdb. It consists of a series of command blocks configured to execute on specific breakpoints, like this:

set pagination off
set print address off

break app.d:53
commands
set $EventOne = currClock()
continue
end

break app.d:54
commands
set $EventOne = currClock() - $EventOne
printf "%d:\tEvent One   took: %s\n", counter, printClock($EventOne)
continue
end

...

run
quit

In the first line, pagination is switched off. This enables scrolling so that there is no need to press “Enter” or “Q” to continue script execution when the current console fills up. The second line disables showing the address of the current breakpoint in order to make the output less verbose. Then breakpoints are set on lines 53 and 54, each followed by a list of commands (between the commands and end labels) that will be executed when GDB stops on these breakpoints. The breakpoint on line 53 is configured to fetch the current timestamp (using a helper) before doSomeWork is called, and the one on line 54 to get the current timestamp after doSomeWork has been executed. In fact, line 54 is an empty line in the source code, but GDB is smart enough to set the breakpoint on the next available line. $EventOne is a convenience variable where we store the timestamps to calculate code execution time. currClock() and printClock(long) are helpers to let us prettify the formatting by means of Phobos. The last two commands in the script initiate the debugging and quit the debugger when it’s finished.

To build and run this tracing session, use the following commands:

dub build tracing_gdb.d --single
gdb --command=trace.gdb ./tracing-gdb | grep Event

trace.gdb is the name of the GDB script and tracing-gdb is the name of the binary. We use grep to make the GDB output look like writefln output for easier comparison.

Pros

  • the code is clean and contains no tracing code
  • there is no need to recompile the application to change the tracing methodology—in many cases, it’s enough to simply change the GDB script
  • there is no need to restart the application
  • it can be used in production (sort of)
  • there is no overhead if/when not tracing and little when tracing
  • watchpoints and catchpoints can be used instead of breakpoints

Cons

  • using breakpoints in some cases may be inconvenient, annoying, or impossible.
  • GDB’s pretty-printing provides “less pretty” output because of the lack of full Phobos support compared to the writef approach
  • sometimes GDB is not available in production

The point about setting breakpoints in GDB being inconvenient is based on the fact that you can use only an address, a line number, or a function name (see the gdb manual). Using an address is too low level and inconvenient. Line numbers are ephemeral and can easily change when the source file is edited, so the scripts will be broken (this is annoying). Using function names is convenient and stable, but is useless if you need to place a tracing probe inside a function.

A good example of using GDB for tracing is Vladimir Panteleev’s dmdprof.

The USDT-Based Approach

So far we have two ways to trace our application that are complimentary, but is there a way to unify all the advantages of these two approaches and avoid their drawbacks? Of course, the answer is yes. In fact there are several ways to achieve this, but hereafter only one will be discussed: USDT (Userland Statically Defined Tracing).

Unfortunately, due to historical reasons, the Linux tracing ecosystem is fragmented and rather confusing. There is no plain and simple introduction. Get ready to invest much more time if you want to understand this domain. The first well-known, full-fledged tracing framework was DTrace, developed by Sun Microsystems (now it is open source and licensed under the GPL). Yes, strace and ltrace have been around for a long time, but they are limited, e.g., they do not let you trace what happens inside a function call. Today, DTrace is available on Solaris, FreeBSD, macOS, and Oracle Linux. DTrace is not available in other Linux distributions because it was initially licensed under the CDDL. In 2018, it was relicensed under the GPL, but by then Linux had its own tracing ecosystem. As with everything, Open Source has disadvantages. In this case, it resulted in fragmentation. There are now several tools/frameworks/etc. that are able to solve the same problems in different ways but somehow and sometimes can interoperate with each other.

We will be using bpftrace, a high level tracing language for Linux eBPF. In D, USDT probes are provided by the usdt library. Let’s start from the code in tracing_usdt.d:

	case Case.One:
		mixin(USDT_PROBE!("dlang", "CaseOne", kind));

		doSomeWork;

		mixin(USDT_PROBE!("dlang", "CaseOne_return", kind));
	break;

Here we mixed in two probes at the start and the end of the code of interest. It looks similar to the first example using writef, but a huge difference is that there is no logic here. We only defined two probes that are NOP instructions. That means that these probes have almost zero overhead and we can use them in production. The second great advantage is that we can change the logic while the application is running. That is just impossible when using the writef approach. Since we are using bpftrace, we need to write a script, called bpftrace.bt, to define actions that should be performed on the probes:

usdt:./tracing-usdt:dlang:CaseOne
{
	@last["CaseOne"] = nsecs;
}

usdt:./tracing-usdt:dlang:CaseOne_return
{
	if (@last["CaseOne"] != 0)
	{
		$tmp = nsecs;
		$period = ($tmp - @last["CaseOne"]) / 1000000;
		printf("%d:\tEvent CaseOne   took: %d ms\n", @counter++, $period);
		@last["CaseOne"] = $tmp;
		@timing = hist($period);
	}
}
...

The first statement is the action block. It triggers on the USDT probe that is compiled in the ./tracing-usdt executable (it includes the path to the executable) with the dlang provider name and the CaseOne probe name. When this probe is hit, then the global (indicated by the @ sign) associative array last updates the current timestamp for its element CaseOne. This stores the time of the moment the code starts running. The second action block defines actions performed when the CaseOne_return probe is hit. It first checks if corresponding element in the @last associative array is already initialized. This is needed because the application may already be running when the script is executed, in which case the CaseOne_return probe can be fired before CaseOne. Then we calculate how much time code execution took, output it, and store it in a histogram called timing.

The BEGIN and END blocks at the top of bpftrace.bt define actions that should be performed at the beginning and the end of script execution. This is nothing more than initialization and finalization. Build and run the example with:

dub build tracing_usdt.d   --single --compiler=ldmd2 # or gdc
./tracing-usdt &                                     # run the example in background
sudo bpftrace bpftrace.bt                            # start tracing session

Output:

Attaching 8 probes...
0:	Event CaseThree took: 552 ms
1:	Event CaseThree took: 779 ms
2:	Event CaseTwo   took: 958 ms
3:	Event CaseOne   took: 1174 ms
4:	Event CaseOne   took: 1059 ms
5:	Event CaseThree took: 481 ms
6:	Event CaseTwo   took: 1044 ms
7:	Event CaseThree took: 611 ms
8:	Event CaseOne   took: 545 ms
9:	Event CaseTwo   took: 1038 ms
10:	Event CaseOne   took: 913 ms
11:	Event CaseThree took: 989 ms
12:	Event CaseOne   took: 1149 ms
13:	Event CaseThree took: 541 ms
14:	Event CaseTwo   took: 1072 ms
15:	Event CaseOne   took: 633 ms
16:	Event CaseTwo   took: 832 ms
17:	Event CaseTwo   took: 1120 ms
^C



@timing:
[256, 512)             1 |@@@@@                                               |
[512, 1K)             10 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)               7 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                |

In the session output above there are only 18 lines instead of 20; it’s because tracing-usdt was started before the bpftrace script so the first two events were lost. Also, it’s necessary to kill the example by typing Ctrl-C after tracing-usdt completes. After the bpftrace script stops execution, it ouputs the contents of the timing map as a histogram. The histogram says that one-time code execution takes between 256 and 512 ms, ten times between 512 and 1024 ms, and seven times more between 1024 and 2048 ms. These builtin statistics make using bpftrace easy.

Pros

  • provides low-level access to the code (registers, memory, etc.)
  • minimal noise in the code
  • no need to recompile or restart when changing the tracing logic
  • almost zero overhead
  • can be effectively used in production

Cons

  • learning USDT can be hard, particularly considering the state of the Linux tracing ecosystem
  • requires external tools (frontends)
  • OS specific

Note: GDB has had support for USDT probes since version 7.5. To use it, modify the trace.gdb script to set breakpoints using USDT probes instead of line numbers. That eases development because it eliminates the need to synchronize line numbers during source code modification.

Futher reading:

Conclusion

Feature writef gdb usdt
pretty
printing
by means of Phobos
and other libs
by means of
pretty-printing
limited builtins
low-level no yes yes
clean code no yes sort of
recompilation yes no no
restart yes no no
usage
complexity
easy easy+ advanced
third-party
tools
no only debugger tracing system front end
cross platform yes sorta of OS specific
overhead can be large none can be ignored
even in production
production ready sometimes possible sometimes impossible yes

Feature descriptions:

  • pretty printing is important if the tracing output should be read by humans (and can be ignored in the case of inter-machine data exchange)
  • low-level means access to low-level details of the traced binary, e.g., registers or memory
  • clean code characterizes whether additional tracing code which is unrelated to the applications’s business logic would be required.
  • recompilation determines if it is necessary to recompile when changing the tracing methodology
  • restart determines if it is necessary to restart the application when changing the tracing methodology
  • usage complexity indicates the level of development experience that may be required to utilize this technology
  • third-party tools describes tools not provided by standard D language distributions are required to use this technology
  • cross platform indicates if this technology can be used on different platforms without changes
  • overhead – the cost of using this technology
  • production ready – indicates if this technology may be used in a production system without consequences

wc in D: 712 Characters Without a Single Branch

After reading “Beating C With 80 Lines Of Haskell: Wc”, which I found on Hacker News, I thought D could do better. So I wrote a wc in D.

The Program

It consists of one file and has 34 lines and 712 characters.

import std.stdio : writefln, File;
import std.algorithm : map, fold, splitter;
import std.range : walkLength;
import std.typecons : Yes;
import std.uni : byCodePoint;

struct Line {
	size_t chars;
	size_t words;
}

struct Output {
	size_t lines;
	size_t words;
	size_t chars;
}

Output combine(Output a, Line b) pure nothrow {
	return Output(a.lines + 1, a.words + b.words, a.chars + b.chars);
}

Line toLine(char[] l) pure {
	return Line(l.byCodePoint.walkLength, l.splitter.walkLength);
}

void main(string[] args) {
	auto f = File(args[1]);
	Output o = f
		.byLine(Yes.keepTerminator)
		.map!(l => toLine(l))
		.fold!(combine)(Output(0, 0, 0));

	writefln!"%u %u %u %s"(o.lines, o.words, o.chars, args[1]);
}

Sure, it is using Phobos, D’s standard library, but then why wouldn’t it? Phobos is awesome and ships with every D compiler. The program itself does not contain a single if statement. The Haskell wc implementation has several if statements. The D program, apart from the main function, contains three tiny functions. I could have easily put all the functionally in one range chain, but then it probably would have exceeded 80 characters per line. That’s a major code-smell.

The Performance

Is the D wc faster than the coreutils wc? No, but it took me 15 minutes to write mine (I had to search for walkLength, because I forgot its name).

file lines bytes coreutils haskell D
app.d 46 906 3.5 ms +- 1.9 ms 39.6 ms +- 7.8 ms 8.9 ms +- 2.1 ms
big.txt 862 64k 4.7 ms +- 2.0 ms 39.6 ms +- 7.8 ms 9.8 ms +- 2.1 ms
vbig.txt 1.7M 96M 658.6ms +- 24.5ms 226.4 ms +- 29.5 ms 1.102 s +- 0.022 s
vbig2.txt 12.1M 671M 4.4 s +- 0.058 s 1.1 s +- 0.039 s 7.4 s +- 0.085 s

Memory:

file coreutils haskell D
app.d 2052K 7228K 7708K
big.txt 2112K 7512K 7616K
vbig.txt 2288K 42620K 7712K
vbig2.txt 2360K 50860K 7736K

Is the Haskell wc faster? For big files, absolutely, but then it is using threads. For small files, GNU’s coreutils still beats the competition. At this stage my version is very likely IO bound, and it’s fast enough anyway.

I’ll not claim that one language is faster than another. If you spend a chunk of time on optimizing a micro-benchmark, you are likely going to beat the competition. That’s not real life. But I will claim that functional programming in D gives functional programming in Haskell a run for its money.

A Bit About Ranges

Digital Mars D logoA range is an abstraction that you can consume through iteration without consuming the underlying collection (if there is one). Technically, a range can be a struct or a class that adheres to one of a handful of Range interfaces. The most basic form, the InputRange, requires the function

void popFront();

and two members or properties:

T front;
bool empty;

T is the generic type of the elements the range iterates.

In D, ranges are special in a way that other objects are not. When a range is given to a foreach statement, the compiler does a little rewrite.

foreach (e; range) { ... }

is rewritten to

for (auto __r = range; !__r.empty; __r.popFront()) {
    auto e = __r.front;
    ...
}

auto e = infers the type and is equivalent to T e =.

Given this knowledge, building a range that can be used by foreach is easy.

struct Iota {
	int front;
	int end;

	@property bool empty() const {
		return this.front == this.end;
	}

	void popFront() {
		++this.front;
	}
}

unittest {
	import std.stdio;
	foreach(it; Iota(0, 10)) {
		writeln(it);
	}
}

Iota is a very simple range. It functions as a generator, having no underlying collection. It iterates integers from front to end in steps of one. This snippet introduces a little bit of D syntax.

@property bool empty() const {

The @property attribute allows us to use the function empty the same way as a member variable (calling the function without the parenthesis). The trailing const means that we don’t modify any data of the instance we call empty on. The built-in unit test prints the numbers 0 to 10.

Another small feature is the lack of an explicit constructor. The struct Iota has two member variables of type int. In the foreach statement in the test, we create an Iota instance as if it had a constructor that takes two ints. This is a struct literal. When the D compiler sees this, and the struct has no matching constructor, the ints will be assigned to the struct’s member variables from top to bottom in the order of declaration.

The relation between the three members is really simple. If empty is false, front is guaranteed to return a different element, the next one in the iteration, after a call to popFront. After calling popFront the value of empty might have changed. If it is true, this means there are no more elements to iterate and any further calls to front are not valid. According to the InputRange documentation:

  • front can be legally evaluated if and only if evaluating empty has, or would have, equaled false.
  • front can be evaluated multiple times without calling popFront or otherwise mutating the range object or the underlying data, and it yields the same result for every evaluation.

Now, using foreach statements, or loops in general, is not really functional in my book. Lets say we want to filter all uneven numbers of the Iota range. We could put an if inside the foreach block, but that would only make it worse. It would be nicer if we had a range that takes a range and a predicate that can decide if an element is okay to pass along or not.

struct Filter {
	Iota input;
	bool function(int) predicate;

	this(Iota input, bool function(int) predicate) {
		this.input = input;
		this.predicate = predicate;
		this.testAndIterate();
	}

	void testAndIterate() {
		while(!this.input.empty
				&& !this.predicate(this.input.front))
		{
			this.input.popFront();
		}
	}

	void popFront() {
		this.input.popFront();
		this.testAndIterate();
	}

	@property int front() {
		return this.input.front;
	}

	@property bool empty() const {
		return this.input.empty;
	}
}

bool isEven(int a) {
	return a % 2 == 0;
}

unittest {
	foreach(it; Filter(Iota(0,10), &isEven)) {
		writeln(it);
	}
}

Filter is again really simple: it takes one Iota and a function pointer. On construction of Filter, we call testAndIterate, which pops elements from Iota until it is either empty or the predicate returns false. The idea is that the passed predicate decides what to filter out and what to keep. The properties front and empty just forward to Iota. The only thing that actually does any work is popFront. It pops the current element and calls testAndIterate. That’s it. That’s an implementation of filter.

Sure, there is a new while loop in testAndIterate, but rewriting that with recursion is just silly, in my opinion. What makes D great is that you can use the right tool for the job. Functional programming is fine and dandy a lot of the time, but sometimes it’s not. If a bit of inline assembly would be necessary or nicer, use that.

The call to Filter still does not look very nice. Assuming, you are used to reading from left to right, Filter comes before Iota, even though it is executed after Iota. D has no pipe operator, but it does have Uniform Function Call Syntax (UFCS). If an expression can be implicitly converted to the first parameter of a function, the function can be called like it is a member function of the type of the expression. That’s a lot of words, I know. An example helps:

string foo(string a) {
	return a ~ "World";
}

unittest {
	string a = foo("Hello ");
	string b = "Hello ".foo();
	assert(a == b);
}

The above example shows two calls to the function foo. As the assert indicates, both calls are equivalent. What does that mean for our Iota Filter example? UFCS allows us to rewrite the unit test to:

unittest {
	foreach(it; Iota(1,10).Filter(&isEven)) {
		writeln(it);
	}
}

Implementing a map/transform range should now be possible for every reader. Sure, Filter can be made more abstract through the use of templates, but that’s just work, nothing conceptually new.

Of course, there are different kinds of ranges, like a bidirectional range. Guess what that allows you to do. A small tip: a bidirectional range has two new primitives called back and popBack. There are other range types as well, but after you understand the input range demonstrated twice above, you pretty much know them all.

P.S. Just to be clear, do not implement your own filter, map, or fold; the D standard library Phobos has everything you every need. Have a look at std.algorithm and std.range.

About the Author

Robert Schadek received a master’s degree in Computer Science at the University of Oldenburg. His master thesis was titled “DMCD A Distributed Multithreading Caching D Compiler” where he work on building a D compiler from scratch. He was a computer science PhD student from 2012–2018 at the University of Oldenburg. His PhD research focuses on quorum systems in combination with graphs. Since 2018 he is happily using D in his day job working for Symmetry Investments.

What is Symmetry Investments?

Symmetry Investments is a global investment company with offices in Hong Kong, Singapore and London. We have been in business since 2014 after successfully spinning off from a major New York-based hedge fund.

At Symmetry, we seek to engage in intelligent risk-taking to create value for our clients, partners and employees. We derive our edge from our capacity to generate Win-Wins – in the broadest sense. Win-Win is our fundamental ethical and strategic principle. By generating Win-Wins, we can create unique solutions that reconcile perspectives that are usually seen as incompatible or opposites, and encompass the best that each side has to offer. We integrate fixed-income arbitrage with global macro strategies in a novel way. We invent and develop technology that focuses on the potential of human-machine integration. We build systems where machines do what they do best, supporting people to do what people do best. We are creating a collaborative meritocracy: a culture where individual contribution serves both personal and collective goals – and is rewarded accordingly. We value both ownership thinking AND cooperative team spirit, self-realisation AND community.

People at Symmetry Investments have been active participants in the D community since 2014. We have sponsored the development of excel-d, dpp, autowrap, libmir, and various other projects. We started Symmetry Autumn of Code in 2018 and hosted DConf 2019 in London.

D For Data Science: Calling R from D

Digital Mars D logoD is a good language for data science. The advantages include a pleasant syntax, interoperability with C (in many cases as simple as adding an #include directive to import a C header file via the dpp tool), C-like speed, a large standard library, static typing, built-in unit tests and documentation generation, and a garbage collector that’s there when you want it but can be avoided when you don’t.

Library selection for data science is a different story. Although there are some libraries available, such as those provided by the mir project, the available functionality is extremely limited compared with languages like R and Python. The good news is that it’s possible to call functions in either language from D.

This article shows how to embed an R interpreter inside a D program, pass data between the two languages, execute arbitrary R code from within a D program, and call the R interface to C, C++, and Fortran libraries from D. Although I only provide examples for Linux, the same steps apply for Windows if you’re using WSL, and with minor modifications to the DUB package file, everything should work on macOS. Although it is possible to do so, I don’t talk about calling D functions from R, and I don’t include any discussion of interoperability with Python. (This is normally done using pyd.)

Dependencies

The following three dependencies should be installed:

  • R
  • R package RInsideC
  • R package embedr

It’s assumed that anyone reading this post already has R installed or can install it if they don’t. The RInsideC package is a slightly modified version of the excellent RInside project of Dirk Eddelbuettel and Romain Francois. RInside provides a C++ interface to R. The modifications provide a C interface so that R can be called from any language capable of calling C functions. Install the package using devtools:

library(devtools)
install_bitbucket("bachmeil/rinsidec")

The embedr package provides the necessary functions to work with R from within D. That package is also installed with devtools:

install_bitbucket("bachmeil/embedr")

A First Program

The easiest way to do the compilation is to use D’s package manager, called DUB. From within your project directory, open R and create a project skeleton:

library(embedr)
dubNew()

This will create a /src subdirectory to hold your project’s source code if it doesn’t already exist, add a file called r.d to /src and create a dub.sdl file in the project directory. Create a file in the /src directory called hello.d, containing the following program:

import embedr.r;

void main() {
  evalRQ(`print("Hello, World!")`);
}

From the terminal, in the project directory (the one holding dub.sdl, not the /src subdirectory), enter

dub run

This will print out “Hello, World!”. The important thing to realize is that even though you just used DUB to compile and run a D program, it was R that printed “Hello, World!” to the screen.

Executing R Code From D

There are two ways to execute R code from a D program. evalR executes a string in R and returns the output to D, while evalRQ does the same thing but suppresses the output. evalRQ also accepts an array of strings that are executed sequentially.

Create a new project directory and run dubNew inside it, as you did for the first example. In the src/ subdirectory, add a file named reval.d:

import embedr.r;
import std.stdio;

void main() {
  // Example 1
  evalRQ(`print(3+2)`); // evaluates to 5 in R, R prints the output [1] 5 to the screen

  // Example 2
  writeln(evalR(`3+2`).scalar); // evaluates to 5 in R, output is 5

  // Example 3
  evalRQ(`3+2`); // evaluates to 5 in R, but there is no output

  // Example 4
  evalRQ([`x <- 3`, `y <- 2`, `z <- x+y`, `print(z)`]); // evaluates this code in R
}

Example 1 tells R to print the sum of 3 and 2. Because we use evalRQ, no output is returned to D, but R is able to print to the screen. Example 2 evaluates 3+2 in R and returns the output to D in the form of an Robj. evalR(``3+2``).scalar executes 3+2 inside R, captures the output in an Robj, and converts the Robj into a double holding the value 5. This value is passed to the writeln function and printed to the screen. Example 3 doesn’t output anything, because evalRQ does not return any output, and R isn’t being told to print anything to the screen. Example 4 executes the four strings in the array sequentially, returning nothing to D, but the last tells R to print the value of z to the screen.

There’s not much more to say about executing R code from D. You can execute any valid R code from D, and if there’s an error, it will be caught and printed to the screen. Graphical output is automatically captured in a PDF file. To work interactively with R, or if it’s sufficient to save the results to a text file and read them into D, this is all you need to know. The more interesting cases involve passing data between D and R, and for the times when there is no alternative, using the R interface to call directly into C, C++, or Fortran libraries.

Passing Data Between D and R

A little background is needed to understand how to pass data between D and R. Everything in R is represented as a C struct named SEXPREC, and a pointer to a SEXPREC struct is called a SEXP in the R source code. Those names reflect R’s origin as a Scheme dialect, where code takes the form of s-expressions. In order to avoid misunderstanding, embedr uses the name Robj instead of SEXP.

It’s necessary to let R allocate the memory for any data passed to R. For instance, you cannot tell D to allocate a double[] array and then pass a pointer to that array to R. You would instead do something like this:

auto v = RVector(100);
foreach(ii; 0..100) {
  v[ii] = 1.5*ii;
}
v.toR("vv");
evalRQ(`print(vv)`);

The first line tells R to allocate a vector with room for 100 elements. v is a D struct holding a pointer to the memory allocated by R plus additional information that allows you to read and change the elements of the vector. Behind the scenes, the RVector struct protects the vector from R’s garbage collector. R is a garbage collected language, and if the only reference to the data is in your D program, there’s nothing to prevent the R garbage collector from freeing that memory. The RVector struct uses the reference counting mechanism described in Adam Ruppe’s D Cookbook to protect objects from R’s garbage collector and unprotect them when they’re no longer in use.

After filling in all 100 elements of v, the toR function creates a new variable in R called vv, and associates it with the vector held inside v. The final line tells R to print out the variable vv.

In practice, no data is ever passed between D and R. The only thing that’s passed around is a single pointer to the memory allocated by R. That means it’s practical to call R functions from D even for very large datasets.

Calling the R API

The R API provides a convenient (by C standards) interface to some of R’s functions and constants, including the numerical optimization routines underlying optim, distribution functions, and random number generators. This example shows how to solve an unconstrained nonlinear optimization problem using the Nelder-Mead algorithm, which is the default when calling optim in R.

The objective function is

f = x^2 + y^2

We want to choose x and y to minimize f. The obvious solution is x=0 and y=0.

Create a new project directory and initialize DUB from within R, with the one additional step to add the wrapper for R’s optimization libraries:

library(embedr)
dubNew()
dubOptim()

dubOptim() adds the file optim.d to the src/ directory. Create a file called nelder.d inside the src directory with the following program:

import embedr.r, embedr.optim;
import std.stdio;

extern(C) {
  double f(int n, double * par, void * ex) {
    return par[0]*par[0] + par[1]*par[1];
  }
}

void main() {
  auto nm = NelderMead(&f);
  OptimSolution sol = nm.solve([3.5, -5.5]);
  sol.print;
}

First we define the objective function, f, using the C calling convention so it can be passed to various C functions. We then create a new struct called NelderMead, passing a pointer to f to its constructor. Finally, we call the solve method, using [3.5, -5.5] as the array of starting values, and print out the solution. You’ll want to confirm that the failure code in the output is false (implying the convergence criterion was met). The most common reason that Nelder-Mead will fail to converge is because it took too many iterations. To change the maximum number of iterations to 10,000, you’d add nm.maxit = 10_000; to your program before the call to nm.solve.

There’s no overhead associated with calling an interpreted language in this example. We’re calling a C shared library directly, and at no point does the R interpreter get involved. As in the previous example, since there’s no copying of data, this approach is efficient even for large datasets. Finally, if you’re not comfortable with garbage collection, the inner loops of the optimization are done entirely in C. We nonetheless do take advantage of the convenience and safety of D’s garbage collector when allocating the nm and sol structs, as the performance advantages of manual memory management (to the extent that there are any) are irrelevant.

Calling R Interfaces from D

The purpose of many R packages is to provide a convenient interface to a C, C++, or Fortran library. The term “R interface” normally means one of two things. For modern C or C++ code, it’s a function taking Robj structs as arguments and returning one Robj struct as the output. For Fortran code and older C or C++ code, it’s a void function taking pointers as arguments. In either case, you can call the R interface directly from D code, meaning any library with an R interface also has a D interface.

An example of an R interface to Fortran code is found in the popular glmnet package.
Lasso estimation using the elnet function is done by passing 28 pointers to the function elnet in libglmnet.so with this interface:

.Fortran("elnet", ka, parm=alpha, nobs, nvars, as.double(x), y,
                  weights, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,
                  isd, intr, maxit, lmu=integer(1), a0=double(nlam),
                  ca=double(nx*nlam), ia=integer(nx), nin=integer(nlam),
                  rsq=double(nlam), alm=double(nlam), nlp=integer(1),
                  jerr=integer(1), PACKAGE="glmnet")

You might want to work with the R interface directly if you’re calling elnet inside a loop in your D program. Most of the time it’s better to pass the data to R and then call the R function that calls elnet. Calling Fortran functions can be error-prone, leading to hard to debug segmentation faults.

Conclusion

D was designed from the beginning to be compatible with the C ABI. The intention was to facilitate the integration of new D code into existing C code bases. The practical result has been that, due to C’s lingua franca status, D can be used in combination with myriad languages. Data scientists looking for alternatives to C and C++ when working with R may find benefit in giving D a close look.

Lance Bachmeier is an associate professor of economics at Kansas State University and co-editor of the journal Energy Economics. He does research on macroeconomics and energy economics. He has been using the D programming language in his research since 2013.

Saving Money by Switching from PHP to D

2night was born in 2000 as an online magazine focused on nightlife and restaurants in Italy. Over the years, we have evolved into a full-blown experiential marketing agency, keeping up our vocation of spreading what’s cool to do when you go out, but specialized in producing brand events and below-the-line unconventional marketing campaigns.

We started using D at 2night in 2012 when we developed a webservice used by our Android and iOS apps. It has worked fine since then, but it was just a small experiment. In 2019, after many other experiments, we decided to take the big step: we switched the complete website from PHP to D. The time was right; we had been planning to give our website a new look and we took this opportunity to rewrite the entire infrastructure.

Development

The job turned out to be easier than we had imagined. We implemented a small D backend over our Mongo database in a few hundred lines. We created a Simple Common Gateway Interface (SCGI) library to interface with the NGINX server and another library to work with the DOM. Using the HTML DOM instead of an obscure HTML template language helped us speed up development a lot. In this way, someone who works on HTML or JavaScript is not required to know D or any template language and can deploy plain HTML and CSS files. On the other hand, someone who works on the backend does not care so much about HTML tags since he can simply access elements by ID, class, etc.; if some HTML tags are moved around the page the whole thing still works. HTML+CSS+JavaScript on the frontend and D on the backend are totally independent.

Writing code in this way is quite simple. Let’s say we want to build a blog page. We start from a simple HTML file like this:

<!DOCTYPE html>
<html lang="en">
  <head><title>Test page</title></head>
  <body>

    <!-- Main post -->
    <h1>Post title</h1>
    <h2>The optional subheading</h2>
    <p>
      Lorem ipsum dolor sit amet, consectetur adipiscing elit.
      Proin a velit tempus, eleifend ex non, aliquam ipsum.
      Nullam molestie enim leo, viverra finibus diam faucibus a.
      Ut dapibus a orci in eleifend.
    </p>

    <!-- Two more posts -->
    <div id="others">
      <h3>Other posts</h3>

      <div>
        <h4>Post#2</h4>
        <p>
          Morbi tempus pretium metus, et aliquet dolor.
          Duis venenatis convallis nisi, auctor elementum augue rutrum in.
          Quisque euismod vestibulum velit id pharetra.
          Morbi hendrerit faucibus sem, ac tristique libero...
        </p>
      </div>

      <div>
        <h4>Post #3</h4>
        <p>Sed sit amet vehicula nisl. Nulla in mi est.
          Vivamus mollis purus eu magna ullamcorper, eget posuere metus sodales.
          Vestibulum ipsum ligula, vehicula sit amet libero at, elementum vestibulum mi.
        </p>
      </div>
    </div>

  </body>
</html>

This is a valid HTML5 file that can be edited by anyone who knows HTML. Now we have to fill this template with real data from a database, which we can represent as an array in this example for the sake of simplicity:

// A blog post
struct SimplePost
{
  string heading;
  string subheading;
  string text;
  string uri;
}

SimplePost[] posts = [
  SimplePost("D is awesome!", "This is a real subheading", "Original content was replaced", "http://dlang.org"),
  SimplePost("Example post #1", "Example subheading #1", "Random text #1"),
  SimplePost("Example post #2", "Example subheading #2", "Random text #2"),
  SimplePost("Example post #3", "Example subheading #3", "This will never be shown")
];

First, we must read our HTML template just as it is and parse it using our html5 library:

  auto page = readText("html/test.html");

  // Parse the source
  auto dom = parser.parse(page);

Then we replace the content of the main article with data from the first element of our array. We use the tag name in order to select the correct HTML element:

  // Take the first element from our data source
  auto mainPost = posts.front;

  // Update rendered data of main post
  dom.byTagName("h1").front.innerText = mainPost.heading;
  dom.byTagName("p").front.innerText = mainPost.text;
  dom.byTagName("a").front["href"] = mainPost.uri;

We want to check if our article has a subtitle. If it doesn’t we’re going to remove the related tag.

  // If we have a subtitle we show it. If not, we remove the node from our page
  if (mainPost.subheading.empty) dom.byTagName("h2").front.detach();
  else dom.byTagName("h2").front.innerText = mainPost.subheading;

If you wanted to get the same result with a template language, you’d probably need to mess up the HTML with something like this:

<!-- We don't like this! -->
<? if(!post.subheading.isEmpty) ?>
<h2><?= post.subheading ?></h2>
<? endif ?>

This mixes logic inside the view and it disrupts the whole HTML file. Anyone who works on the HTML frontend is supposed to know what post is, the logic behind this object, and the template language itself. Last but not least, many HTML editors would probably be driven crazy by any custom syntax. And this is still a simple case!

Going back to our example, to fill the last part of our page we must get the container from the DOM. All we need is to perform a search by ID on the DOM:

auto container = dom.byId("others").front;

Now we use the first element inside the container as a template. So we clone it and we empty the container itself:

  // Use the first children as template
  auto containerItems = container.byCssSelector(`div[id="others"] > div`);
  auto otherPostTemplate = containerItems.front.clone();

  // Remove all existing children from container
  containerItems.each!(item => item.detach);

Finally we add a new child to the container for each post in our data source:

  // Take 2 more posts from list. We drop the first, it's the main one.
  foreach(post; posts.drop(1).take(2))
  {
    // Clone our html template
    auto newOtherPost = otherPostTemplate.clone();

    // Update it with our data
    newOtherPost.byTagName("h4").front.innerText = post.heading;
    newOtherPost.byTagName("p").front.innerText = post.text;

    // Add it to html container
    container.appendChild(newOtherPost);
  }

Putting it all together:

import std;
import arrogant;

// Init
auto parser = Arrogant();

// A blog post
struct SimplePost
{
  string heading;
  string subheading;
  string text;
  string uri;
}

/*
  Of course real data should come from a db query.
  We're using an array for simplicity
*/
SimplePost[] posts = [
  SimplePost("D is awesome!", "This is a real subheading", "Original content was replaced", "http://dlang.org"),
  SimplePost("Example post #1", "Example subheading #1", "Random text #1"),
  SimplePost("Example post #2", "Example subheading #2", "Random text #2"),
  SimplePost("Example post #3", "Example subheading #3", "This will never be shown")
];

void main()
{
  // Our template from disk
  auto page = readText("html/test.html");

  // Parse the source
    auto dom = parser.parse(page);

  // Take the first element from our data source
  auto mainPost = posts.front;

  // Update rendered data of main post
  dom.byTagName("h1").front.innerText = mainPost.heading;
  dom.byTagName("p").front.innerText = mainPost.text;
  dom.byTagName("a").front["href"] = mainPost.uri;

  // If we have a subtitle we show it. If not, we remove the node from our page
  if (mainPost.subheading.empty) dom.byTagName("h2").front.detach();
  else dom.byTagName("h2").front.innerText = mainPost.subheading;

  // -----
  // Other articles
  // -----

  // Get the container
  auto container = dom.byId("others").front;

  // Use the first children as template
  auto containerItems = container.byCssSelector(`div[id="others"] > div`);
  auto otherPostTemplate = containerItems.front.clone();

  containerItems.each!(item => item.detach);

  // Take 2 more posts from list. We drop the first, it's the main one.
  foreach(post; posts.drop(1).take(2))
  {
    // Clone our html template
    auto newOtherPost = otherPostTemplate.clone();

    // Update it with our data
    newOtherPost.byTagName("h4").front.innerText = post.heading;
    newOtherPost.byTagName("p").front.innerText = post.text;

    // Add it to html container
    container.appendChild(newOtherPost);
  }

  writeln(dom.document);

}

This program will output a new valid HTML5 page like this:

<!DOCTYPE html>
<html lang="en">
  <head><title>Test page</title></head>
  <body>
    <h1>D is awesome!</h1>
    <h2>This is a real subheading</h2>
    <p>Original content was replaced</p>
    <a href="http://dlang.org">More...</a>
    <h3>Other posts</h3>
    <div id="others">
      <div>
        <h4>Example post #1</h4>
        <p>Random text #1</p>
      </div>
      <div>
        <h4>Example post #2</h4>
        <p>Random text #2</p>
      </div>
    </div>
  </body>
</html>

Of course, the same results could be achieved in many other ways and in other languages too. Our library is just a wrapper over a plain C library named Modest. But what really makes the difference is how easy it is to write and read code thanks to D’s powerful and easy-to-understand syntax. The code shown above can be easily understood by anyone has some programming experience. I’ve received pull requests for our project from colleagues who had never heard of D at all.

That’s only one part of the big picture since we’re using many different libraries for different purposes.

Performance

Obviously, performance was a big win. The website felt like it was running on local machines, bringing a dramatic increase to speed and lower latency across the board. After the switch, at first the load on our cloud servers was so low that we thought the website was down! Switching from PHP to D meant we could cut in half the instance size of each Amazon AWS machine in our cloud. And these machines are still underloaded. Our database cloud was highly affected by this too. We now use one quarter of its original computational power. All of this brought an instantaneous and dramatic cost savings, down to more than half of what our costs used to be.

One more thing…

A few days after launch we realized that some of our costs were rising anyway. We were relying on a third-party service to host and cut the pictures we display on the website. This is not a simple task; in order to crop a picture correctly, you need to know where the subjects of the picture are located and you must try to keep them inside the trimmed frame. On the legacy website we mostly used a fixed proportion for images and we used a third-party service for some special cases. The new version of 2night.it has several different possible cuts for each “master” picture, and this raised the costs by 15x! Luckily, we found that a D binding to the OpenCV API is available. We used this to develop a smart algorithm that can cut any photo while preserving the subject of the picture. And again, the performance of our service is so impressive that we do not need a new machine to host it. In a week or so the costs for pictures dropped from some thousands of euros per month to almost 0.