DCompute: Running D on the GPU

Nicholas Wilson is a student at Murdoch University, studying for his BEng (Hons)/BSc in Industrial Computer Systems (Hons) and Instrumentation & Control / Molecular Biology & Genetics and Biomedical Science. He just finished his thesis on low-cost defect detection of solar cells by electroluminescence imaging, which gives him time to work on DCompute and write about it for the D Blog. He plays the piano, ice skates, and has spent 7 years putting D to use on number bashing, automation, and anything else that he could make a computer do for him.


DCompute is a framework and compiler extension to support writing native kernels for OpenCL and CUDA in D to utilize GPUs and other accelerators for computationally intensive code. Its compute API drivers automate the interactions between user code and the tedious and error prone APIs with the goal of enabling the rapid development of high performance D libraries and applications.

Introduction

This is the second article on DCompute. In the previous article, we looked at the development of DCompute and some trivial examples. While we were able to successfully build kernels, there was no way to run them short of using them with an existing framework or doing everything yourself. This is no longer the case. As of v0.1.0, DCompute now comes with native wrappers for both OpenCL and CUDA, enabling kernel dispatch as easily as CUDA’s <<<...>>> syntax.

In order to run a kernel we need to pass it off to the appropriate compute API, either CUDA or OpenCL. While these APIs both try to achieve similar things, they are different enough that, to squeeze the last bit of performance out of them, you need to treat each API separately. But there is sufficient overlap that we can make the interface reasonably consistent between the two. The C bindings to these APIs, however, are very low level, and trying to use them is very tedious and extremely prone to error (yay void*).
In addition to the tedium and error proneness, you have to redundantly specify a lot of information, which further compounds the problem. Fortunately this is D and we can remove a lot of the redundancy through introspection and code generation.

The drivers wrap the C API, providing a clean and consistent interface that’s easy to use. While the documentation is a little sparse at the moment, the source code is for the most part straightforward (if you’re familiar with the C APIs, looking at where a function is used is a good place to start). There is the occasional piece of magic to achieve a sane API.

Taming the beasts

OpenCL’s clGet*Info functions are the way to access properties of the class hidden behind the void*. A typical call looks like

enum CL_FOO_REFERENCE_COUNT = 0x1234;
cl_foo* foo = ...; 
cl_int refCount;
clGetFooInfo(foo, CL_FOO_REFERENCE_COUNT, refCount.sizeof, &refCount, null);

And that’s not even one of the properties for which you have to call once to figure out how much memory you need to allocate, then call again with the allocated buffer (and $DEITY help you if you want to get a cl_program’s binaries).

Using D, I have been able to turn that into this:

struct Foo
{
    void* raw;
    static struct Info
    {
        @(0x1234) int referenceCount;
        ...
    }
    mixin(generateGetInfo!(Info, clGetFooInfo));
}

Foo foo = ...;
int refCount = foo.referenceCount;

All the magic is in generateGetInfo to generate a property for each member in Foo.Info, enabling much better scalability and bonus documentation.
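
To give a rough idea of what that mixin does, the generated code for the example above would look something like the following. This is a sketch only; the actual generator in DCompute also handles sizing, array properties, and error checking.

// Roughly what generateGetInfo produces for the referenceCount member (a sketch).
@property int referenceCount()
{
    int value;
    clGetFooInfo(raw, 0x1234, value.sizeof, &value, null);
    return value;
}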

CUDA also exposes properties in a similar manner; however, they are not essential for getting things done (unlike OpenCL’s), so their development has been deferred.

Launching a kernel is a major pain point when dealing with the C APIs of both OpenCL and (only marginally less horrible) CUDA, due to the complete lack of type safety and the need to funnel far too many arguments through void* with the & operator. In DCompute, this incantation simply becomes

Event e = q.enqueue!(saxpy)([N])(b_res, alpha, b_x, b_y, N);

for OpenCL (1D with N work items), and

q.enqueue!(saxpy)([N, 1, 1], [1, 1, 1])(b_res, alpha, b_x, b_y, N);

for CUDA (equivalent to saxpy<<<N, 1, 0, q>>>(b_res, alpha, b_x, b_y, N);),

where q is a queue, N is the length of the buffers (b_res, b_x, and b_y), and saxpy (single-precision a·x plus y) is the kernel in this example. A full example may be found here, along with the magic that drives the OpenCL and CUDA enqueue functions.
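
For context, the saxpy kernel being launched is an ordinary D function in a @compute module, roughly like this (a sketch based on the kernel examples from the previous article; exact module and type names may differ):

@compute(CompileFor.deviceOnly) module kernels;

import ldc.dcompute;
import dcompute.std.index;

@kernel void saxpy(GlobalPointer!float res,
                   float alpha,
                   GlobalPointer!float x,
                   GlobalPointer!float y,
                   size_t N)
{
    auto i = GlobalIndex.x;       // global work-item / thread index
    if (i >= N) return;           // guard against excess work items
    res[i] = alpha * x[i] + y[i];
}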

The future of DCompute

While DCompute is functional, there is still much to do. The drivers still need some polish and user testing, and I need to set up continuous integration. A driver that unifies the different compute APIs is also in the works so that we can be even more cross-platform than the industry cross-platform standard.

Being able to convert SPIR-V into SPIR would enable targeting cl_khr_spir-capable 1.x and 2.0 CL implementations, dramatically increasing the number of devices that can run D kernel code (there’s nothing stopping you from using the OpenCL driver for other kernels, though).

On the compiler side of things, supporting OpenCL image and CUDA texture & surface operations in LDC would increase the applicability of the kernels that could be written.
I currently maintain a forward-ported fork of Khronos’s SPIR-V LLVM to generate SPIR-V from LLVM IR. I plan to use IWOCL to coordinate efforts to merge it into the LLVM trunk, and in doing so, remove the need for some of the hacks in place to deal with the oddities of the SPIR-V backend.

Using DCompute in your projects

If you want to use DCompute, you’ll need a recent LDC built against LLVM with the NVPTX (for CUDA) and/or SPIRV (for OpenCL 2.1+) targets enabled and should add "dcompute": "~>0.1.0" to your dub.json. LDC 1.4+ releases have NVPTX enabled. If you want to target OpenCL, you’ll need to build LDC yourself against my fork of LLVM.
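
A minimal dub.json for such a project might look like the following (the package name is just a placeholder):

{
    "name": "myapp",
    "dependencies": {
        "dcompute": "~>0.1.0"
    }
}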

DMD, Windows, and C

The ability to interface with C was baked into D from the beginning. Most of the time, it’s something that requires little thought – as long as the declarations on the D side match what exists on the C side, things will usually just work. However, there are a few corner-case gotchas that arise from the simple fact that D, though compatible, is not C.

An upcoming series of posts here on the D blog will delve into some of these dark corners and shine a light on the traps lying in wait. In these posts, readers will be asked to follow along by compiling and executing the examples themselves so they may more thoroughly understand the issues discussed. This means that, in addition to a D compiler, readers will need access to a C compiler.

That raises a potential snafu. On the systems that the DMD frontend groups under the version(Posix) umbrella, it’s a reliable assumption that a C compiler is easily available (if a D compiler is installed and functioning properly, the C compiler will already be installed). The concept of a system compiler is a long established tradition on those systems. On Windows… not so much.

So before diving into a series about C and D, a bit of a primer is called for. That’s where this post comes in. The primary goal is to help ensure a C environment is installed and working on Windows. It’s also useful to understand why things are different on that platform than on the others. Before we get to the why, we’ll dig into the how.

First, assume we have the following two source files in the same directory.

cfoo.c

#include <stdio.h>

void say_hello(void) 
{
    puts("Hello!");
}

dfoo.d

extern(C) void say_hello();

void main() 
{
    say_hello();
}

Now let’s see how to get the two working together.

DMD and C

The DMD packages for Windows ship with everything the compiler needs: a linker and other tools, plus a handful of critical system libraries. So on the one hand, Windows is the only platform where DMD has no external dependencies out of the box. On the other hand, it’s the only platform where a working DMD installation does not imply a C compiler is also installed. And these days, the out-of-the-box experience often isn’t the one you want.

On all the other platforms, the C compiler option is the system compiler, which in practice means GCC or Clang. The system linker to which DMD sends its generated object files might be ld, lld, ld.gold, or any ld-compatible linker. On Windows, there are currently two compiler choices, and neither can be assumed to be installed by default: the Digital Mars C and C++ compiler, dmc, or the Microsoft compiler, cl. We’ll look at each in turn.

DMD and DMC

The linker (Optlink) and other tools that ship with DMD are also part of the DMC distribution. DMD uses these tools by default (or when the -m32 switch is passed on the command line). To link any C objects or static libraries, they should be in the OMF format. Compilers that can generate OMF are a rare breed these days, and while something like Open Watcom may work, using DMC will guarantee 100% compatibility.

The DMC package (version 8.57 as I write) can be downloaded from digitalmars.com. It’s a 3 MB zip file that can be unzipped anywhere. Personally, since I only ever use it in conjunction with DMD, I keep it in C:\D so that the dm directory is a sibling of the dmd2 directory. Once it’s unzipped, it can be added to the path if desired. Be aware that some of the tools DMD and DMC ship with may conflict with tools in other packages if they are on the global path.

For example, both come with Digital Mars make and Optlink, which is named link.exe. The former might conflict with Cygwin, or a MinGW distribution that’s independent of MSYS2 (if mingw32-make has been renamed), and the latter with Microsoft’s linker (which generally shouldn’t be on the global path anyway). Some may prefer just to keep it all off the global path. In that case, it’s simple to configure a command prompt shortcut that sets the PATH when it launches. For example, create a batch file that looks like this:

@echo Welcome to your Digital Mars environment.
@set PATH=C:\D\dmd2\windows\bin;C:\D\dm\bin;%PATH%

Save it as C:\D\dmenv.bat. Right click an empty spot on the desktop and, from the popup menu, select New->Shortcut. In the location field, enter the following:

C:\Windows\System32\cmd.exe /k C:\D\dmenv.bat

Now you have a shortcut that, when double clicked, will launch a command prompt that has both dmd and dmc on the path.

Once installed, documentation on the command-line switches for the tools is available at the Digital Mars site. The most relevant are the docs for DMC, Optlink, and Librarian (lib.exe). The latter two will come in handy even when doing pure D development with vanilla DMD, as those are the tools needed when manually manipulating its object file output.

That’s all there is to it. As long as both dmc.exe and dmd.exe are on the path in any given command prompt, both compilers will find the tools they need via the default settings in their configuration files. For knocking together quick tests with both C and D on Windows, it’s a quick thing to launch a command prompt, compile & link, and execute:

dmc -c cfoo.c
dmd dfoo.d cfoo.obj
dfoo

Easy peasy. Now let’s look at the other option.

DMD and Microsoft’s CL

Getting DMD to work with the Microsoft toolchain requires installing the Microsoft build tools and the Windows SDK. The easiest way to get everything is to use one of the Community editions of Visual Studio. The installer will download and install all the tools and the SDK. The latest is always available from https://www.visualstudio.com. With VS 2017, the installer has been overhauled such that it’s possible to minimize the size of the install more than was possible with past editions. An alternative is to install the Microsoft Build Tools and the Windows SDK separately. However, this is still a large install that isn’t much of a win in light of the new VS 2017 installer options (for those on Windows 8.1 or 10).

Once the tooling is installed, DMD’s configuration file needs to be modified to point its environment variables to the proper locations. Rather than repeat all of that here, I’ll direct you to the DMD installation page at the D Wiki. One of the reasons to prefer the DMD installer over the zip archive is that it will detect any installation of Visual Studio or the Microsoft Build Tools and automatically modify the configuration as needed. This is more convenient than needing to remember to update the configuration every time a new version of DMD is installed. It also offers to install VS 2013 Community if the tooling isn’t found, can install Visual D (the D plugin for Visual Studio), and will add DMD to the system path if you want it to.

It’s a bit of an annoyance to launch Visual Studio for simple tests between C and D. Since it’s not recommended to put the MS tools on the system path, each VS and Microsoft Build Tools installation ships with a number of batch files that will set the path for you (like the one we created for DMC above). The installer sets up shortcuts in the Windows Start menu. There are several different options to choose from. To launch a 64-bit environment with VS 2017 (or the 2017 build tools), find Visual Studio 2017 in the Start menu and select x64 Native Tools Command Prompt for VS 2017. For VS 2015 (or the 2015 build tools), go to Visual Studio 2015 and click on VS 2015 x64 Native Build Tools Command Prompt. Similar options exist for 32-bit (where x86 replaces x64) and cross compiling.

From the VS-enabled 64-bit environment, we can run the following commands to compile our two files.

cl /c cfoo.c
dmd -m64 dfoo.d cfoo.obj
dfoo

In a 32-bit VS environment, replace -m64 with -m32mscoff.
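
That is, the full sequence in a 32-bit VS environment becomes:

cl /c cfoo.c
dmd -m32mscoff dfoo.d cfoo.obj
dfoo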

The consequences of history

When a new programming language is born these days, it’s not uncommon for its tooling to be built on top of an existing toolchain rather than completely from scratch. Whether we’re talking about languages like Kotlin built on the JVM, or those like Rust using LLVM, reusing existing tools saves time and allows the developers to focus their precious man-hours on the language itself and any language-specific tooling they require.

When Walter Bright first started putting D together in 1999, that trend had not yet come around. However, he already had an existing toolchain in the form of the Digital Mars C and C++ compiler tools. So it was a no-brainer to make use of his existing tools and compiler backend and just focus on making a new frontend for DMD. There were four major side-effects of this decision, all of which had varying consequences in D’s future development.

First, the DMC tools were Windows-only, so the early versions of DMD would be as well. Second, the linker, Optlink, only supports the OMF format. That meant that DMD’s output would be incompatible with the more common COFF output of most modern C and C++ compilers on Windows. Third, the DMC tools do not support 64-bit, so DMD would be restricted to 32-bit output. Finally, Symantec had the legal rights to the existing backend, which meant their license would apply to DMD. While the frontend was open source, the backend license required one to get permission from Walter to distribute DMD (on a side note, this prevented DMD from being included in official Linux package repositories once Linux support was added, but Symantec granted permission to relicense the backend earlier this year and it is now freely distributable under the Boost license).

DMD 0.00 was released in December of 2001. The 0.63 release brought Linux support in May of 2003. Walter could have based the Linux version on the GCC backend, but as a business owner, and through a caution born from past experience, he was concerned about any legal issues that could arise from his working with GPL code on one platform and maintaining a proprietary backend on another. Instead, he modified the DMD backend to generate ELF objects and hand them off to the GCC tools. This decision to enhance the backend became the approach for all new formats going forward. He did the same when adding support for Mac OS X: he modified the backend to work with the Mach-O object format.

Along with the new formats, the compiler gained the ability to generate 64-bit binaries everywhere except Windows. In order to interface with C on Windows, it was usually necessary to convert COFF object files and static libraries to OMF, to use a tool like coffimplib to generate DLL import libraries in the OMF format, or to create dynamic bindings and load DLLs manually via LoadLibrary and GetProcAddress. Then Remedy Games decided to use D.

Quantum Break was the first AAA game title to ship with D as part of its development process. Remedy used it for their gameplay code, creating their own open source tool to bind with their C++ game engine. Before they could get that far, however, they needed 64-bit support in DMD on Windows. That was the motivator to get it implemented. It took a while (apparently, there are some undocumented quirks in Microsoft’s variant of COFF, a.k.a. PECOFF, a.k.a. MS-COFF), but Walter eventually got it done, and support for 32-bit COFF along with it. Again, as he had on other platforms, he modified the backend to generate object files in the new format.

This is why it’s necessary to have the Microsoft toolchain installed in order to produce 64-bit binaries with DMD on Windows. Microsoft’s cl is as close to a system compiler as one is going to get on Windows. There is, however, an option that has not yet been fully explored. It’s a toolchain that can be freely distributed, packaged with a reasonable download size, supports 32-bit and 64-bit output, and is mostly compatible with PECOFF. There is a possibility that it may be investigated as an option for future DMD releases to be built upon.

Going from here

Now that this primer is out of the way, the short series on C is just about ready to go. It will kick off with a brief summary of existing material, showing how easy it is to get D and C to work together in the general case. That will be followed up by two posts on arrays and strings. This is where most of the gotchas come into play, and anyone using D and C in the same program should understand what they are and how to avoid them.

Unit Testing In Action

Mario Kröplin is a developer at Funkwerk AG, a German company whose passenger information system is developed in D and was recently highlighted on this blog. That post describes Funkwerk’s use of third-party unit testing frameworks and says, “the team recently discovered a way to combine xUnit testing with D’s built-in unittest, which may lead to another transition in their unit testing.” That’s Mario’s subject in this post.


There and Back Again

Ten years ago, programming in D was like starting over in our company. And, of course, unit testing was part of it right from the beginning. D’s built-in simple support made it easy to quickly write lots of unit tests. Until some of them failed. And soon, the failure became the rule. There’s always someone else to blame: D’s simple unit-test support is too simple. A look at Python reveals that the modules doctest and unittest live side by side in the standard library. We concluded that D’s unit test support corresponds to Python’s doctest, which means that there must be something else for the real unit testing.

Even back then, we immediately found such a unit testing framework in DUnit [An old D1 unit testing framework that you can read about at the old dsource.org – Ed.]. Thanks to good advice for xUnit testing, we were happy and content with this approach. At the end of life of D1, a replacement library for D2 was soon found. After a bumpy start, I found myself in the role of the maintainer of dunit [A D2 unit-testing framework that is separate from DUnit – Ed].

During DConf 2013, I copied a first example use of user-defined attributes to dunit. This allowed imitating JUnit 4, where, for example, test methods are annotated with @Test. By now, dunit imitates JUnit 5. So if you want to write unit tests in Java style, dunit is a good choice. But which D programmers would want to do that?

Recently, we reconsidered the weaknesses of D’s unit test support. Various solutions have been found to bypass the blockers (described in the following). On the other hand, good guidelines have been added, for example, to use attributes even for unittest functions. So we decided to return to making use of D’s built-in unit test support. From our detour, we retain some ideas to keep the test implementation maintainable.

Expectations

Whenever a unit test fails at run time, the question is, why? The error message refers to the line number, where you find something like assert(answer == 42). But what is the value of answer if it isn’t 42? The irony is that this need is well understood. If you use a static assert instead, the error message reads like: static assert 54 == 42 is false. The fear of code bloat is the reason why you don’t get this automatically at run time.

If you look at the Language Reference, you will notice that the chapter Unit Tests covers primarily the special unittest function. It is assumed that assert is used for test verification, which is introduced in the chapter Contract Programming. In theory, it’s completely OK to reuse assert for test verification. Any failure reveals a programming error that must be fixed. In practice, however, test expectations are quite different from preconditions, postconditions, and invariants. While the expectations are usually specific (actual == expected), the contracts rather exclude specific values (value != 0 or value !is null).

So there are lots of implementations of templates like assertEquals or test!"==". The problem shows up if you want to have the most helpful error messages: expected 42 but got 54. For this, assertEquals is too symmetrical. In fact, JUnit’s assertEquals(expected, actual) was turned into TestNG’s assertEquals(actual, expected). Even with UFCS (Uniform Function Call Syntax), it is not clear how a.assertEquals(b) should be used. From time to time, programmers don’t write the arguments in the intended order. Then the error messages are the opposite of helpful. They are misleading: expected 54 but got 42.

Fluent assertions avoid this symmetry problem: actual.should.eq(expected) or expect(actual).to.eq(expected) are harder to use incorrectly. Thanks to UFCS and lazy parameters, the implementation in D is no problem. The common criticism is “the natural language formulation is too verbose”, or just “too many dots”. Currently, however, this seems to be the only way to get the most helpful error messages.
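
To illustrate the idea, a fluent expectation can be built from little more than a free function and UFCS. This is a minimal sketch, not the actual fluent-asserts API:

import std.conv : text;

// A deliberately simple expectation helper: actual.should.eq(expected).
struct Should(T)
{
    T actual;
    string file;
    size_t line;

    void eq(U)(auto ref U expected)
    {
        // Throw a recoverable Exception (not AssertError) with a helpful message.
        if (actual != expected)
            throw new Exception(text("expected ", expected, " but got ", actual),
                                file, line);
    }
}

auto should(T)(auto ref T actual, string file = __FILE__, size_t line = __LINE__)
{
    return Should!T(actual, file, line);
}

unittest
{
    42.should.eq(42);     // passes
    // 54.should.eq(42);  // would fail with "expected 42 but got 54"
}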

The next problem is that string comparisons are seldom as simple as: expected foo but got bar. Non-printable characters or lengthy texts, such as XML or JSON representations, sabotage error messages that were meant to be helpful. This can be avoided by escaping special characters and by showing differences. Finally, this is what the fluent-asserts library does.

Test Execution

By and large, we want to get as much information as possible from a failed test run. How many test cases fail? Which test cases fail? Does the happy path fail or rather the edge cases? Is it worth addressing the failures, or is it better to undo the change? The approach of stopping on the first error is contrary to these needs. The original idea was to run the unit tests before the start of the actual program. By now, however, separate test runners are often used, which continue in case of a failure. To emphasize this, test expectations usually throw their own exceptions, instead of the unrecoverable AssertError. This change already shows how many test cases fail.

Finding out what’s tested in the failing test cases is more difficult. At best, there are corresponding comments for documented unit tests. But an empty DDoc comment, ///, is all that’s needed to include the body of the unittest function as an example in the documentation. In the worst case, the unit test goes on and on verifying this and that.

The idea of the Sentence Style For Naming Unit Tests is that the name of the test function describes the test case. In D, however, the unittest functions are anonymous. On the other hand, D has user-defined attributes so that you can even use strings for the test description instead of CamelCase names. unit-threaded, for example, shows these string attributes so that you get a good impression of the extent of the problem in case of a failure. In addition, unit-threaded satisfies the requirement to execute test cases selectively. For example, only the one problematic test case or all tests except those tagged as “slow”. It’s promising to use unit-threaded as needed. You let D run the unittest functions as long as they pass. Only for troubleshooting should you switch to unit-threaded. You have to be careful, however, to only use compatible features.
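
For example, a description can be attached to a unittest as a plain string attribute, and unit-threaded reports that string as the test’s name. The test body here is hypothetical:

@("the answer to everything is computed correctly")
unittest
{
    answer.should.eq(42);   // `answer` stands in for the function under test
}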

By the way: the parallel test execution (from its name, the main goal of unit-threaded) was quite problematic with the first test suite we converted. On the other hand, the speedup was just 10%.

Coverage

The D compiler has built-in code-coverage analysis. The ratio of the lines executed in the test is often used as an indicator for the quality of the tests. (See: Testing in the D Standard Library) A coverage of 100% cannot be achieved, for example, if you have an assert(0). Lower thresholds for the coverage can always be achieved by cheating. The fact that the unittest functions are also incorporated in the coverage is questionable. Imagine that a single line that has not yet been executed requires a lengthy unit test. As a consequence, this new unit test could significantly raise the coverage.

In order to avoid such measurement errors, we decided from the beginning to extract non-trivial unit tests to separate modules. We place these in parallel to the src tree in a unittest directory. Test utilities are also placed in the unittest directory, so that reading the actual code is not encumbered by large version (unittest) sections. (We also have test directories for customer tests.) For the coverage, we only count the modules under src. Code-coverage analysis creates a report file for each module. For a summary, which we output at the end of each successful test run, we have written a simple script. By now, covered is a ready-made solution.
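
For reference, the per-module reports come from the compiler’s built-in -cov switch; a minimal invocation looks something like this (the module name is just a placeholder) and writes a foo.lst file with per-line execution counts:

dmd -cov -unittest -main -run foo.d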

In order to fully exploit the code-coverage analysis, unusual formatting is required, for example, for the short-circuit evaluation of expressions with &&, ||, and ?:. We hope that dfmt can be changed to reformat the code temporarily.

Fixtures

What can you do to prevent the test implementation from getting out of control? After all, test code is also code that needs to be maintained. Sometimes the test implementation is more obscure than the code being tested.

As a solution the xUnit patterns suggest a structuring of the test implementation as a Four-Phase Test: fixture setup, exercise system under test, result verification, fixture teardown. The term fixture refers to the test context. For JUnit, this is the test class with attributes that are available to all test methods. A method with the annotation @BeforeEach initializes the attributes. This is the fixture setup. Another method with the annotation @AfterEach implements the fixture teardown. All methods annotated with @Test focus on exercise and verification.

At first glance, this approach seems to be incompatible with D’s unittest functions. The unittest functions do not get automatic access to the attributes of a class, even if they are defined in the context of a class. On the other hand, one can mimic the approach, for example, by implementing the fixtures next to the unittest functions as a struct:

unittest
{
    Fixture fixture;
    fixture.setup;
    scope (exit) fixture.teardown;
    (fixture.x * fixture.y).should.eq(42);
}

The test implementation can be improved by executing the fixture setup in the constructor (or in opCall(), since default constructors are disallowed in structs) and the fixture teardown in the destructor:

unittest
{
    with (Fixture())
        (x * y).should.eq(42);
}

The with (Fixture()) pulls the context, in which test methods are executed implicitly in JUnit, explicitly into the unittest function. With this simple pattern you can structure unit tests in a tried and trusted way without having to use a framework for test classes ever again.
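
For completeness, the Fixture used in the with (Fixture()) example might look something like the following hypothetical sketch; a real fixture would acquire and release whatever resources the tests need:

struct Fixture
{
    int x;
    int y;

    // Structs cannot have default constructors, so the fixture setup goes in a
    // static opCall, which is what the expression Fixture() invokes.
    static Fixture opCall()
    {
        Fixture fixture;
        fixture.x = 6;   // fixture setup: build the test context here
        fixture.y = 7;
        return fixture;
    }

    ~this()
    {
        // fixture teardown: release the test context here
    }
}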

Parameterized Tests

A parameterized test is a means to reuse a test implementation with different values or with different types. Within a unittest function this would be no problem. Our goal, however, is to get as much information as possible from a failing test run. For which values or which types does the test fail? unit-threaded provides support for parameterized tests with @Values and @Types. If unit-threaded is not used to run the unittest functions, these test cases do not work at all.

With the new static foreach feature, however, it is easy to implement parameterized tests without the support of a framework:

import std.format : format;  // for the test descriptions

static foreach (i; 0 .. 2)
    static foreach (j; 0 .. 2)
        @(format!"%s + %s == 1"(i, j))
        unittest
        {
            (i + j).should.eq(1);
        }

And if you run the failing tests with unit-threaded, the descriptions of the failing test cases reveal the problem without the need to take a look at the test implementation:

0 + 0 == 1: expected 1 but got 0
1 + 1 == 1: expected 1 but got 2

Conclusion

D’s built-in unit test support works best when there are no failures. As shown, however, you do not need to change too much to be able to work properly in situations where you rely on helpful error messages. The imitation of a solution from another programming language is often easy in D. Nevertheless, one should reconsider such solutions from time to time.

If we had a wish, we would want separate libraries for expectations and for test execution. Currently, you get frameworks where not all features are great, or they are overloaded with alternative solutions. Such a separation should probably be supported by the Phobos standard library. Currently, each framework defines expectations with its own unit test exceptions. In order to combine them, ugly interdependencies are required to match the exceptions thrown in one library to the exceptions caught in another library. A unit test exception in Phobos could avoid this problem.