{"id":1206,"date":"2017-10-30T13:12:23","date_gmt":"2017-10-30T13:12:23","guid":{"rendered":"http:\/\/dlang.org\/blog\/?p=1206"},"modified":"2021-10-08T11:07:26","modified_gmt":"2021-10-08T11:07:26","slug":"d-compute-running-d-on-the-gpu","status":"publish","type":"post","link":"https:\/\/dlang.org\/blog\/2017\/10\/30\/d-compute-running-d-on-the-gpu\/","title":{"rendered":"DCompute: Running D on the GPU"},"content":{"rendered":"<p><em>Nicholas Wilson is a student at Murdoch University, studying for his BEng (Hons)\/BSc in Industrial Computer Systems (Hons) and Instrumentation &amp; Control \/ Molecular Biology &amp; Genetics and Biomedical Science. He just finished his thesis on low-cost defect detection of solar cells by electroluminescence imaging, which gives him time to work on DCompute and write about it for the D Blog. He plays the piano, ice skates, and has spent 7 years putting D to use on number bashing, automation, and anything else that he could make a computer do for him.<\/em><\/p>\n<hr \/>\n<p><img loading=\"lazy\" class=\"alignleft size-full wp-image-978\" src=\"http:\/\/dlang.org\/blog\/wp-content\/uploads\/2017\/07\/ldc.png\" alt=\"\" width=\"160\" height=\"160\" \/><\/p>\n<p>DCompute is a framework and compiler extension to support writing native kernels for OpenCL and CUDA in D to utilize GPUs and other accelerators for computationally intensive code. Its compute API drivers automate the interactions between user code and the tedious and error-prone APIs, with the goal of enabling the rapid development of high-performance D libraries and applications.<\/p>\n<h3 id=\"introduction\">Introduction<\/h3>\n<p>This is the second article on <a href=\"https:\/\/github.com\/libmir\/dcompute\">DCompute<\/a>. In the <a href=\"https:\/\/dlang.org\/blog\/2017\/07\/17\/dcompute-gpgpu-with-native-d-for-opencl-and-cuda\/\">previous article<\/a>, we looked at the development of DCompute and some trivial examples. 
While we were able to successfully build kernels, there was no way to run them short of using them with an existing framework or doing everything yourself. This is no longer the case. As of <a href=\"https:\/\/github.com\/libmir\/dcompute\/releases\/tag\/v0.1.0\">v0.1.0<\/a>, DCompute now comes with native wrappers for both OpenCL and CUDA, enabling kernel dispatch as easily as in CUDA.<\/p>\n<p>In order to run a kernel, we need to pass it off to the appropriate compute API, either CUDA or OpenCL. While these APIs both try to achieve similar things, they are different enough that, to squeeze the last bit of performance out of them, you need to treat each API separately. But there is sufficient overlap that we can make the interface reasonably consistent between the two. The C bindings to these APIs, however, are very low level, and trying to use them is very tedious and extremely prone to error (yay <code>void*<\/code>).<br \/>\nIn addition to the tedium and error-proneness, you have to redundantly specify a lot of information, which further compounds the problem. Fortunately, this is D, and we can remove a lot of the redundancy through introspection and code generation.<\/p>\n<p>The drivers wrap the C API, providing a clean and consistent interface that\u2019s easy to use. While the documentation is a little sparse at the moment, the source code is for the most part straightforward (if you\u2019re familiar with the C APIs, looking at where a function is used is a good place to start). There is the occasional piece of magic to achieve a sane API.<\/p>\n<h3 id=\"tamingthebeasts\">Taming the beasts<\/h3>\n<p>OpenCL\u2019s <code>clGet*Info<\/code> functions are the way to access properties of the class hidden behind the <code>void*<\/code>. 
A typical call looks like<\/p>\n<pre class=\"prettyprint lang-d\">enum CL_FOO_REFERENCE_COUNT = 0x1234;\r\ncl_foo* foo = ...;\r\ncl_int refCount;\r\nclGetFooInfo(foo, CL_FOO_REFERENCE_COUNT, refCount.sizeof, &amp;refCount, null);\r\n<\/pre>\n<p>And that\u2019s not even one of those you have to call twice: first to figure out how much memory you need to allocate, then again with the allocated buffer (and $DEITY help you if you want to get a <code>cl_program<\/code>\u2019s binaries).<\/p>\n<p>Using D, I have been able to turn that into this:<\/p>\n<pre class=\"prettyprint lang-d\">struct Foo\r\n{\r\n    void* raw;\r\n    static struct Info\r\n    {\r\n        @(0x1234) int referenceCount;\r\n        ...\r\n    }\r\n    mixin(generateGetInfo!(Info, clGetFooInfo));\r\n}\r\n\r\nFoo foo = ...;\r\nint refCount = foo.referenceCount;\r\n<\/pre>\n<p>All the magic is in <a href=\"https:\/\/github.com\/libmir\/dcompute\/blob\/master\/source\/dcompute\/driver\/ocl\/util.d\"><code>generateGetInfo<\/code><\/a>, which generates a property for each member in <code>Foo.Info<\/code>, enabling much better scalability and bonus documentation.<\/p>\n<p>CUDA also exposes properties in a similar manner; however, they are not essential for getting things done (unlike in OpenCL), so their development has been deferred.<\/p>\n<p>Launching a kernel is a large point of pain when dealing with the C API of both OpenCL and (only marginally less horrible) CUDA, due to the complete lack of type safety and having to pass everything through the <code>&amp;<\/code> operator into a <code>void*<\/code> far too much. 
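<\/p>\n<p>For contrast, here is roughly what that dispatch looks like through the raw OpenCL C API. This is only an illustrative sketch, not DCompute code: <code>kernel<\/code>, <code>queue<\/code>, and the buffers are assumed to have been created earlier, and real error handling is omitted:<\/p>\n<pre class=\"prettyprint lang-d\">cl_int err;\r\n\/\/ Each argument is set by index, size, and untyped pointer;\r\n\/\/ nothing checks that the size or the type is correct.\r\nerr  = clSetKernelArg(kernel, 0, cl_mem.sizeof, &amp;b_res);\r\nerr |= clSetKernelArg(kernel, 1, float.sizeof, &amp;alpha);\r\nerr |= clSetKernelArg(kernel, 2, cl_mem.sizeof, &amp;b_x);\r\nerr |= clSetKernelArg(kernel, 3, cl_mem.sizeof, &amp;b_y);\r\nerr |= clSetKernelArg(kernel, 4, N.sizeof, &amp;N);\r\nsize_t global = N;\r\ncl_event e;\r\nerr = clEnqueueNDRangeKernel(queue, kernel, 1, null, &amp;global, null, 0, null, &amp;e);\r\n<\/pre>\n<p>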
In DCompute, this incantation simply becomes<\/p>\n<pre class=\"prettyprint lang-d\">Event e = q.enqueue!(saxpy)([N])(b_res, alpha, b_x, b_y, N);\r\n<\/pre>\n<p>for OpenCL (1D with N work items), and<\/p>\n<pre class=\"prettyprint lang-d\">q.enqueue!(saxpy)([N, 1, 1], [1, 1, 1])(b_res, alpha, b_x, b_y, N);\r\n<\/pre>\n<p>for CUDA (equivalent to <code>saxpy&lt;&lt;&lt;N, 1, 0, q&gt;&gt;&gt;(b_res, alpha, b_x, b_y, N);<\/code>).<\/p>\n<p>Here, <code>q<\/code> is a queue, <code>N<\/code> is the length of the buffers (<code>b_res<\/code>, <code>b_x<\/code>, and <code>b_y<\/code>), and <code>saxpy<\/code> (single-precision <em>a x plus y<\/em>) is the kernel in this example. A full example may be found <a href=\"https:\/\/github.com\/libmir\/dcompute\/blob\/master\/source\/dcompute\/tests\/main.d\">here<\/a>, along with the magic that drives the <a href=\"https:\/\/github.com\/libmir\/dcompute\/blob\/4182fb8e1b2532adee2c6af3859856cc45cad85e\/source\/dcompute\/driver\/ocl\/queue.d#L79\">OpenCL<\/a> and <a href=\"https:\/\/github.com\/libmir\/dcompute\/blob\/4182fb8e1b2532adee2c6af3859856cc45cad85e\/source\/dcompute\/driver\/cuda\/queue.d#L60\">CUDA<\/a> enqueue functions.<\/p>\n<h3 id=\"thefutureofdcompute\">The future of DCompute<\/h3>\n<p>While DCompute is functional, there is still much to do. The drivers still need some polish and user testing, and I need to set up continuous integration. 
A driver that unifies the different compute APIs is also in the works, so that we can be even more cross-platform than the industry cross-platform standard.<\/p>\n<p>Being able to convert <a href=\"https:\/\/www.khronos.org\/spir\/\">SPIR-V into SPIR<\/a> would enable targeting <code>cl_khr_spir<\/code>-capable 1.x and 2.0 CL implementations, dramatically increasing the number of devices that can run D kernel code (there\u2019s nothing stopping you from using the OpenCL driver for other kernels, though).<\/p>\n<p>On the compiler side of things, supporting OpenCL image and CUDA texture &amp; surface operations in LDC would increase the applicability of the kernels that could be written.<br \/>\nI currently maintain a forward-ported fork of <a href=\"https:\/\/github.com\/KhronosGroup\/SPIRV-LLVM\">Khronos\u2019s SPIR-V LLVM<\/a> to generate SPIR-V from LLVM IR. I plan to use <a href=\"http:\/\/www.iwocl.org\/\">IWOCL<\/a> to coordinate efforts to merge it into the LLVM trunk, and in doing so, remove the need for some of the hacks in place to deal with the oddities of the SPIR-V backend.<\/p>\n<h3 id=\"usingdcomputeinyourprojects\">Using DCompute in your projects<\/h3>\n<p>If you want to use <a href=\"https:\/\/github.com\/libmir\/dcompute\">DCompute<\/a>, you\u2019ll need a recent <a href=\"https:\/\/github.com\/ldc-developers\/ldc\">LDC<\/a> built against LLVM with the <a href=\"https:\/\/llvm.org\/docs\/NVPTXUsage.html\">NVPTX<\/a> (for CUDA) and\/or SPIRV (for OpenCL 2.1+) targets enabled, and you should add <code>\"dcompute\": \"~&gt;0.1.0\"<\/code> to your <code>dub.json<\/code>. LDC 1.4+ releases have NVPTX enabled. 
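<\/p>\n<p>As a sketch (the package name here is a placeholder), a minimal <code>dub.json<\/code> for a project depending on DCompute might look like:<\/p>\n<pre class=\"prettyprint\">{\r\n    \"name\": \"myapp\",\r\n    \"dependencies\": {\r\n        \"dcompute\": \"~&gt;0.1.0\"\r\n    }\r\n}\r\n<\/pre>\n<p>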
If you want to target OpenCL, you\u2019ll need to build LDC yourself against <a href=\"https:\/\/github.com\/thewilsonator\/llvm\/tree\/compute\">my fork of LLVM<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>DCompute is a framework and compiler extension to support writing native kernels for OpenCL and CUDA in D to utilize GPUs and other accelerators for computationally intensive code. Its compute API drivers automate the interactions between user code and the tedious and error-prone APIs, with the goal of enabling the rapid development of high-performance D libraries and applications.<\/p>\n","protected":false},"author":22,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[26,12,9],"tags":[],"_links":{"self":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/1206"}],"collection":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/comments?post=1206"}],"version-history":[{"count":11,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/1206\/revisions"}],"predecessor-version":[{"id":1371,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/1206\/revisions\/1371"}],"wp:attachment":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/media?parent=1206"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/categories?post=1206"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/tags?post=1206"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}