{"id":972,"date":"2017-07-17T13:38:47","date_gmt":"2017-07-17T13:38:47","guid":{"rendered":"http:\/\/dlang.org\/blog\/?p=972"},"modified":"2021-10-08T11:12:24","modified_gmt":"2021-10-08T11:12:24","slug":"dcompute-gpgpu-with-native-d-for-opencl-and-cuda","status":"publish","type":"post","link":"https:\/\/dlang.org\/blog\/2017\/07\/17\/dcompute-gpgpu-with-native-d-for-opencl-and-cuda\/","title":{"rendered":"DCompute: GPGPU with Native D for OpenCL and CUDA"},"content":{"rendered":"<p><em>Nicholas Wilson is&nbsp;a student at Murdoch University, studying for his BEng (Hons)\/BSc in Industrial Computer Systems (Hons) and Instrumentation &amp; Control\/ Molecular Biology &amp; Genetics and Biomedical Science. He just finished his thesis on low-cost defect detection of solar cells by electroluminescence imaging, which gives him time to work on DCompute and write about it for the D Blog. He plays the piano, ice skates, and has spent 7 years putting D to use on number bashing, automation, and anything else that he could make a computer do for him.<\/em><\/p>\n<hr>\n<p><img loading=\"lazy\" class=\"size-full wp-image-978 alignleft\" src=\"http:\/\/dlang.org\/blog\/wp-content\/uploads\/2017\/07\/ldc.png\" alt=\"\" width=\"160\" height=\"160\">DCompute is a framework and compiler extension to support writing native kernels for OpenCL and CUDA in D to utilise GPUs and other accelerators for computationally intensive code. 
In development are drivers to automate the interactions between user code and the tedious and error-prone compute APIs, with the goal of enabling the rapid development of high-performance D libraries and applications.<\/p>\n<h3>Introduction<\/h3>\n<p>After watching <a href=\"http:\/\/dconf.org\/2016\/talks\/colvin.html\">John Colvin\u2019s DConf 2016 presentation<\/a> in May of last year on using D\u2019s metaprogramming to make the OpenCL API marginally less horrible to use, I thought, \u201cThis would be so much easier to do if we were able to write kernels in D, rather than doing string manipulations in OpenCL C\u201d. At the time, I was coming up to the end of a rather busy semester and thought that would make a good winter<sup><a id=\"afnote\" href=\"#fnote\">[1]<\/a><\/sup> project. After all, <a href=\"https:\/\/github.com\/ldc-developers\/ldc\">LDC<\/a>, the LLVM D Compiler, has access to LLVM&#8217;s <a href=\"https:\/\/github.com\/thewilsonator\/llvm-target-spirv\">SPIR-V<\/a> and PTX backends, and I thought, \u201cIt can\u2019t be too hard, it\u2019s only glue code\u201d. I <em>slightly<\/em> underestimated the time it would take, finishing the first stage of <a href=\"http:\/\/github.com\/libmir\/dcompute\">DCompute<\/a> (because naming things is hard), mainlining the changes I made to LDC at the end of February, eight months later &#8212; just in time for the close of submissions to DConf, where I gave a <a href=\"http:\/\/dconf.org\/2017\/talks\/wilson.html\">talk<\/a> on the progress I had made.<\/p>\n<p>Apart from familiarising myself with the LDC and DMD front-end codebases, I also had to understand the LLVM SPIR-V and PTX backends that I was trying to target, because they require the use of special metadata (e.g. 
denoting that a function is a kernel) and address spaces, which are used to represent <code>__global<\/code> &amp; friends in OpenCL C and <code>__global__<\/code> &amp; friends in CUDA, and introduce these concepts into LDC.<\/p>\n<p>But once I was familiar with the code and had sorted the above discrepancies, it was mostly smooth sailing translating the OpenCL and CUDA modifiers into compiler-recognised attributes and wrapping the intrinsics into an easy-to-use and consistent interface.<\/p>\n<p>When it was all working and almost ready to merge into mainline LDC, I hit a bit of a snag with regard to CI: the SPIR-V backend that was being developed by Khronos was based on the quite old LLVM 3.6.1 and, despite my suggestions, did not have any releases. So I forward-ported the backend and the conversion utility to the master branch of LLVM and made a <a href=\"https:\/\/github.com\/thewilsonator\/llvm\/releases\">release<\/a> myself. Still in progress on this front are converting magic intrinsics to proper LLVM intrinsics and transitioning to a TableGen-driven approach for the backend in preparation for merging the backend into LLVM trunk. This should hopefully be done soon\u2122.<\/p>\n<h3>Current state of DCompute<\/h3>\n<p>With DCompute in its current state, we are able to write kernels natively in D and have access to most of its language-defining features like <a href=\"https:\/\/tour.dlang.org\/tour\/en\/basics\/templates\">templates &amp; static introspection<\/a>, <a href=\"https:\/\/tour.dlang.org\/tour\/en\/gems\/uniform-function-call-syntax-ufcs\">UFCS<\/a>, <a href=\"https:\/\/tour.dlang.org\/tour\/en\/gems\/scope-guards\">scope guards<\/a>, <a href=\"https:\/\/tour.dlang.org\/tour\/en\/gems\/range-algorithms\">ranges &amp; algorithms<\/a> and <a href=\"https:\/\/tour.dlang.org\/tour\/en\/gems\/compile-time-function-evaluation-ctfe\">CTFE<\/a>. 
Notably missing, for hardware and performance reasons, are those features commonly excluded from kernel languages, like function pointers, virtual functions, dynamic recursion, RTTI, exceptions and the use of the garbage collector. Note that, unlike OpenCL C++, we allow kernel functions to be templated and to have overloads and default values. Still in development is support for images and pipes.<\/p>\n<h3>Example code<\/h3>\n<p>To write kernels in D, we need to pass <code>-mdcompute-targets=&lt;targets&gt;<\/code> to LDC, where <code>&lt;targets&gt;<\/code> is a comma-separated list of the desired targets to build for, e.g. <code>ocl-120,cuda-350<\/code> for OpenCL 1.2 and CUDA compute capability 3.5, respectively (yes, we can do them all at once!). We get one file for each target, e.g. <code>kernels_ocl120_64.spv<\/code>, when built in 64-bit mode, which contains all of the code for that device.<\/p>\n<p>The <code>vector add<\/code> kernel in D is:<\/p>\n<pre class=\"prettyprint lang-d\">@compute(CompileFor.deviceOnly) module example;\nimport ldc.dcompute;\nimport dcompute.std.index;\n\nalias gf = GlobalPointer!float;\n\n@kernel void vadd(gf a, gf b, gf c)\n{\n\t\/\/ Global index of the current work-item in the x dimension\n\tauto x = GlobalIndex.x;\n\ta[x] = b[x] + c[x];\n}<\/pre>\n<p>Modules marked with the <code>@compute<\/code> attribute are compiled for each of the command line targets, <code>@kernel<\/code> makes a function a kernel, and <code>GlobalPointer<\/code> is the equivalent of the <code>__global<\/code> qualifier in OpenCL.<\/p>\n<p>Kernels are not restricted to just functions &#8212; lambdas &amp; templates also work:<\/p>\n<pre class=\"prettyprint lang-d\">@kernel void map(alias F)(KernelArgs!F args)\n{\n    F(args);\n}\n\/\/ In host code\nAutoBuffer!float x, y, z; \/\/ y &amp; z initialised with data\nq.enqueue!(map!((a,b,c) =&gt; a=b+c))(x.length)(x, y, z);<\/pre>\n<p>Here, <code>KernelArgs<\/code> translates host types to device types (e.g. 
buffers to pointers or, as in this example, AutoBuffers to <a href=\"https:\/\/github.com\/libmir\/dcompute\/blob\/master\/source\/dcompute\/std\/index.d#L298\">AutoIndexed Pointers<\/a>) so that we encapsulate the differences between the host and device types.<\/p>\n<p>The last line is the expected syntax for launching kernels, <code>q.enqueue!kernel(dimensions)(args)<\/code>, akin to CUDA\u2019s <code>kernel&lt;&lt;&lt;dimensions,queue&gt;&gt;&gt;(args)<\/code>. The libraries for launching kernels are in development.<\/p>\n<p>Unlike CUDA, where all the magic for transforming the above expression into code on the host lies in the compiler, <code>q.enqueue!func(sizes)(args)<\/code> will be processed by static introspection in the driver library of DCompute.<br \/>\nThe sole reason we can do this in D is that we are able to query the mangled name the compiler will give to a symbol via the symbol\u2019s <code>.mangleof<\/code> property. This, in combination with D\u2019s powerful and easy-to-use templates, means we can significantly reduce the mental overhead associated with using the compute APIs. Also, doing this in the library will be much simpler, and therefore faster to implement, than putting the same behaviour in the compiler. 
While this may not seem like much to CUDA users, it will be a breath of fresh air to OpenCL users (just look at the <a href=\"http:\/\/www.heterogeneouscompute.org\/wordpress\/wp-content\/uploads\/2011\/06\/Chapter2.txt\">OpenCL vector add host code example<\/a> steps 7-11).<\/p>\n<p>While you can\u2019t do that just yet in DCompute, development should start to progress quickly, and it will hopefully become a reality soon.<\/p>\n<p>I would like to thank John Colvin for the initial inspiration, Mike Parker for editing, and the LDC folks, David Nadlinger, Kai Nacke, Martin Kinke, with special thanks to Johan Engelen, for their help with understanding the LDC codebase and reviewing my work.<\/p>\n<p>If you would like to help develop DCompute (or be kept in the loop), feel free to drop a line at the <a href=\"https:\/\/gitter.im\/libmir\/public\">libmir Gitter<\/a>. Similarly, any efforts to prepare the <a href=\"https:\/\/github.com\/thewilsonator\/llvm\">SPIR-V<\/a> <a href=\"https:\/\/github.com\/thewilsonator\/llvm-target-spirv\">backend<\/a> for inclusion into LLVM would be greatly appreciated.<\/p>\n<p id=\"fnote\"><a href=\"#afnote\">[1]<\/a> Southern hemisphere.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Nicholas Wilson is&nbsp;a student at Murdoch University, studying for his BEng (Hons)\/BSc in Industrial Computer Systems (Hons) and Instrumentation &amp; Control\/ Molecular Biology &amp; Genetics and Biomedical Science. 
He just finished his thesis on low-cost defect detection of solar cells by electroluminescence imaging, which gives him time to work on DCompute and write about it [&hellip;]<\/p>\n","protected":false},"author":22,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[26,12,9],"tags":[],"_links":{"self":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/972"}],"collection":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/users\/22"}],"replies":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/comments?post=972"}],"version-history":[{"count":9,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/972\/revisions"}],"predecessor-version":[{"id":982,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/972\/revisions\/982"}],"wp:attachment":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/media?parent=972"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/categories?post=972"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/tags?post=972"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}