{"id":2296,"date":"2020-01-27T14:04:43","date_gmt":"2020-01-27T14:04:43","guid":{"rendered":"http:\/\/dlang.org\/blog\/?p=2296"},"modified":"2021-09-30T13:37:28","modified_gmt":"2021-09-30T13:37:28","slug":"d-for-data-science-calling-r-from-d","status":"publish","type":"post","link":"https:\/\/dlang.org\/blog\/2020\/01\/27\/d-for-data-science-calling-r-from-d\/","title":{"rendered":"D For Data Science: Calling R from D"},"content":{"rendered":"<p><img loading=\"lazy\" class=\"alignleft size-full wp-image-181\" src=\"http:\/\/dlang.org\/blog\/wp-content\/uploads\/2016\/08\/d6.png\" alt=\"Digital Mars D logo\" width=\"200\" height=\"200\" \/>D is a good language for data science. The advantages include a pleasant syntax, interoperability with C (in many cases as simple as adding an <code>#include<\/code> directive to import a C header file <a href=\"https:\/\/dlang.org\/blog\/2019\/04\/08\/project-highlight-dpp\/\">via the dpp tool<\/a>), C-like speed, a large standard library, static typing, built-in unit tests and documentation generation, and a garbage collector that\u2019s there when you want it but can be avoided when you don\u2019t.<\/p>\n<p>Library selection for data science is a different story. Although there are some libraries available, such as <a href=\"http:\/\/code.dlang.org\/packages\/mir\">those provided by the mir project<\/a>, the available functionality is extremely limited compared with languages like R and Python. The good news is that it\u2019s possible to call functions in either language from D.<\/p>\n<p>This article shows how to embed an R interpreter inside a D program, pass data between the two languages, execute arbitrary R code from within a D program, and call the R interface to C, C++, and Fortran libraries from D. Although I only provide examples for Linux, the same steps apply for Windows if you\u2019re using WSL, and with minor modifications to the DUB package file, everything should work on macOS. 
Although it is possible to do so, I don\u2019t talk about calling D functions from R, and I don\u2019t include any discussion of interoperability with Python. (This <a href=\"http:\/\/code.dlang.org\/packages\/pyd\">is normally done using pyd<\/a>.)<\/p>\n<h2 id=\"dependencies\">Dependencies<\/h2>\n<p>The following three dependencies should be installed:<\/p>\n<ul>\n<li>R<\/li>\n<li>R package RInsideC<\/li>\n<li>R package embedr<\/li>\n<\/ul>\n<p>It\u2019s assumed that anyone reading this post already has R installed or can install it if they don\u2019t. The RInsideC package is a slightly modified version of the excellent <a href=\"http:\/\/dirk.eddelbuettel.com\/code\/rinside.html\">RInside<\/a> project of Dirk Eddelbuettel and Romain Francois. RInside provides a C++ interface to R. The modifications provide a C interface so that R can be called from any language capable of calling C functions. <a href=\"https:\/\/github.com\/r-lib\/devtools\">Install the package using devtools<\/a>:<\/p>\n<pre class=\"prettyprint lang-r\">library(devtools)\r\ninstall_bitbucket(\"bachmeil\/rinsidec\")<\/pre>\n<p><a href=\"https:\/\/embedr.netlify.com\/\">The embedr package<\/a> provides the necessary functions to work with R from within D. That package is also installed with devtools:<\/p>\n<pre class=\"prettyprint lang-r\">install_bitbucket(\"bachmeil\/embedr\")<\/pre>\n<h2 id=\"afirstprogram\">A First Program<\/h2>\n<p>The easiest way to do the compilation is to use <a href=\"https:\/\/dub.pm\/getting_started\">D\u2019s package manager, called DUB<\/a>. From within your project directory, open R and create a project skeleton:<\/p>\n<pre class=\"prettyprint lang-r\">library(embedr)\r\ndubNew()<\/pre>\n<p>This will create a <code>\/src<\/code> subdirectory to hold your project\u2019s source code if it doesn\u2019t already exist, add a file called <code>r.d<\/code> to <code>\/src<\/code>\u00a0and create a <code>dub.sdl<\/code> file in the project directory. 
Create a file in the <code>\/src<\/code> directory called <code>hello.d<\/code>, containing the following program:<\/p>\n<pre class=\"prettyprint lang-d\">import embedr.r;\r\n\r\nvoid main() {\r\n  evalRQ(`print(\"Hello, World!\")`);\r\n}<\/pre>\n<p>From the terminal, in the project directory (the one holding <code>dub.sdl<\/code>, not the <code>\/src<\/code> subdirectory), enter<\/p>\n<pre>dub run<\/pre>\n<p>This will print out \u201cHello, World!\u201d. The important thing to realize is that even though you just used DUB to compile and run a D program, it was R that printed \u201cHello, World!\u201d to the screen.<\/p>\n<h2 id=\"executingrcodefromd\">Executing R Code From D<\/h2>\n<p>There are two ways to execute R code from a D program. <code>evalR<\/code> executes a string in R and returns the output to D, while <code>evalRQ<\/code> does the same thing but suppresses the output. <code>evalRQ<\/code> also accepts an array of strings that are executed sequentially.<\/p>\n<p>Create a new project directory and run <code>dubNew<\/code> inside it, as you did for the first example. In the <code>src\/<\/code> subdirectory, add a file named <code>reval.d<\/code>:<\/p>\n<pre class=\"prettyprint lang-d\">import embedr.r;\r\nimport std.stdio;\r\n\r\nvoid main() {\r\n  \/\/ Example 1\r\n  evalRQ(`print(3+2)`); \/\/ evaluates to 5 in R, R prints the output [1] 5 to the screen\r\n\r\n  \/\/ Example 2\r\n  writeln(evalR(`3+2`).scalar); \/\/ evaluates to 5 in R, output is 5\r\n\r\n  \/\/ Example 3\r\n  evalRQ(`3+2`); \/\/ evaluates to 5 in R, but there is no output\r\n\r\n  \/\/ Example 4\r\n  evalRQ([`x &lt;- 3`, `y &lt;- 2`, `z &lt;- x+y`, `print(z)`]); \/\/ evaluates this code in R\r\n}<\/pre>\n<p>Example 1 tells R to print the sum of <code>3<\/code> and <code>2<\/code>. Because we use <code>evalRQ<\/code>, no output is returned to D, but R is able to print to the screen. 
Example 2 evaluates <code>3+2<\/code> in R and returns the output to D in the form of an <code>Robj<\/code>. <code>evalR(`3+2`).scalar<\/code> executes <code>3+2<\/code> inside R, captures the output in an <code>Robj<\/code>, and converts the <code>Robj<\/code> into a <code>double<\/code> holding the value <code>5<\/code>. This value is passed to the <code>writeln<\/code> function and printed to the screen. Example 3 doesn\u2019t output anything, because <code>evalRQ<\/code> does not return any output, and R isn\u2019t being told to print anything to the screen. Example 4 executes the four strings in the array sequentially, returning nothing to D, but the last string tells R to print the value of <code>z<\/code> to the screen.<\/p>\n<p>There\u2019s not much more to say about executing R code from D. You can execute any valid R code from D, and if there\u2019s an error, it will be caught and printed to the screen. Graphical output is automatically captured in a PDF file. To work interactively with R, or if it\u2019s sufficient to save the results to a text file and read them into D, this is all you need to know. The more interesting cases involve passing data between D and R, and for the times when there is no alternative, using the R interface to call directly into C, C++, or Fortran libraries.<\/p>\n<h2 id=\"passingdatabetweendandr\">Passing Data Between D and R<\/h2>\n<p>A little background is needed to understand how to pass data between D and R. Everything in R is represented as a C struct named <code>SEXPREC<\/code>, and a pointer to a <code>SEXPREC<\/code> struct is called a <code>SEXP<\/code> in the R source code. Those names reflect R\u2019s origin as a Scheme dialect, where code takes the form of s-expressions. In order to avoid misunderstanding, embedr uses the name <code>Robj<\/code> instead of <code>SEXP<\/code>.<\/p>\n<p>It&#8217;s necessary to let R allocate the memory for any data passed to R. 
For instance, you cannot tell D to allocate a <code>double[]<\/code> array and then pass a pointer to that array to R. You would instead do something like this:<\/p>\n<pre class=\"prettyprint lang-d\">auto v = RVector(100);\r\nforeach(ii; 0..100) {\r\n  v[ii] = 1.5*ii;\r\n}\r\nv.toR(\"vv\");\r\nevalRQ(`print(vv)`);<\/pre>\n<p>The first line tells R to allocate a vector with room for 100 elements. <code>v<\/code> is a D struct holding a pointer to the memory allocated by R plus additional information that allows you to read and change the elements of the vector. Behind the scenes, the <code>RVector<\/code> struct protects the vector from R\u2019s garbage collector. R is a garbage collected language, and if the only reference to the data is in your D program, there\u2019s nothing to prevent the R garbage collector from freeing that memory. The RVector struct uses the reference counting mechanism described in <a href=\"https:\/\/www.packtpub.com\/application-development\/d-cookbook\">Adam Ruppe\u2019s D Cookbook<\/a> to protect objects from R\u2019s garbage collector and unprotect them when they\u2019re no longer in use.<\/p>\n<p>After filling in all 100 elements of <code>v<\/code>, the <code>toR<\/code> function creates a new variable in R called <code>vv<\/code>, and associates it with the vector held inside <code>v<\/code>. The final line tells R to print out the variable <code>vv<\/code>.<\/p>\n<p>In practice, no data is ever passed between D and R. The only thing that\u2019s passed around is a single pointer to the memory allocated by R. 
That means it\u2019s practical to call R functions from D even for very large datasets.<\/p>\n<h2 id=\"callingtherapi\">Calling the R API<\/h2>\n<p><a href=\"\/\/cran.r-project.org\/doc\/manuals\/r-release\/R-exts.html#The-R-API\">The R API<\/a> provides a convenient (by C standards) interface to some of R\u2019s functions and constants, including the numerical optimization routines underlying <code>optim<\/code>, distribution functions, and random number generators. This example shows how to solve an unconstrained nonlinear optimization problem using the Nelder-Mead algorithm, which is the default when calling <code>optim<\/code> in R.<\/p>\n<p>The objective function is<\/p>\n<pre>f = x^2 + y^2<\/pre>\n<p>We want to choose <code>x<\/code> and <code>y<\/code> to minimize <code>f<\/code>. The obvious solution is <code>x=0<\/code> and <code>y=0<\/code>.<\/p>\n<p>Create a new project directory and initialize DUB from within R, with one additional step that adds the wrapper for R\u2019s optimization libraries:<\/p>\n<pre class=\"prettyprint lang-r\">library(embedr)\r\ndubNew()\r\ndubOptim()<\/pre>\n<p><code>dubOptim()<\/code> adds the file <code>optim.d<\/code> to the <code>src\/<\/code> directory. Create a file called <code>nelder.d<\/code> inside the <code>src\/<\/code> directory with the following program:<\/p>\n<pre class=\"prettyprint lang-d\">import embedr.r, embedr.optim;\r\nimport std.stdio;\r\n\r\nextern(C) {\r\n  double f(int n, double * par, void * ex) {\r\n    return par[0]*par[0] + par[1]*par[1];\r\n  }\r\n}\r\n\r\nvoid main() {\r\n  auto nm = NelderMead(&amp;f);\r\n  OptimSolution sol = nm.solve([3.5, -5.5]);\r\n  sol.print;\r\n}<\/pre>\n<p>First we define the objective function, <code>f<\/code>, using the C calling convention so it can be passed to various C functions. We then create a new <code>NelderMead<\/code> struct, passing a pointer to <code>f<\/code> to its constructor. 
Finally, we call the <code>solve<\/code> method, using <code>[3.5, -5.5]<\/code> as the array of starting values, and print out the solution. You\u2019ll want to confirm that the failure code in the output is false (implying the convergence criterion was met). The most common reason Nelder-Mead fails to converge is that it took too many iterations. To change the maximum number of iterations to 10,000, you\u2019d add <code>nm.maxit = 10_000;<\/code> to your program before the call to <code>nm.solve<\/code>.<\/p>\n<p>There\u2019s no overhead associated with calling an interpreted language in this example. We\u2019re calling a C shared library directly, and at no point does the R interpreter get involved. As in the previous example, since there\u2019s no copying of data, this approach is efficient even for large datasets. Finally, if you\u2019re not comfortable with garbage collection, the inner loops of the optimization are done entirely in C. We nonetheless do take advantage of the convenience and safety of D\u2019s garbage collector when allocating the <code>nm<\/code> and <code>sol<\/code> structs, as the performance advantages of manual memory management (to the extent that there are any) are irrelevant.<\/p>\n<h2 id=\"callingrinterfacesfromd\">Calling R Interfaces from D<\/h2>\n<p>The purpose of many R packages is to provide a convenient interface to a C, C++, or Fortran library. The term \u201cR interface\u201d normally means one of two things. For modern C or C++ code, it\u2019s a function taking <code>Robj<\/code> structs as arguments and returning one <code>Robj<\/code> struct as the output. For Fortran code and older C or C++ code, it\u2019s a void function taking pointers as arguments. 
In either case, you can call the R interface directly from D code, meaning any library with an R interface also has a D interface.<\/p>\n<p>An example of an R interface to Fortran code is found in <a href=\"\/\/cran.r-project.org\/web\/packages\/glmnet\/index.html\">the popular glmnet package<\/a>. Lasso estimation using the <code>elnet<\/code> function is done by passing 28 pointers to the function <code>elnet<\/code> in <code>libglmnet.so<\/code> with this interface:<\/p>\n<pre class=\"prettyprint lang-r\">.Fortran(\"elnet\", ka, parm=alpha, nobs, nvars, as.double(x), y,\r\n                  weights, jd, vp, cl, ne, nx, nlam, flmin, ulam, thresh,\r\n                  isd, intr, maxit, lmu=integer(1), a0=double(nlam),\r\n                  ca=double(nx*nlam), ia=integer(nx), nin=integer(nlam),\r\n                  rsq=double(nlam), alm=double(nlam), nlp=integer(1),\r\n                  jerr=integer(1), PACKAGE=\"glmnet\")<\/pre>\n<p>You might want to work with the R interface directly if you\u2019re calling <code>elnet<\/code> inside a loop in your D program. Most of the time it\u2019s better to pass the data to R and then call the R function that calls <code>elnet<\/code>. Calling Fortran functions can be error-prone, leading to hard-to-debug segmentation faults.<\/p>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>D was designed from the beginning to be compatible with the C ABI. The intention was to facilitate the integration of new D code into existing C code bases. The practical result has been that, due to C\u2019s <em>lingua franca<\/em> status, D can be used in combination with myriad languages. Data scientists looking for alternatives to C and C++ when working with R may find benefit in giving D a close look.<\/p>\n<p><em>Lance Bachmeier is an associate professor of economics at Kansas State University and co-editor of the journal <\/em>Energy Economics<em>. He does research on macroeconomics and energy economics. 
He has been using the D programming language in his research since 2013.<\/em><\/p>\n","protected":false},"excerpt":{"rendered":"<p>D is a good language for data science. The advantages include a pleasant syntax, interoperability with C (in many cases as simple as adding an #include directive to import a C header file via the dpp tool), C-like speed, a large standard library, static typing, built-in unit tests and documentation generation, and a garbage collector [&hellip;]<\/p>\n","protected":false},"author":37,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[26,29,9],"tags":[],"_links":{"self":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/2296"}],"collection":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/users\/37"}],"replies":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/comments?post=2296"}],"version-history":[{"count":6,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/2296\/revisions"}],"predecessor-version":[{"id":2307,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/2296\/revisions\/2307"}],"wp:attachment":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/media?parent=2296"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/categories?post=2296"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/tags?post=2296"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}