{"id":3033,"date":"2022-01-14T13:32:07","date_gmt":"2022-01-14T13:32:07","guid":{"rendered":"https:\/\/dlang.org\/blog\/?p=3033"},"modified":"2023-07-11T16:47:56","modified_gmt":"2023-07-11T16:47:56","slug":"using-the-gcc-static-analyzer-on-the-d-programming-language","status":"publish","type":"post","link":"https:\/\/dlang.org\/blog\/2022\/01\/14\/using-the-gcc-static-analyzer-on-the-d-programming-language\/","title":{"rendered":"Using the GCC Static Analyzer on the D Programming Language"},"content":{"rendered":"<p><img loading=\"lazy\" src=\"https:\/\/dlang.org\/blog\/wp-content\/uploads\/2018\/02\/bug-200x156.jpg\" alt=\"\" width=\"200\" height=\"156\" class=\"alignleft size-thumbnail wp-image-1345\" \/><\/p>\n<p>Largely thanks to <a href=\"https:\/\/github.com\/ibuclaw\">the tireless work of Iain Buclaw<\/a>, the D programming language is part of GCC. As well as having access to an extremely potent set of compiler optimizations and a large group of target platforms, D also benefits from upstream features added to GCC as a whole or even for specific languages. For some projects, this can be very important, as some of these features require large quantities of careful work, for example, mitigations for transient execution vulnerabilities.<\/p>\n<p>A few years ago, <a href=\"https:\/\/developers.redhat.com\/author\/david-malcolm\">thanks to David Malcolm<\/a> at Red Hat, GCC gained a static analyzer. This uses a set of algorithms at compile time to find patterns in a program that would lead to memory safety bugs when the program is executed.<\/p>\n<h2 id=\"howdoiturniton\">How do I turn it on?<\/h2>\n<p>Run GDC like you normally would and add the <code>-fanalyzer<\/code> flag. If you&#8217;re already bored of reading and want to have a go, please use Matt Godbolt&#8217;s excellent compiler explorer. 
<a href=\"https:\/\/d.godbolt.org\/z\/Yz4n6c9nj\">Start with this simple example<\/a>.<\/p>\n<h2 id=\"whichpatternsdoesitlookfor\">Which patterns does it look for?<\/h2>\n<h3 id=\"somememorybugs\">Some memory bugs<\/h3>\n<p>From the GCC documentation, we can get a list of every warning the analyzer can emit:<\/p>\n<pre>-Wanalyzer-double-fclose \n-Wanalyzer-double-free \n-Wanalyzer-exposure-through-output-file \n-Wanalyzer-file-leak \n-Wanalyzer-free-of-non-heap \n-Wanalyzer-malloc-leak \n-Wanalyzer-mismatching-deallocation \n-Wanalyzer-possible-null-argument \n-Wanalyzer-possible-null-dereference \n-Wanalyzer-null-argument \n-Wanalyzer-null-dereference \n-Wanalyzer-shift-count-negative \n-Wanalyzer-shift-count-overflow \n-Wanalyzer-stale-setjmp-buffer \n-Wanalyzer-tainted-array-index \n-Wanalyzer-unsafe-call-within-signal-handler \n-Wanalyzer-use-after-free \n-Wanalyzer-use-of-pointer-in-stale-stack-frame \n-Wanalyzer-write-to-const \n-Wanalyzer-write-to-string-literal <\/pre>\n<p>These names are fairly descriptive. 
However, let&#8217;s take a look at some examples before going into detail.<\/p>\n<p>Let&#8217;s say we have some code that allocates a buffer for itself via <code>malloc<\/code>, like the following.<\/p>\n<pre class=\"prettyprint lang-d\">int usesTheHeap(size_t x)\n{\n    import core.stdc.stdlib : malloc, free;\n    int[] slice = (cast(int*) malloc(int.sizeof * x))[0..x];\n    slice[] = 0;\n    \/\/ Algorithm goes here\n    return 0;\n}<\/pre>\n<p>For this code, the static analyzer gives us two warnings, the first of which is the following:<\/p>\n<pre>warning: leak of 'slice.ptr' [CWE-401]\n   11 | }\n      | ^\n  'usesTheHeap': events 1-3\n    |\n    |    8 |     int[] slice = (cast(int*) malloc(int.sizeof * x))[0..x];\n    |      |                                     ^\n    |      |                                     |\n    |      |                                     (1) allocated here\n    |    9 |     slice[] = 0;\n    |      |     ~                                \n    |      |     |\n    |      |     (2) assuming 'slice.ptr' is non-NULL\n    |   10 |     \/\/ Algorithm goes here\n    |   11 | }\n    |      | ~                                    \n    |      | |\n    |      | (3) 'slice.ptr' leaks here; was allocated at (1)<\/pre>\n<p>As you might expect, since we didn&#8217;t free the memory we allocated, the analyzer warns us that the memory leaks at the end of the scope.<\/p>\n<p>The second warning complains that we used the memory from <code>malloc<\/code> without checking if it was <code>null<\/code>. 
Program failure due to dereferencing a null pointer is sometimes desirable in D, so you can turn this warning off with <code>-Wno-analyzer-possible-null-dereference<\/code> if you need to.<\/p>\n<p>Thanks to <code>assert<\/code> being built into the core language and being lowered to a construct that GCC understands, we can use it to make the analyzer assume a pointer is non-null:<\/p>\n<pre class=\"prettyprint lang-d\">int usesTheHeap(size_t x)\n{\n    import core.stdc.stdlib : malloc, free;\n    void* allocatedBuffer = malloc(int.sizeof * x);\n    assert(allocatedBuffer != null);\n    \/\/ The program may not proceed if the pointer is null\n    int[] slice = (cast(int*) allocatedBuffer)[0..x];\n    slice[] = 0; \/\/ So the analyzer knows this is safe.\n    \/\/ Algorithm goes here\n    return 0;\n}<\/pre>\n<h3 id=\"morethanmallocandfree\">More than <code>malloc<\/code> and <code>free<\/code><\/h3>\n<p>Let&#8217;s think about something that (obviously) uses memory, but isn&#8217;t always considered part of memory safety: although it&#8217;s not encouraged, you can use <code>setjmp<\/code> and <code>longjmp<\/code> from C in D code. As with many C features, these really can blow up in your face.<\/p>\n<p>Look at the following:<\/p>\n<pre class=\"prettyprint lang-d\">import core.sys.posix.setjmp;\n\nvoid main()\n{\n    jmp_buf local;\n    void set()\n    {\n        setjmp(local);\n    }\n    set();\n    longjmp(local, 0);\n} <\/pre>\n<p>We set the buffer inside <code>set<\/code>, but once <code>set<\/code> returns, the buffer is primed, ready, and pointing to nothing (technically it points to something, but that something is chaotic). 
Thankfully, the analyzer can warn us about this, as in the following:<\/p>\n<pre>&lt;source&gt;: In function 'D main':\n&lt;source&gt;:11:12: warning: 'longjmp' called after enclosing function of 'setjmp' has returned [-Wanalyzer-stale-setjmp-buffer]\n   11 |     longjmp(local, 0);\n      |            ^\n  'D main': events 1-2\n    |\n    |    3 | void main()\n    |      |      ^\n    |      |      |\n    |      |      (1) entry to 'D main'\n    |......\n    |   10 |     set();\n    |      |        ~\n    |      |        |\n    |      |        (2) calling 'set' from 'D main'\n    |\n    +--&gt; 'set': events 3-5\n           |\n           |    6 |     void set()\n           |      |          ^\n           |      |          |\n           |      |          (3) entry to 'set'\n           |    7 |     {\n           |    8 |         setjmp(local);\n           |      |               ~\n           |      |               |\n           |      |               (4) 'setjmp' called here\n           |    9 |     }\n           |      |     ~     \n           |      |     |\n           |      |     (5) stack frame is popped here, invalidating saved environment\n           |\n    &lt;------+\n    |\n  'D main': events 6-7\n    |\n    |   10 |     set();\n    |      |        ^\n    |      |        |\n    |      |        (6) returning to 'D main' from 'set'\n    |   11 |     longjmp(local, 0);\n    |      |            ~\n    |      |            |\n    |      |            (7) 'longjmp' called after enclosing function of 'setjmp' returned at (5)\n    |<\/pre>\n<h3 id=\"beyondskin-deep\">Beyond skin-deep<\/h3>\n<p>While important, stack corruption and (simple) memory leaks are old hat; catching them is usually relatively (touch wood) easy with modern programming practices, programming language design (i.e., sound memory safety analysis), sanitizers, and tools like Valgrind or your favorite debugger. 
For less trivial issues, <em>finding<\/em> the issues when they happen in a controlled environment is still relatively easy with the above tools if the program fails, but finding <em>why<\/em> they happened could require manually instrumenting the program. Finding issues early is important and appreciated.<\/p>\n<p>The analyzer is interprocedural, i.e., it can see across function boundaries (when the information is available). In older codebases you can sometimes see code like this:<\/p>\n<pre class=\"prettyprint lang-d\">import core.stdc.stdlib : free;\n\nstruct Handle\n{\n    void* x;\n    void reset()\n    {\n        free(x);\n    }\n    ~this()\n    {\n        free(x);\n    }\n}\nvoid accept(Handle x)\n{\n    x.reset();\n    \/\/ Destructor runs here and frees x a second time\n}<\/pre>\n<p>This yields a double-free. The analyzer is able to see &#8220;inside&#8221; the destructor and thus correctly warns about the double-free and what causes it.<\/p>\n<p>Detection of the next example seems to be sensitive to the optimization settings used, but it is very valuable when it works: iterator invalidation. 
That is to say, we hand out a pointer to somewhere, end up (say) <code>realloc<\/code>-ing, and suddenly that pristine pointer is now a pointer to absolutely nowhere.<\/p>\n<pre class=\"prettyprint lang-d\">struct Vector\n{\n    int* handle;\n    void expand(size_t sz)\n    {\n        int* newPtr = cast(int*) realloc(handle, sz);\n        assert(newPtr);\n        handle = newPtr;\n    }\n    ~this()\n    {\n        free(handle);\n    }\n}\nvoid iter(Vector x)\n{\n    int* copy = x.handle;\n    x.expand(1000);\n    *copy = 3;\n}<\/pre>\n<p>The analyzer sees this and spits out the following:<\/p>\n<pre>&lt;source&gt;: In function 'iter':\n&lt;source&gt;:23:11: warning: use after 'free' of 'copy_5' [CWE-416] [-Wanalyzer-use-after-free]\n   23 |     *copy = 3;\n      |           ^\n  'iter': events 1-2\n    |\n    |   19 | void iter(Vector x)\n    |      |      ^\n    |      |      |\n    |      |      (1) entry to 'iter'\n    |......\n    |   22 |     x.expand(1000);\n    |      |             ~\n    |      |             |\n    |      |             (2) calling 'expand' from 'iter'\n    |\n    +--&gt; 'expand': events 3-7\n           |\n           |    8 |     void expand(size_t sz)\n           |      |          ^\n           |      |          |\n           |      |          (3) entry to 'expand'\n           |    9 |     {\n           |   10 |         int* newPtr = cast(int*) realloc(handle, sz);\n           |      |                                         ~\n           |      |                                         |\n           |      |                                         (4) freed here\n           |      |                                         (5) when '__builtin_realloc' succeeds, moving buffer\n           |   11 |         assert(newPtr);\n           |      |         ~ \n           |      |         |\n           |      |         (6) following 'false' branch...\n           |   12 |         handle = newPtr;\n           |      |                ~\n           |   
   |                |\n           |      |                (7) ...to here\n           |\n    &lt;------+\n    |\n  'iter': events 8-9\n    |\n    |   22 |     x.expand(1000);\n    |      |             ^\n    |      |             |\n    |      |             (8) returning to 'iter' from 'expand'\n    |   23 |     *copy = 3;\n    |      |           ~  \n    |      |           |\n    |      |           (9) use after 'free' of 'copy_5'; freed at (4)\n    |<\/pre>\n<h3 id=\"inlineassembly\">Inline assembly<\/h3>\n<p>The analyzer was partly intended to help eliminate bugs in the Linux kernel. As such, it is useful to be able to analyze inline assembly (which is commonplace in the kernel). An example will not be given here, but GCC has gained the ability to analyze basic X86 inline assembly.<\/p>\n<h2 id=\"someidiosyncrasies\">Some idiosyncrasies<\/h2>\n<p>The static analyzer is implemented as just another pass inside GCC (there are hundreds). This means that some warnings may magically disappear under certain optimization settings as the compiler eliminates dead code and propagates information.<\/p>\n<p>Similarly, the quality of output does vary with the flags used. We won&#8217;t discuss it here, but options exist to increase the usefulness of diagnostics by performing more sophisticated analysis, for example, by propagating constraints through analyzed branches and thus eliminating some paths which are superficially &#8220;possible&#8221; but can, in fact, be eliminated by considering the semantics of the code.<\/p>\n<h2 id=\"findingbugswhencombiningcandd\">Finding bugs when combining C and D<\/h2>\n<p>The static analyzer was designed for use with C (and C++, but mostly the former) and <a href=\"https:\/\/en.wikipedia.org\/wiki\/Intermediate_representation\">operates on GCC&#8217;s IR<\/a>. 
If we use link-time optimization, we can combine the IR from compilation units in different languages (D and C), then use the analyzer to look for bugs across language boundaries.<\/p>\n<p>Let&#8217;s say we have an unfortunate C library with two functions, <code>doWork<\/code> and <code>terminate<\/code>. They both accept <code>void*<\/code>, but they expect the memory to be allocated by the user of the library rather than by a matching <code>init<\/code> function.<\/p>\n<pre class=\"prettyprint lang-c_cpp\">#include &lt;stdlib.h&gt;\nvoid doWork(void* ptr)\n{\n    \/\/ Do something, doesn't matter what here\n}\nvoid terminate(void* ptr)\n{\n    \/\/ Clean up things attached to ptr\n    free(ptr);\n}<\/pre>\n<p>Assuming we have no access to the C source and assuming the library documentation fails to mention that <code>terminate<\/code> calls <code>free<\/code>, we would likely write the following code:<\/p>\n<pre class=\"prettyprint lang-d\">extern(C) void doWork(void*);\nextern(C) void terminate(void*);\n\nvoid main()\n{\n    import core.stdc.stdlib : malloc, free;\n    void* buf = malloc(100);\n    scope(exit) free(buf);\n    buf.doWork();\n    buf.terminate();\n}<\/pre>\n<p>If we&#8217;re lucky, we&#8217;ll see an error message like<\/p>\n<pre>free(): double free detected in tcache 2\nAborted (core dumped)<\/pre>\n<p>which is better than nothing but nonetheless not ideal if we were unfamiliar with the code.<\/p>\n<p>If instead, we compile with <code>gdc d.d c.c -fanalyzer -flto<\/code> (the last flag is essential), we get this warning:<\/p>\n<pre>In function \u2018D main\u2019:\nd.d:11:14: warning: double-\u2018free\u2019 of \u2018buf_6\u2019 [CWE-415] [-Wanalyzer-double-free]\n   11 |  scope(exit) free(buf);\n      |              ^\n  \u2018D main\u2019: event 1\n    |\n    |\/usr\/lib\/gcc\/x86_64-linux-gnu\/10\/include\/d\/__entrypoint.di:33:5:\n    |   33 | int _Dmain(char[][] args);\n    |      |     ^\n    |      |     |\n    |      |     (1) entry to 
\u2018D main\u2019\n    |\n  \u2018D main\u2019: events 2-3\n    |\n    |d.d:10:8:\n    |   10 |  void* buf = malloc(100);\n    |      |        ^\n    |      |        |\n    |      |        (2) allocated here\n    |......\n    |   13 |  buf.terminate();\n    |      |  ~\n    |      |  |\n    |      |  (3) calling \u2018terminate\u2019 from \u2018D main\u2019\n    |\n    +--&gt; \u2018terminate\u2019: events 4-5\n           |\n           |c.c:6:6:\n           |    6 | void terminate(void* ptr)\n           |      |      ^\n           |      |      |\n           |      |      (4) entry to \u2018terminate\u2019\n           |    7 | {\n           |    8 |     free(ptr);\n           |      |     ~\n           |      |     |\n           |      |     (5) first \u2018free\u2019 here\n           |\n    &lt;------+\n    |\n  \u2018D main\u2019: events 6-7\n    |\n    |d.d:13:2:\n    |   11 |  scope(exit) free(buf);\n    |      |              ~\n    |      |              |\n    |      |              (7) second \u2018free\u2019 here; first \u2018free\u2019 was at (5)\n    |   12 |  buf.doWork();\n    |   13 |  buf.terminate();\n    |      |  ^\n    |      |  |\n    |      |  (6) returning to \u2018D main\u2019 from \u2018terminate\u2019\n    |<\/pre>\n<p>This found our bug straight away. Thank you very much, static analysis.<\/p>\n<h2 id=\"conclusion\">Conclusion<\/h2>\n<p>The way this analyzer is implemented can serve as a lesson on the usefulness of IRs as a tool for analysis rather than merely optimization. A similar analysis is currently performed on the AST in the D frontend, but that&#8217;s slow and fairly ugly to write (let alone read).<\/p>\n<p>I don&#8217;t think using a static analyzer is a replacement for a carefully designed language-level memory safety story, but I am very glad it exists. 
The fact that it is usable and useful from D is a testament to the benefits of D&#8217;s presence in GCC and diversity of implementation.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Largely thanks to the tireless work of Iain Buclaw, the D programming language is part of GCC. As well as having access to an extremely potent set of compiler optimizations and a large group of target platforms, D also benefits from upstream features added to GCC as a whole or even for specific languages. For [&hellip;]<\/p>\n","protected":false},"author":45,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":[],"categories":[12],"tags":[],"_links":{"self":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/3033"}],"collection":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/users\/45"}],"replies":[{"embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/comments?post=3033"}],"version-history":[{"count":3,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/3033\/revisions"}],"predecessor-version":[{"id":3037,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/posts\/3033\/revisions\/3037"}],"wp:attachment":[{"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/media?parent=3033"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/categories?post=3033"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dlang.org\/blog\/wp-json\/wp\/v2\/tags?post=3033"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}