Category Archives: BetterC

DasBetterC: Converting make.c to D

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century. This post is the third in a series about D’s BetterC mode


D as BetterC (a.k.a. DasBetterC) is a way to upgrade existing C projects to D in an incremental manner. This article shows a step-by-step process of converting a non-trivial C project to D and deals with common issues that crop up.

While the dmd D compiler front end has already been converted to D, it’s such a large project that it can be hard to see just what was involved. I needed to find a smaller, more modest project that can be easily understood in its entirety, yet is not a contrived example.

The old make program I wrote for the Datalight C compiler in the early 1980’s came to mind. It’s a real implementation of the classic make program that’s been in constant use since the early 80’s. It’s written in pre-Standard C, has been ported from system to system, and is a remarkably compact 1961 lines of code, including comments. It is still in regular use today.

Here’s the make manual, and the source code. The executable size for make.exe is 49,692 bytes and the last modification date was Aug 19, 2012.

The Evil Plan is:

  1. Minimize diffs between the C and D versions. This is so that if the programs behave differently, it is far easier to figure out the source of the difference.
  2. No attempt will be made to fix or improve the C code during translation. This is also in the service of (1).
  3. No attempt will be made to refactor the code. Again, see (1).
  4. Duplicate the behavior of the C program as exactly and as much as possible,
    bugs and all.
  5. Do whatever is necessary as needed in the service of (4).

Once that is completed, only then is it time to fix, refactor, clean up, etc.

Spoiler Alert!

The completed conversion. The resulting executable is 52,252 bytes (quite comparable to the original 49,692). I haven’t analyzed the increment in size, but it is likely due to instantiations of the NEWOBJ template (a macro in the C version), and changes in the DMC runtime library since 2012.

Step By Step

Here are the differences between the C and D versions. It’s 664 out of 1961 lines, about a third, which looks like a lot, but I hope to convince you that nearly all of it is trivial.

The #include files are replaced by corresponding D imports, such as replacing #include <stdio.h> with import core.stdc.stdio;. Unfortunately, some of the #include files are specific to Digital Mars C, and D versions do not exist (I need to fix that). To not let that stop the project, I simply included the relevant declarations in lines 29 to 64. (See the documentation for the import declaration.)

#if _WIN32 is replaced with version (Windows). (See the documentation for the version condition and predefined versions.)

extern (C): marks the remainder of the declarations in the file as compatible with C. (See the documentation for the linkage attribute.)

A global search/replace changes uses of the debug1, debug2 and debug3 macros to debug printf. In general, #ifdef DEBUG preprocessor directives are replaced with debug conditional compilation. (See the documentation for the debug statement.)

/* Delete these old C macro definitions...
#ifdef DEBUG
-#define debug1(a)       printf(a)
-#define debug2(a,b)     printf(a,b)
-#define debug3(a,b,c)   printf(a,b,c)
-#else
-#define debug1(a)
-#define debug2(a,b)
-#define debug3(a,b,c)
-#endif
*/

// And replace their usage with the debug statement
// debug2("Returning x%lx\n",datetime);
debug printf("Returning x%lx\n",datetime);

The TRUE, FALSE and NULL macros are search/replaced with true, false, and null.

The ESC macro is replaced by a manifest constant. (See the documentation for manifest constants.)

// #define ESC     '!'
enum ESC =      '!';

The NEWOBJ macro is replaced with a template function.

// #define NEWOBJ(type)    ((type *) mem_calloc(sizeof(type)))
type* NEWOBJ(type)() { return cast(type*) mem_calloc(type.sizeof); }

The filenamecmp macro is replaced with a function.

Support for obsolete platforms is removed.

Global variables in D are placed by default into thread-local storage (TLS). But since make is a single-threaded program, they can be inserted into global storage with the __gshared storage class. (See the documentation for the __gshared attribute.)

// int CMDLINELEN;
__gshared int CMDLINELEN

D doesn’t have a separate struct tag name space, so the typedefs are not necessary. An
alias can be used instead. (See the documentation for alias declarations.) Also, struct is omitted from variable declarations.

/*
typedef struct FILENODE
        {       char            *name,genext[EXTMAX+1];
                char            dblcln;
                char            expanding;
                time_t          time;
                filelist        *dep;
                struct RULE     *frule;
                struct FILENODE *next;
        } filenode;
*/
struct FILENODE
{
        char            *name;
        char[EXTMAX1]  genext;
        char            dblcln;
        char            expanding;
        time_t          time;
        filelist        *dep;
        RULE            *frule;
        FILENODE        *next;
}

alias filenode = FILENODE;

macro is a keyword in D, so we’ll just use MACRO instead.

Grouping together multiple pointer declarations is not allowed in D, use this instead:

// char *name,*text;
// In D, the * is part of the type and 
// applies to each symbol in the declaration.
char* name, text;

C array declarations are transformed to D array declarations. (See the documentation for D’s declaration syntax.)

// char            *name,genext[EXTMAX+1];
char            *name;
char[EXTMAX+1]  genext;

static has no meaning at module scope in D. static globals in C are equivalent to private module-scope variables in D, but that doesn’t really matter when the module is never imported anywhere. They still need to be __gshared and that can be applied to an entire block of declarations. (See the documentation for the static attribute)

/*
static ignore_errors = FALSE;
static execute = TRUE;
static gag = FALSE;
static touchem = FALSE;
static debug = FALSE;
static list_lines = FALSE;
static usebuiltin = TRUE;
static print = FALSE;
...
*/

__gshared
{
    bool ignore_errors = false;
    bool execute = true;
    bool gag = false;
    bool touchem = false;
    bool xdebug = false;
    bool list_lines = false;
    bool usebuiltin = true;
    bool print = false;
    ...
}

Forward reference declarations for functions are not necessary in D. Functions defined in a module can be called at any point in the same module, before or after their definition.

Wildcard expansion doesn’t have much meaning to a make program.

Function parameters declared with array syntax are pointers in reality, and are declared as pointers in D.

// int cdecl main(int argc,char *argv[])
int main(int argc,char** argv)

mem_init() expands to nothing and we previously removed the macro.

C code can play fast and loose with arguments to functions, D demands that function prototypes be respected.

void cmderr(const char* format, const char* arg) {...}

// cmderr("can't expand response file\n");
cmderr("can't expand response file\n", null);

Global search/replace C’s arrow operator (->) with the dot operator (.), as member access in D is uniform.

Replace conditional compilation directives with D’s version.

/*
 #if TERMCODE
    ...
 #endif
*/
    version (TERMCODE)
    {
        ...
    }

The lack of function prototypes shows the age of this code. D requires proper prototypes.

// doswitch(p)
// char *p;
void doswitch(char* p)

debug is a D keyword. Rename it to xdebug.

The \n\ line endings for C multiline string literals are not necessary in D.

Comment out unused code using D’s /+ +/ nesting block comments. (See the documentation for line, block and nesting block comments.)

static if can replace many uses of #if. (See the documentation for the static if condition.)

Decay of arrays to pointers is not automatic in D, use .ptr.

// utime(name,timep);
utime(name,timep.ptr);

Use const for C-style strings derived from string literals in D, because D won’t allow taking mutable pointers to string literals. (See the documentation for const and immutable.)

// linelist **readmakefile(char *makefile,linelist **rl)
linelist **readmakefile(const char *makefile,linelist **rl)

void* cannot be implicitly cast to char*. Make it explicit.

// buf = mem_realloc(buf,bufmax);
buf = cast(char*)mem_realloc(buf,bufmax);

Replace unsigned with uint.

inout can be used to transfer the “const-ness” of a function from its argument to its return value. If the parameter is const, so will be the return value. If the parameter is not const, neither will be the return value. (See the documentation for inout functions.)

// char *skipspace(p) {...}
inout(char) *skipspace(inout(char)* p) {...}

arraysize can be replaced with the .length property of arrays. (See the documentation for array properties.)

// useCOMMAND  |= inarray(p,builtin,arraysize(builtin));
useCOMMAND  |= inarray(p,builtin.ptr,builtin.length)

String literals are immutable, so it is necessary to replace mutable ones with a stack allocated array. (See the documentation for string literals.)

// static char envname[] = "@_CMDLINE";
char[10] envname = "@_CMDLINE";

.sizeof replaces C’s sizeof(). (See the documentation for the .sizeof property).

// q = (char *) mem_calloc(sizeof(envname) + len);
q = cast(char *) mem_calloc(envname.sizeof + len);

Don’t care about old versions of Windows.

Replace ancient C usage of char * with void*.

And that wraps up the changes! See, not so bad. I didn’t set a timer, but I doubt this took more than an hour, including debugging a couple errors I made in the process.

This leaves the file man.c, which is used to open the browser on the make manual page when the -man switch is given. Fortunately, this was already ported to D, so we can just copy that code.

Building make is so easy it doesn’t even need a makefile:

\dmd2.079\windows\bin\dmd make.d dman.d -O -release -betterC -I. -I\dmd2.079\src\druntime\import\ shell32.lib

Summary

We’ve stuck to the Evil Plan of translating a non-trivial old school C program to D, and thereby were able to do it quickly and get it working correctly. An equivalent executable was generated.

The issues encountered are typical and easily dealt with:

  • Replacement of #include with import
  • Lack of D versions of #include files
  • Global search/replace of things like ->
  • Replacement of preprocessor macros with:
    • manifest constants
    • simple templates
    • functions
    • version declarations
    • debug declarations
  • Handling identifiers that are D keywords
  • Replacement of C style declarations of pointers and arrays
  • Unnecessary forward references
  • More stringent typing enforcement
  • Array handling
  • Replacing C basic types with D types

None of the following was necessary:

  • Reorganizing the code
  • Changing data or control structures
  • Changing the flow of the program
  • Changing how the program works
  • Changing memory management

Future

Now that it is in DasBetterC, there are lots of modern programming features available to improve the code:

Action

Let us know over at the D Forum how your DasBetterC project is coming along!

Vanquish Forever These Bugs That Blasted Your Kingdom

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century. This post is the second in a series about D’s BetterC mode.


Do you ever get tired of bugs that are easy to make, hard to check for, often don’t show up in testing, and blast your kingdom once they are widely deployed? They cost you time and money again and again. If you were only a better programmer, these things wouldn’t happen, right?

Maybe it’s not you at all. I’ll show how these bugs are not your fault – they’re the tools’ fault, and by improving the tools you’ll never have your kingdom blasted by them again.

And you won’t have to compromise, either.

Array Overflow

Consider this conventional program to calculate the sum of an array:

#include <stdio.h>

#define MAX 10

int sumArray(int* p) {
    int sum = 0;
    int i;
    for (i = 0; i <= MAX; ++i)
        sum += p[i];
    return sum;
}

int main() {
    static int values[MAX] = { 7,10,58,62,93,100,8,17,77,17 };
    printf("sum = %d\n", sumArray(values));
    return 0;
}

The program should print:

sum = 449

And indeed it does, on my Ubuntu Linux system, with both gcc and clang and -Wall. I’m sure you already know what the bug is:

for (i = 0; i <= MAX; ++i)
              ^^

This is the classic “fencepost problem”. It goes through the loop 11 times instead of 10. It should properly be:

for (i = 0; i < MAX; ++i)

Note that even with the bug, the program still produced the correct result! On my system, anyway. So I wouldn’t have detected it. On the customer’s system, well, then it mysteriously fails, and I have a remote heisenbug. I’m already tensing up anticipating the time and money this is going to cost me.

It’s such a rotten bug that over the years I have reprogrammed my brain to:

  1. Never, ever use “inclusive” upper bounds.
  2. Never, ever use <= in a for loop condition.

By making myself a better programmer, I have solved the problem! Or have I? Not really. Let’s look again at the code from the perspective of the poor schlub who has to review it. He wants to ensure that sumArray() is correct. He must:

  1. Look at all callers of sumArray() to see what kind of pointer is being passed.
  2. Verify that the pointer actually is pointing to an array.
  3. Verify that the size of the array is indeed MAX.

While this is trivial for the trivial program as presented here, it doesn’t really scale as the program complexity goes up. The more callers there are of sumArray, and the more indirect the data structures being passed to sumArray, the harder it is to do what amounts to data flow analysis in your head to ensure it is correct.

Even if you get it right, are you sure? What about when someone else checks in a change, is it still right? Do you want to do that analysis again? I’m sure you have better things to do. This is a tooling problem.

The fundamental issue with this particular problem is that a C array decays to a pointer when it’s an argument to a function, even if the function parameter is declared to be an array. There’s just no escaping it. There’s no detecting it, either. (At least gcc and clang don’t detect it, maybe someone has developed an analyzer that does).

And so the tool to fix it is D as a BetterC compiler. D has the notion of a dynamic array, which is simply a fat pointer, that is laid out like:

struct DynamicArray {
    T* ptr;
    size_t length;
}

It’s declared like:

int[] a;

and with that the example becomes:

import core.stdc.stdio;

extern (C):   // use C ABI for declarations

enum MAX = 10;

int sumArray(int[] a) {
    int sum = 0;
    for (int i = 0; i <= MAX; ++i)
        sum += a[i];
    return sum;
}

int main() {
    __gshared int[MAX] values = [ 7,10,58,62,93,100,8,17,77,17 ];
    printf("sum = %d\n", sumArray(values));
    return 0;
}

Compiling:

dmd -betterC sum.d

Running:

./sum
Assertion failure: 'array overflow' on line 11 in file 'sum.d'

That’s more like it. Replacing the <= with < we get:

./sum
sum = 449

What’s happening is the dynamic array a is carrying its dimension along with it and the compiler inserts an array bounds overflow check.

But wait, there’s more.

There’s that pesky MAX thing. Since the a is carrying its dimension, that can be used instead:

for (int i = 0; i < a.length; ++i)

This is such a common idiom, D has special syntax for it:

foreach (value; a)
    sum += value;

The whole function sumArray() now looks like:

int sumArray(int[] a) {
    int sum = 0;
    foreach (value; a)
        sum += value;
    return sum;
}

and now sumArray() can be reviewed in isolation from the rest of the program. You can get more done in less time with more reliability, and so can justify getting a pay raise. Or at least you won’t have to come in on weekends on an emergency call to fix the bug.

“Objection!” you say. “Passing a to sumArray() requires two pushes to the stack, and passing p is only one. You said no compromise, but I’m losing speed here.”

Indeed you are, in cases where MAX is a manifest constant, and not itself passed to the function, as in:

int sumArray(int *p, size_t length);

But let’s get back to “no compromise.” D allows parameters to be passed by reference,
and that includes arrays of fixed length. So:

int sumArray(ref int[MAX] a) {
    int sum = 0;
    foreach (value; a)
        sum += value;
    return sum;
}

What happens here is that a, being a ref parameter, is at runtime a mere pointer. It is typed, though, to be a pointer to an array of MAX elements, and so the accesses can be array bounds checked. You don’t need to go checking the callers, as the compiler’s type system will verify that, indeed, correctly sized arrays are being passed.

“Objection!” you say. “D supports pointers. Can’t I just write it the original way? What’s to stop that from happening? I thought you said this was a mechanical guarantee!”

Yes, you can write the code as:

import core.stdc.stdio;

extern (C):   // use C ABI for declarations

enum MAX = 10;

int sumArray(int* p) {
    int sum = 0;
    for (int i = 0; i <= MAX; ++i)
        sum += p[i];
    return sum;
}

int main() {
    __gshared int[MAX] values = [ 7,10,58,62,93,100,8,17,77,17 ];
    printf("sum = %d\n", sumArray(&values[0]));
    return 0;
}

It will compile without complaint, and the awful bug will still be there. Though this time I get:

sum = 39479

which looks suspicious, but it could have just as easily printed 449 and I’d be none the wiser.

How can this be guaranteed not to happen? By adding the attribute @safe to the code:

import core.stdc.stdio;

extern (C):   // use C ABI for declarations

enum MAX = 10;

@safe int sumArray(int* p) {
    int sum = 0;
    for (int i = 0; i <= MAX; ++i)
        sum += p[i];
    return sum;
}

int main() {
    __gshared int[MAX] values = [ 7,10,58,62,93,100,8,17,77,17 ];
    printf("sum = %d\n", sumArray(&values[0]));
    return 0;
}

Compiling it gives:

sum.d(10): Error: safe function 'sum.sumArray' cannot index pointer 'p'

Granted, a code review will need to include a grep to ensure @safe is being used, but that’s about it.

In summary, this bug is vanquished by preventing an array from decaying to a pointer when passed as an argument, and is vanquished forever by disallowing indirections after arithmetic is performed on a pointer. I’m sure a rare few of you have never been blasted by buffer overflow errors. Stay tuned for the next installment in this series. Maybe your moat got breached by the next bug! (Or maybe your tool doesn’t even have a moat.)

D as a Better C

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century. This post is the first in a series about D’s BetterC mode


D was designed from the ground up to interface directly and easily to C, and to a lesser extent C++. This provides access to endless C libraries, the Standard C runtime library, and of course the operating system APIs, which are usually C APIs.

But there’s much more to C than that. There are large and immensely useful programs written in C, such as the Linux operating system and a very large chunk of the programs written for it. While D programs can interface with C libraries, the reverse isn’t true. C programs cannot interface with D ones. It’s not possible (at least not without considerable effort) to compile a couple of D files and link them in to a C program. The trouble is that compiled D files refer to things that only exist in the D runtime library, and linking that in (it’s a bit large) tends to be impractical.

D code also can’t exist in a program unless D controls the main() function, which is how the startup code in the D runtime library is managed. Hence D libraries remain inaccessible to C programs, and chimera programs (a mix of C and D) are not practical. One cannot pragmatically “try out” D by add D modules to an existing C program.

That is, until Better C came along.

It’s been done before, it’s an old idea. Bjarne Stroustrup wrote a paper in 1988 entitled “A Better C“. His early C++ compiler was able to compile C code pretty much unchanged, and then one could start using C++ features here and there as they made sense, all without disturbing the existing investment in C. This was a brilliant strategy, and drove the early success of C++.

A more modern example is Kotlin, which uses a different method. Kotlin syntax is not compatible with Java, but it is fully interoperable with Java, relies on the existing Java libraries, and allows a gradual migration of Java code to Kotlin. Kotlin is indeed a “Better Java”, and this shows in its success.

D as Better C

D takes a radically different approach to making a better C. It is not an extension of C, it is not a superset of C, and does not bring along C’s longstanding issues (such as the preprocessor, array overflows, etc.). D’s solution is to subset the D language, removing or altering features that require the D startup code and runtime library. This is, simply, the charter of the -betterC compiler switch.

Doesn’t removing things from D make it no longer D? That’s a hard question to answer, and it’s really a matter of individual preference. The vast bulk of the core language remains. Certainly the D characteristics that are analogous to C remain. The result is a language somewhere in between C and D, but that is fully upward compatible with D.

Removed Things

Most obviously, the garbage collector is removed, along with the features that depend on the garbage collector. Memory can still be allocated the same way as in C – using malloc() or some custom allocator.

Although C++ classes and COM classes will still work, D polymorphic classes will not, as they rely on the garbage collector.

Exceptions, typeid, static construction/destruction, RAII, and unittests are removed. But it is possible we can find ways to add them back in.

Asserts are altered to call the C runtime library assert fail functions rather than the D runtime library ones.

(This isn’t a complete list, for that see http://dlang.org/dmd-windows.html#switch-betterC.)

Retained Things

More importantly, what remains?

What may be initially most important to C programmers is memory safety in the form of array overflow checking, no more stray pointers into expired stack frames, and guaranteed initialization of locals. This is followed by what is expected in a modern language — modules, function overloading, constructors, member functions, Unicode, nested functions, dynamic closures, Compile Time Function Execution, automated documentation generation, highly advanced metaprogramming, and Design by Introspection.

Footprint

Consider a C program:

#include <stdio.h>

int main(int argc, char** argv) {
    printf("hello world\n");
    return 0;
}

It compiles to:

_main:
push EAX
mov [ESP],offset FLAT:_DATA
call near ptr _printf
xor EAX,EAX
pop ECX
ret

The executable size is 23,068 bytes.

Translate it to D:

import core.stdc.stdio;

extern (C) int main(int argc, char** argv) {
    printf("hello world\n");
    return 0;
}

The executable size is the same, 23,068 bytes. This is unsurprising because the C compiler and D compiler generate the same code, as they share the same code generator. (The equivalent full D program would clock in at 194Kb.) In other words, nothing extra is paid for using D rather than C for the same code.

The Hello World program is a little too trivial. Let’s step up in complexity to the infamous sieve benchmark program:

#include <stdio.h>

/* Eratosthenes Sieve prime number calculation. */

#define true    1
#define false   0
#define size    8190
#define sizepl  8191

char flags[sizepl];

int main() {
    int i, prime, k, count, iter;

    printf ("10 iterations\n");
    for (iter = 1; iter <= 10; iter++) {
        count = 0;
        for (i = 0; i <= size; i++)
            flags[i] = true;
        for (i = 0; i <= size; i++) {
            if (flags[i]) {
                prime = i + i + 3;
                k = i + prime;
                while (k <= size) {
                    flags[k] = false;
                    k += prime;
                }
                count += 1;
            }
        }
    }
    printf ("\n%d primes", count);
    return 0;
}

Rewriting it in Better C:

import core.stdc.stdio;

extern (C):

__gshared bool[8191] flags;

int main() {
    int count;

    printf("10 iterations\n");
    foreach (iter; 1 .. 11) {
        count = 0;
        flags[] = true;
        foreach (i; 0 .. flags.length) {
            if (flags[i]) {
                const prime = i + i + 3;
                auto k = i + prime;
                while (k < flags.length) {
                    flags[k] = false;
                    k += prime;
                }
                count += 1;
            }
        }
    }
    printf("%d primes\n", count);
    return 0;
}

It looks much the same, but some things are worthy of note:

  • extern (C): means use the C calling convention.
  • D normally puts static data into thread local storage. C sticks them in global storage. __gshared accomplishes that.
  • foreach is a simpler way of doing for loops over known endpoints.
  • flags[] = true; sets all the elements in flags to true in one go.
  • Using const tells the reader that prime never changes once it is initialized.
  • The types of iter, i, prime and k are inferred, preventing inadvertent type coercion errors.
  • The number of elements in flags is given by flags.length, not some independent variable.

And the last item leads to a very important hidden advantage: accesses to the flags array are bounds checked. No more overflow errors! We didn’t have to do anything
in particular to get that, either.

This is only the beginning of how D as Better C can improve the expressivity, readability, and safety of your existing C programs. For example, D has nested functions, which in my experience work very well at prying goto’s from my cold, dead fingers.

On a more personal note, ever since -betterC started working, I’ve been converting many of my old C programs still in use into D, one function at a time. Doing it one function at a time, and running the test suite after each change, keeps the program in a correctly working state at all times. If the program doesn’t work, I only have one function to look at to see where it went wrong. I don’t particularly care to maintain C programs anymore, and with -betterC there’s no longer any reason to.

The Better C ability of D is available in the 2.076.0 beta: download it and read the changelog.