Author Archives: Walter Bright

Crafting Self-Evident Code with D


Have you ever looked at your code from five years ago and had to study it to figure out what it was doing? And the further back in time you look, the worse it gets? Pity me, who is still maintaining code I wrote over 40 years ago. This article illustrates many simple methods of making your code self-evident and much easier to understand and maintain.

To let you know what you’re fightin’ for, allow me to introduce this little gem I wrote back in 1987:

#include <stdio.h>
#define O1O printf
#define OlO putchar
#define O10 exit
#define Ol0 strlen
#define QLQ fopen
#define OlQ fgetc
#define O1Q abs
#define QO0 for
typedef char lOL;

lOL*QI[] = {"Use:\012\011dump file\012","Unable to open file '\x25s'\012",
  "\012","   ",""};

main(I,Il)
lOL*Il[];
{       FILE *L;
         unsigned lO;
         int Q,OL[' '^'0'],llO = EOF,
         O=1,l=0,lll=O+O+O+l,OQ=056;
         lOL*llL="%2x ";
         (I != 1<<1&&(O1O(QI[0]),O10(1011-1010))),
         ((L = QLQ(Il[O],"r"))==0&&(O1O(QI[O],Il[O]),O10(O)));
         lO = I-(O<<l<<O);
         while (L-l,1)
         {       QO0(Q = 0L;((Q &~(0x10-O))== l);
                         OL[Q++] = OlQ(L));
                 if (OL[0]==llO) break;
                 O1O("\0454x: ",lO);
                 if (I == (1<<1))
                 {       QO0(Q=Ol0(QI[O<<O<<1]);Q<Ol0(QI[0]);
                         Q++)O1O((OL[Q]!=llO)?llL:QI[lll],OL[Q]);/*"
                         O10(QI[1O])*/
                         O1O(QI[lll]);{}
                 }
                 QO0 (Q=0L;Q<1<<1<<1<<1<<1;Q+=Q<0100)
                 {       (OL[Q]!=llO)? /* 0010 10lOQ 000LQL */
                         ((D(OL[Q])==0&&(*(OL+O1Q(Q-l))=OQ)),
                         OlO(OL[Q])):
                         OlO(1<<(1<<1<<1)<<1);
                 }
                 O1O(QI[01^10^9]);
                 lO+=Q+0+l;}
         }
         D(l) { return l>=' '&&l<='\~';
}

Yes, this is how we wrote C code back then. I even won an award for it!

Although I am a very slow learner, I do learn over time, and gradually the code got better. You’re probably having the same issues with your code. (Or the code written by coworkers, as I agree that your code certainly does not need improvement!)

This article is about techniques that will help make code self-evident. You are probably already doing some of them. I bet there are some you aren’t. I’m also sure you’re going to argue with me about some of them. Trust me, you’re wrong! If you don’t agree with me now, you will if you’re still programming five years hence.

I know you’re busy, so let’s jump right in with an observation:

“Anybody can write complicated code. It takes genius to write simple code.”

or, if you prefer:

“The highest accolade your code can garner is: oh pshaw, anybody could have
written that!”

For example, since I started as an aerospace engineer:

[Photo: a cockpit lever whose knob is shaped like a little wheel.] Consider this lever, commonly found in aircraft cockpits. No fair if you already know what it does. Examine it casually. What is it for?

.

.

.

It raises and lowers the landing gear. What’s the clue? It’s got a little tire for a knob! Pull the lever up, and the gear gets sucked up. Push it down, the gear goes down. Is that self-evident or what? It’s a masterpiece of simplicity. It doesn’t even need any labels. If the cockpit is filled with smoke, or you’re focused on what’s outside the window, your hand knows immediately it’s on the gear lever—not the flaps or the throttles or the copilot’s ejection seat (just kidding). This kind of stupid simple control is what cockpit designers strive for because pulling the right lever is literally a life-and-death decision. I mean literally in the literal sense of the word.

This is what we desperately want to achieve in programming. Stupid simple. We’ll probably fail, but the closer the better.

Diving in…

Just Shoot Me Now

#define BEGIN {
#define END   }

Believe it or not, this was common C practice back in the 1980s. It falls into the category of “Don’t try to make your new language look like your previous language”. This problem appears in multiple guises. I still tend to name variables in Fortran style from back when the oceans hadn’t yet formed. Before moving to D, I realized that using C macros to invent a personal custom language on top of C was an abomination. Removing it and replacing it with ordinary C code was a big improvement in clarity.

Don’t Reinvent bool

Learn what bool is and use it as intended. Accept that the following are all the same:

false true
0 1
no yes
off on
0 volts 5 volts

And that this makes code unequivocally worse:

enum { No, Yes };

Just use false and true. Done. And BTW,

enum { Yes, No };

is just an automatic “no hire” decision, as if (Yes) will thoroughly confuse everyone. If you’ve done this, run and fix it before someone curses your entire ancestry.

Horrors Blocked by D

D’s syntax has been designed to prevent some types of coding horrors.

C++ Regex expressions with operator overloading

I’m not even going to link to this. It can be found with diligent searching. What it does is use operator overloading to make ordinary-looking C++ code actually be a regex! It violates the principle that code should not pretend to be in another language. Imagine the inept error messages the compiler will bless you with if there’s a coding mistake with this. D makes this hard to do by only allowing arithmetic operators to be overloaded, which disallows things like an overloaded unary *.

(It’s harder, but still possible, to abuse operator overloading in D. But fortunately, making it harder has largely discouraged it.)

Metaprogramming with macros

Many people have requested macros be added to D. We’ve resisted this because macros inevitably result in people inventing their own custom, undocumented language layered over D. This makes it impractical for anyone else to make use of this code. In my not-so-humble opinion, macros are the reason why Lisp has never caught on in the mainstream. No Lisper can read anyone else’s Lisp code.

C++ Argument Dependent Lookup

Nobody knows what symbol will actually be found. ADL was added so one could do operator overloading on the left operand. D just has a simple syntax for left or right operand overloading.

SFINAE

Nobody knows if SFINAE is in play or not for any particular expression.

Floor Wax or Tasty Dessert Topping

This refers to the confusion between a struct being a value type or a reference type or some chimera of both. In D, a struct is a value type and a class is a reference type. To be fair, some people still try to build a D chimera type, but they should be cashiered.

Multiple inheritance

Nobody has ever made a convincing case for why this is needed. Things get really nasty when diamond inheritance is used. Pity the next guy and avoid the temptation. D has multiple inheritance for interfaces only, which has proved to be more than adequate.
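
D's interface-only multiple inheritance can be sketched like this (Walker, Swimmer, and Duck are made-up names for illustration):

```d
// A class may implement any number of interfaces,
// but may extend only one base class.
interface Walker { string walk(); }
interface Swimmer { string swim(); }

class Duck : Walker, Swimmer
{
    string walk() { return "waddle"; }
    string swim() { return "paddle"; }
}

void main()
{
    auto d = new Duck;
    assert(d.walk() == "waddle" && d.swim() == "paddle");
}
```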

Code Flow

Code should flow from left to right, and top to bottom. Just like how this article is read.

f() + g() // which executes first?

Fortunately, D guarantees a left-to-right ordering (C does not). But what about:

g(f(e(d(c(b(a))),3)));

That executes inside out! Quick, which function call does the 3 get passed to? D’s Universal Function Call Syntax to the rescue:

a.b.c.d(3).e.f.g;

That’s the equivalent, but execution flows clearly left-to-right. Is this an extreme example, or the norm?

import std.stdio;
import std.array;
import std.algorithm;

void main() {
     stdin.byLine(KeepTerminator.yes).
     map!(a => a.idup).
     array.
     sort.
     copy(stdout.lockingTextWriter());
}

This code reads from stdin by lines, puts the lines in an array, sorts the array, and writes the sorted result to stdout. It doesn’t quite meet our “stupid simple” criteria, but it is pretty close. All with a nice flow from left to right and top to bottom.

The example also nicely segues into the next observation.

The More Control Paths, the Less Understandable

Shaw: You know a great deal about computers, don’t you?
Mr. Spock: I know all about them.

I submit that:

version (X)
     doX();
doY();
if (Z)
     doZ();

is less comprehensible than:

doX();
doY();
doZ();

What happened to the conditional expressions? Move them to the interiors of doX() and doZ().

I know what you’re thinking. “But Walter, you didn’t eliminate the conditional expressions, you just moved them!” Quite right, but those conditional expressions properly belong in the functions, rather than enclosing those functions. They are part of the conceptual encapsulation a function provides, so the caller is clean.
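
Here’s a minimal sketch of what that looks like (doX, doZ, FeatureX, and Z are all hypothetical):

```d
int workDone;
bool Z;              // stand-in for whatever runtime condition Z is

void doX()
{
    version (FeatureX)   // the compile-time test lives inside doX()
        workDone++;
}

void doZ()
{
    if (Z)               // the runtime test lives inside doZ()
        workDone++;
}

void main()
{
    Z = true;
    doX();               // call sites stay unconditional and clean
    doZ();
    assert(workDone == 1);  // FeatureX was not set for this build
}
```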

Negation

Negation in English:

Dr McCoy: We’re trying to help you, Oxmyx.
Bela Oxmyx: Nobody helps nobody but himself.
Mr. Spock: Sir, you are employing a double negative.

Cowardly Lion: Not nobody! Not nohow!

Negation in English is often used as emphasis, rather than logical negation. Our perception of negation is fuzzy and fraught with error. This is something propagandists use to smear someone.

What the propagandist says: “Bob is not a drunkard!”

What the audience hears: “Bob is a drunkard!”

Skilled communicators avoid negation. Savvy programmers do, too. How many times have you overlooked a not operator? I have, many times.

if (!noWay)

is inevitably perceived as:

if (noWay)

I mentioned this discovery to my good friend Andrei Alexandrescu. He didn’t buy it. He said I needed research to back it up. I didn’t have any research, but didn’t change my mind (i.e., hubris). Eventually, I did run across a paper that did do such research and came to the same conclusion as my assumption. I excitedly sent it to Andrei, and to his great credit, he conceded defeat, which is why Andrei is an exceptional man (rare is the person who ever concedes defeat!).

The lesson here is to avoid using negation in identifiers if at all possible.

if (way)

Isn’t that better?

DMD Source Code Hall of Shame

My own code is hardly a paragon of virtue. Some identifiers:

  • tf.isnothrow
  • IsTypeNoreturn
  • Noaccesscheck
  • Ignoresymbolvisibility
  • Include.notComputed
  • not nothrow

I have no excuse and shall have myself flagellated with a damp cauliflower. Did I say I didn’t like the code I wrote five years ago?

This leads us to the D version conditional.

Negation and version

D version conditionals are very simple:

version ( Identifier )

Identifier is usually predefined by the compiler or the command line. Only an identifier is allowed—no negation, AND, OR, or XOR. (Collectively call that version algebra.) Our users often chafe at this restriction, and I get that it’s difficult to accept the rationale at first. It’s not impossible to do version algebra:

version (A) { } else {
    // !A
}

version (A) version (B) {
    // A && B
}

version (A) version = AorB;
version (B) version = AorB;
version (AorB) {
     // A || B
}

and so forth. But it’s clumsy and unattractive on purpose. Why would D do such a thing? It’s meant to encourage thinking about versions in a positive manner. Suppose a project has a Windows and an OSX build:

version (Windows) {
     ...
}
else version (OSX) {
     ...
}
else
     static assert(0, "unsupported operating system");

Isn’t that better than this:

...
version (!Windows){
...
}

I’ve seen an awful lot of that style in C. It makes it pointlessly difficult to add support for a new operating system. After all, what the heck is the “not Windows” operating system? That really narrows things down! The former snippet makes it so much easier.

Taking this a step further:

if (A && B && C && D)

if (A || B || C || D)

are easy for a human to parse. Encountering:

if (A && (!B || C))

is always like transitioning from smooth asphalt to a cobblestone road. Ugh. I’ve made mistakes with such constructions all the time. Not only is it hard to even see the !, but it’s still hard to satisfy yourself that it is correct.

Fortunately, De Morgan’s Theorem can sometimes come to the rescue:

(!A && !B) => !(A || B)
(!A || !B) => !(A && B)

It gets rid of one negation. Repeated application can often transform it into a much more easily understood equation while being equally correct.
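
A throwaway check that the theorem holds for every truth assignment:

```d
void main()
{
    // De Morgan: (!A && !B) == !(A || B) and (!A || !B) == !(A && B)
    foreach (a; [false, true])
        foreach (b; [false, true])
        {
            assert((!a && !b) == !(a || b));
            assert((!a || !b) == !(a && b));
        }
    // So a test like `if (!ready && !done)` can be rewritten as
    // `if (!(ready || done))`: one negation instead of two.
}
```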

Anecdote: When designing digital logic circuits, the NAND gate is more efficient than the AND gate because it has one less transistor. (AND means (A && B), NAND means !(A && B)). But humans just stink at crafting bug-free NAND logic. When I worked on the design of the ABEL programming language back in the 1980s, which was for programming Programmable Logic Devices, ABEL would accept input in positive logic. It would use De Morgan’s theorem to automatically convert it to efficient negative logic. The electronics designers loved it.

To sum up this section, here’s a shameful snippet from Ubuntu’s unistd.h:

#if defined __USE_BSD || (defined __USE_XOPEN && !defined __USE_UNIX98)

Prof Marvel: I can’t bring it back, I don’t know how it works!

Casts Are Bugs

Casts subvert the protections of the typing system. Sometimes you just gotta have them (to implement malloc, for example, the result needs a cast), but far too often they are simply there to correct sloppy misuse of types. Hence, in D casts are done with the keyword cast, not a peculiar syntax, making them easily greppable. It’s worthwhile to occasionally grep a code base for cast and see if the types can be reworked to eliminate the need for the cast and have the type system working for rather than against you.

Pull Request: remove some dyncast calls

Self-Documenting Function Declarations

char* xyzzy(char* p)
  • Does p modify what it points to?
  • Is p returned?
  • Does xyzzy free p?
  • Does xyzzy save p somewhere, like in a global?
  • Does xyzzy throw p?

These crucial bits of information are rarely noted in the documentation for the function. Even worse, the documentation often gets it wrong! What is needed is self-documenting code that is enforced by the compiler. D has attributes to cover this:

const char* xyzzy(return scope const char* p)
  • p doesn’t modify what it points to
  • p is returned
  • p is not free’d
  • xyzzy does not squirrel away a copy of p
  • p is not thrown in an exception

This is all documentation that now isn’t necessary to write, and the compiler will check its accuracy for you. Yes, it is called “attribute soup” for good reason, and takes some getting used to, but it’s still better than bad documentation, and adding attributes is optional.
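
As a rough sketch of how these attributes read in practice (firstChar is a made-up example, not a DMD function):

```d
// The signature alone documents the contract; the commented-out line
// is the kind of violation the compiler rejects.
const(char)* firstChar(return scope const(char)* p)
{
    // *p = 'x';   // error: cannot modify const data through p
    return p;      // allowed, because p is declared `return scope`
}

void main()
{
    static immutable char[5] msg = "hello";
    assert(*firstChar(msg.ptr) == 'h');
}
```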

Function Arguments and Returns

Function inputs and outputs present in the function declaration are the “front door”. Any inputs and outputs that are not in the function declaration are “side doors”. Side doors include things like global variables, environment variables, getting information from the operating system, reading/writing files, throwing exceptions, etc. Side doors are rarely accounted for in the documentation. The poor sap calling a function has to carefully read its implementation to discern what the side doors are.

Self-evident code should strive to run everything through the front door. Not only does this help with comprehension, but it also enables delightful things like easy unit testing.

Memory Allocation

An ongoing problem faced by functions that implement an algorithm that needs to allocate memory is what memory allocation scheme should be used. Often a reusable function imposes the memory allocation method on the caller. That’s backward.

For memory that is allocated and free’d by the function, the solution is that the function decides how to do it. For allocated objects that are returned by the function, the caller should decide the allocation scheme by passing an argument that specifies it. This argument often takes the form of a “sink” to send the output to. More on that later.

Pass Abstract “sink” for Output

The auld way (extracted from the DMD source code):

import dmd.errors;
void gendocfile(Module m) {
     ...
     if (!success)
         error("expansion limit");
}

error() is a function that error messages are sent to. This is a typical formulation seen in conventional code. The error message is going out through the side door. The caller of gendocfile() has no say in what’s done with the error message, and the fact that it even generates error messages is usually omitted by the documentation. Worse, the error message emission makes it impractical to properly unit test the function.

A better way is to pass an abstract interface “sink” as a parameter and send the error messages to the sink:

import dmd.errorsink;
void gendocfile(Module m, ErrorSink eSink) {
     ...
     if (!success)
         eSink.error("expansion limit");
}

Now the caller has total control of what happens to the error messages, and it is implicitly documented. A unit tester can provide a special implementation of the interface to suit testing convenience.
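
A unit tester might plug in a capturing sink like this (a hypothetical minimal sketch; DMD’s actual ErrorSink interface differs in detail):

```d
// Minimal sink interface plus a test double that records
// messages instead of printing them.
interface ErrorSink
{
    void error(string msg);
}

class CollectingSink : ErrorSink
{
    string[] messages;
    void error(string msg) { messages ~= msg; }
}

// The function under test reports through whatever sink it was handed.
void expand(bool success, ErrorSink eSink)
{
    if (!success)
        eSink.error("expansion limit");
}

void main()
{
    auto sink = new CollectingSink;
    expand(false, sink);
    assert(sink.messages == ["expansion limit"]);
}
```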

Here’s a real-world PR making this improvement:

doc.d: use errorSink

Pass Files as Buffers Rather than Files to Read

Typical code I’ve written, where the file names are passed to a function to read and process them:

void gendocfile(Module m, const(char)*[] docfiles) {
     OutBuffer mbuf;
     foreach (file; docfiles) {
         auto buffer = readFile(file.toDString());
         mbuf.write(buffer.data);
     }
     ...
}

This kind of code is a nuisance to unit test, as adding file I/O to the unit tester is very clumsy, and, as a result, no unit tests get written. Doing file I/O is usually irrelevant to the function, anyway. It just needs the data to operate on.

The fix is to pass the contents of the file in an array:

void gendocfile(Module m, const char[] ddoctext) {
     ...
}
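
A sketch of the resulting split, with the file read left to the caller (the names are illustrative):

```d
// The processing function sees only text, never the file system.
size_t gendoctext(const char[] ddoctext)
{
    // ... process the text; here we just measure it ...
    return ddoctext.length;
}

void main()
{
    // The caller owns the I/O, e.g.:
    // auto text = cast(const(char)[]) std.file.read("module.ddoc");
    const char[] text = "fake file contents";  // stand-in for a file read
    assert(gendoctext(text) == 18);
}
```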

The PR: move ddoc file reads out of doc.d

Write to Buffer, Caller Writes File

A typical function that processes data and writes the result to a file:

void gendocfile(Module m) {
     OutBuffer buf;
     ... fill buf ...
     writeFile(m.loc, m.docfile.toString(), buf[]);
}

By now, you know that the caller should write the file:

void gendocfile(Module m, ref OutBuffer outbuf) {
     ... fill outbuf ...
}
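
A sketch of the same idea, using Phobos’s class-based std.outbuffer as a stand-in for DMD’s internal OutBuffer struct:

```d
import std.outbuffer : OutBuffer;

// The function fills the buffer; the caller decides what becomes of it
// (write a file, compare in a unit test, or throw it away).
void gendoc(OutBuffer buf)
{
    buf.write("generated documentation");
}

void main()
{
    auto buf = new OutBuffer;
    gendoc(buf);
    // std.file.write("module.html", buf.toBytes());  // the caller's call
    assert(buf.toString() == "generated documentation");
}
```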

And the PR:
doc.d: move file writes to caller

Move Environment Calls to Caller

Here’s a function that obtains input from the environment:

void gendocfile(Module m) {
     char* p = getenv("DDOCFILE");
     if (p)
         global.params.ddoc.files.shift(p);
}

It should be pretty obvious by now what is wrong with that. PR to move the environment read to the caller and then pass the info through the front door:

move DDOCFILE from doc.d to main.d
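
A sketch of the front-door version (withDdocFile is a hypothetical name):

```d
import std.process : environment;

// Pure logic: the DDOCFILE name arrives as an ordinary argument.
string[] withDdocFile(string ddocfile, string[] files)
{
    return ddocfile.length ? ddocfile ~ files : files;
}

void main()
{
    // Only the caller touches the environment...
    auto fromEnv = environment.get("DDOCFILE", "");
    auto files = withDdocFile(fromEnv, ["a.ddoc"]);

    // ...so a unit test can simply pass the value directly.
    assert(withDdocFile("macros.ddoc", ["a.ddoc"]) ==
           ["macros.ddoc", "a.ddoc"]);
    assert(withDdocFile("", ["a.ddoc"]) == ["a.ddoc"]);
}
```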

Use Pointers to Functions (or Templates)

I was recently working on a module that did text processing. One thing it needed to do was identify the start of an identifier string. Since Unicode is complicated, it imported the (rather substantial) module that handled Unicode. But it bugged me that all that was needed was to determine the start of an identifier; the text processor needed no further knowledge of Unicode.

It finally occurred to me that the caller could just pass a function pointer as an argument to the text processor, and the text processor would need no knowledge whatsoever of Unicode.

import dmd.doc;
bool expand(...) {
     if (isIDStart(p))
         ...
}

became:

alias fp_t = bool function(const(char)* p);
bool expand(..., fp_t isIDStart) {
     if (isIDStart(p))
         ...
}

Notice how the import just went away, improving the encapsulation and comprehensibility of the function. The function pointer could also be a template parameter, whichever is more convenient for the application. The more moats one can erect around a function, the easier it is to understand.

The PR: remove dmacro.d dependency on doc.d

Two Categories of Functions

  • Alters the state of the program

Provide a clue in the name of the function, like doAction().

  • Asks a Question

Again, a clue should be in the name. Something like isSomething(), hasCharacteristic(), getInfo(), etc. Consider making the function pure to ensure it has no side effects.

Try not to create functions that both ask a question and modify state. Over time, I’ve been gradually splitting such functions into two.
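
A sketch of such a split (Cursor is a made-up example):

```d
// Instead of one function that both answers and mutates, e.g.
//     int takeNext() { return items[pos++]; }
// separate the two roles.
struct Cursor
{
    int[] items;
    size_t pos;

    // Questions: pure, no state change.
    bool empty() const pure { return pos >= items.length; }
    int front() const pure { return items[pos]; }

    // State change, and nothing else.
    void popFront() { ++pos; }
}

void main()
{
    auto c = Cursor([10, 20]);
    assert(!c.empty && c.front == 10);
    c.popFront();
    assert(c.front == 20);
}
```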

Visual Pattern Recognition

Source code formatters are great. But I don’t use them. Oof! Here’s why:

final switch (of)
{
     case elf:   lib = LibElf_factory();    break;
     case macho: lib = LibMach_factory();   break;
     case coff:  lib = LibMSCoff_factory(); break;
     case omf:   lib = LibOMF_factory();    break;
}

It turns out your brain is incredibly good at pattern recognition. By lining things up, a pattern is created. Any deviation from that pattern is likely a bug, and your eyes will be drawn to the anomaly like a stink bug to rotting fruit.

I’ve detected so much rotting fruit by using patterns, and a source code formatter doesn’t do a good job of making patterns.

Prof. Marvel: I have reached a cataclysmic decision!

Use ref instead of *

A ref is a restricted form of pointer. Arithmetic is not allowed on it, and ref parameters are not allowed to escape a function. This not only informs the user but informs the compiler, which will ensure the function is a good boy with the ref.
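
A minimal sketch of the difference (bump is hypothetical):

```d
// With an int*, the callee could do arithmetic on the pointer or stash
// it away. With ref, the compiler rules both out in @safe code.
@safe void bump(ref int x)
{
    x += 1;
    // Taking &x to let it escape is what @safe code forbids.
}

void main() @safe
{
    int n = 41;
    bump(n);
    assert(n == 42);
}
```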

PR: mangleToBuffer(): use ref

Takeaways

  • Use language features as intended (don’t invent your own language
    on top of it)
  • Avoid negation
  • Left to right, top to bottom
  • Functions do everything through the front door
  • Don’t conflate engine with environment
  • Reduce cyclomatic complexity
  • Separate functions that ask a question from functions that alter state
  • Keep trying—this is a process!

The recommendations here are pretty easy to follow. It’ll rarely be necessary to do much refactoring to implement them. I hope the real-life PRs referenced here show how easy it is to make code self-evident!

Action Item

Open your latest coding masterpiece in your favorite editor. Take a good hard look at it. Sorry, it’s a steaming pile of incomprehensibility! (Join the club.)

Go fix it!

The Binary Language of Moisture Vaporators


I know why you’re reading this. Like other Alpha programmers, you’re not content with just compiling Vaporator code and testing to see if it works. You need to know the binary code that’s generated. But getting at it is clumsy. I want to make it easy for myself, and why not share it?

One of my earliest memories is being curious about how light bulbs worked and sticking my finger in a hot light bulb socket. I was three or four years old at the time. Later, I was always taking things apart to see how they worked. It was years before I could successfully put them back together again. I remember being baffled at the grey dust inside a resistor I cracked open and when I unwrapped the paper in a capacitor. I took my first car to pieces to see how it worked.

When I first learned Fortran, it was a great mystery how the text of the language turned into machine code. Machine code was the language of the gods. This evolved into wanting to make my own compiler. But to build a compiler, you need to be able to see the output. A disassembler had to be built along with the compiler. That became obj2asm.exe. I’ve spent a great deal of time running the disassembler and poring over what the compiler generated. I look at what other compilers generate, too, using obj2asm.

But running obj2asm is a separate process, and the output is filled with all the boilerplate needed to create a proper object file. The boilerplate is rarely of interest, and I’m only interested in the generated code for a function. Why not just give the compiler a switch, call it -vasm (short for Show Me The Vaporator Assembly), and have it emit the binary and assembler code to the screen, function by function? So I ripped the disassembler logic out of obj2asm and put it into the dmd D compiler.

One would think that the way to do this would be to have the compiler generate the assembler source code, which would then be run through an assembler like MASM or gas to create the object file. I figured this would be slow and too much work. Instead, the disassembler logic actually intercepts the binary data being written to the object file and disassembles it to a string, then prints the string to the console.

For example, for the file vaporator.d:

int demo(int x)
{
     return x * x;
}

Compiling with:

dmd vaporator.d -c -vasm

prints:

_D9vaporator4demoFiZi:
0000:   89 F8                   mov     EAX,EDI
0002:   0F AF C0                imul    EAX,EAX
0005:   C3                      ret

and we see the mangled name of the function, the code offsets, the code binary, and the mnemonic representation for those learning binary.

I am not aware of any other compiler that does this in the same way. This is probably because most programmers are not particularly interested in how the sausages are made. But I find it fascinating and fun. I’ve opined before that programmers who don’t know the assembler their code is transformed into are not likely to be Alpha programmers. With the -vasm switch, it’s so easy to look at the output, why not do it? It works as a great way to learn assembler code, too!

I’ve been using it myself, and the convenience is a game changer. What are you waiting for?

P.S. I made the disassembler as a Boost Licensed standalone module that anyone can use who needs a tool to understand the binary language of moisture vaporators.

Ownership and Borrowing in D

Nearly all non-trivial programs allocate and manage memory. Getting it right is becoming increasingly important, as programs get ever more complex and mistakes get ever more costly. The usual problems are:

  1. memory leaks (failure to free memory when no longer in use)
  2. double frees (freeing memory more than once)
  3. use-after-free (continuing to refer to memory already freed)

The challenge is in keeping track of which pointers are responsible for freeing the memory (i.e. owning the memory), which pointers are merely referring to the memory, where they are, and which are active (in scope).

The common solutions are:

  1. Garbage Collection – The GC owns the memory and periodically scans memory looking for any pointers to that memory. If none are found, the memory is released. This scheme is reliable and in common use in languages like Go and Java. It tends to use much more memory than strictly necessary, have pauses, and slow down code because of inserted write barriers.
  2. Reference Counting – The RC object owns the memory and keeps a count of how many pointers point to it. When that count goes to zero, the memory is released. This is also reliable and is commonly used in languages like C++ and Objective-C. RC is memory efficient, needing only a slot for the count. The downsides of RC are the expense of maintaining the count, the exception handlers needed to ensure the decrements happen, and the locking all of this requires for objects shared between threads. To regain efficiency, sometimes the programmer will cheat and temporarily refer to the RC object without adjusting the count, engendering a risk that this is not done correctly.
  3. Manual – Manual memory management is exemplified by C’s malloc and free. It is fast and memory efficient, but there’s no language help at all in using them correctly. It’s entirely up to the programmer’s skill and diligence in using it. I’ve been using malloc and free for 35 years, and through bitter and endless experience rarely make a mistake with them anymore. But that’s not the sort of thing a programming shop can rely on, and note I said “rarely” and not “never”.

Solutions 2 and 3 more or less rely on faith in the programmer to do it right. Faith-based systems do not scale well, and memory management issues have proven to be very difficult to audit (so difficult that some coding standards prohibit use of memory allocation).

But there is a fourth way – Ownership and Borrowing. It’s memory efficient, as performant as manual management, and mechanically auditable. It has been recently popularized by the Rust programming language. It has its downsides, too, in the form of a reputation for having to rethink how one composes algorithms and data structures.

The downsides are manageable, and the rest of this article is an outline of how the ownership/borrowing (OB) system works, and how we propose to fit it into D. I had originally thought this would be impossible, but after spending a lot of time thinking about it I’ve found a way to fit it in, much like we’ve fit functional programming into D (with transitive immutability and function purity).

Ownership

The solution to who owns the memory object is ridiculously simple—there is only one pointer to it, so that pointer must be the owner. It is responsible for releasing the memory, after which it will cease to be valid. It follows that any pointers in the memory object are the owners of what they point to, there are no other pointers into the data structure, and the data structure therefore forms a tree.

It also follows that pointers are not copied, they are moved:

T* f();
void g(T*);
T* p = f();
T* q = p; // value of p is moved to q, not copied
g(p);     // error, p has invalid value

Moving a pointer out of a data structure is not allowed:

struct S { T* p; }
S* f();
S* s = f();
T* q = s.p; // error, can't have two pointers to s.p

Why not just mark s.p as being invalid? The trouble there is one would need to do so with a runtime mark, and this is supposed to be a compile-time solution, so attempting it is simply flagged as an error.

Having an owning pointer fall out of scope is also an error:

void h() {
  T* p = f();
} // error, forgot to release p?

It’s necessary to move the pointer somewhere else:

void g(T*);
void h() {
  T* p = f();
  g(p);  // move to g(), it's now g()'s problem
}

This neatly solves memory leaks and use-after-free problems. (Hint: to make it clearer, replace f() with malloc(), and g() with free().)
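
Taking the hint literally (a sketch; the ownership rules appear as comments, since the use-after-free line is exactly what the checker described here rejects):

```d
import core.stdc.stdlib : free, malloc;

void h()
{
    int* p = cast(int*) malloc(int.sizeof);  // p now owns the allocation
    if (p is null) return;                   // allocation failure
    *p = 7;
    assert(*p == 7);
    free(p);    // ownership moves into free(); p is now invalid
    // *p = 8;  // use-after-free: the checker flags any further use of p
}

void main()
{
    h();
}
```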

This can all be tracked at compile time through a function by using Data Flow Analysis (DFA) techniques, like those used to compute Common Subexpressions. DFA can unravel whatever rat’s nest of gotos happen to be there.

Borrowing

The ownership system described above is sound, but it is a little too restrictive. Consider:

struct S { void car(); void bar(); }
S* f();
S* s = f();
s.car();  // s is moved to car()
s.bar();  // error, s is now invalid

To make it work, s.car() would have to have some way of moving the pointer value back into s when s.car() returns.

In a way, this is how borrowing works. s.car() borrows a copy of s for the duration of the execution of s.car(). s is invalid during that execution and becomes valid again when s.car() returns.

In D, struct member functions take this by reference, so we can accommodate borrowing through an enhancement: taking an argument by ref borrows it.

D also supports scope pointers, which are also a natural fit for borrowing:

void g(scope T*);
T* f();
T* p = f();
g(p);      // g() borrows p
g(p);      // we can use p again after g() returns

(When functions take arguments by ref, or pointers by scope, they are not allowed to escape the ref or the pointer. This fits right in with borrow semantics.)

Borrowing in this way fulfills the promise that only one pointer to the memory object exists at any one time, so it works.

Borrowing can be enhanced further with the insight that the ownership system remains safe when there are multiple const pointers to the memory object, as long as there are no mutable pointers. (Const pointers can neither release the memory nor mutate it.) That means multiple const pointers can be borrowed from the owning mutable pointer, as long as the owning mutable pointer cannot be used while the const pointers are active.

For example:

T* f();
void g(T*);
T* p = f();  // p becomes owner
{
  scope const T* q = p; // borrow const pointer
  scope const T* r = p; // borrow another one
  g(p); // error, p is invalid while q and r are in scope
}
g(p); // ok

Principles

The above can be distilled into the notion that a memory object behaves as if it is in one of two states:

  1. there exists exactly one mutable pointer to it
  2. there exist one or more const pointers to it

The careful reader will notice something peculiar in what I wrote: “as if”. What do I mean by that weasel wording? Is there some skullduggery going on? Why yes, there is. Computer languages are full of “as if” dirty deeds under the hood, like the money you deposit in your bank account isn’t actually there (I apologize if this is a rude shock to anyone), and this isn’t any different. Read on!

But first, a bit more necessary exposition.

Folding Ownership/Borrowing into D

Isn’t this scheme incompatible with the way people normally write D code, and won’t it break pretty much every D program in existence? And not break them in an easily fixed way, but break them so badly they’ll have to redesign their algorithms from the ground up?

Yup, it sure is. Except that D has a (not so) secret weapon: function attributes. It turns out that the semantics for the Ownership/Borrowing (aka OB) system can be run on a per-function basis after the usual semantic pass has been run. The careful reader may have noticed that no new syntax is added, just restrictions on existing code. D has a history of using function attributes to alter the semantics of a function—for example, adding the pure attribute causes a function to behave as if it were pure. To enable OB semantics for a function, an attribute @live is added.

This means that OB can be added to D code incrementally, as needed, and as time and resources permit. It becomes possible to add OB while, and this is critical, keeping your project in a fully functioning, tested, and releasable state. It’s mechanically auditable how much of the project is memory safe in this manner. It adds to the list of D’s many other memory-safe guarantees (such as no pointers to the stack escaping).

As If

Some necessary things cannot be done with strict OB, such as reference counted memory objects. After all, the whole point of an RC object is to have multiple pointers to it. Since RC objects are memory safe (if built correctly), they can work with OB without negatively impinging on memory safety. They just cannot be built with OB. The solution is that D has other attributes for functions, like @system. @system is where much of the safety checking is turned off. Of course, OB will also be turned off in @system code. It’s there that the RC object’s implementation hides from the OB checker.

But in OB code, the RC object looks to the OB checker like it is obeying the rules, so no problemo!

A number of such library types will be needed to successfully use OB.

Conclusion

This article is a basic overview of OB. I am working on a much more comprehensive specification. It’s always possible I’ve missed something and that there’s a hole below the waterline, but so far it’s looking good. It’s a very exciting development for D and I’m looking forward to getting it implemented.

For further discussion and comments from Walter, see the discussion threads on the /r/programming subreddit and at Hacker News.

DasBetterC: Converting make.c to D

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century. This post is the third in a series about D’s BetterC mode.


D as BetterC (a.k.a. DasBetterC) is a way to upgrade existing C projects to D in an incremental manner. This article shows a step-by-step process of converting a non-trivial C project to D and deals with common issues that crop up.

While the dmd D compiler front end has already been converted to D, it’s such a large project that it can be hard to see just what was involved. I needed to find a smaller, more modest project that can be easily understood in its entirety, yet is not a contrived example.

The old make program I wrote for the Datalight C compiler in the early 1980’s came to mind. It’s a real implementation of the classic make program that’s been in constant use since the early 80’s. It’s written in pre-Standard C, has been ported from system to system, and is a remarkably compact 1961 lines of code, including comments. It is still in regular use today.

Here’s the make manual, and the source code. The executable size for make.exe is 49,692 bytes and the last modification date was Aug 19, 2012.

The Evil Plan is:

  1. Minimize diffs between the C and D versions. This is so that if the programs behave differently, it is far easier to figure out the source of the difference.
  2. No attempt will be made to fix or improve the C code during translation. This is also in the service of (1).
  3. No attempt will be made to refactor the code. Again, see (1).
  4. Duplicate the behavior of the C program as exactly and as much as possible, bugs and all.
  5. Do whatever is necessary as needed in the service of (4).

Once that is completed, only then is it time to fix, refactor, clean up, etc.

Spoiler Alert!

The completed conversion. The resulting executable is 52,252 bytes (quite comparable to the original 49,692). I haven’t analyzed the increment in size, but it is likely due to instantiations of the NEWOBJ template (a macro in the C version), and changes in the DMC runtime library since 2012.

Step By Step

Here are the differences between the C and D versions. It’s 664 out of 1961 lines, about a third, which looks like a lot, but I hope to convince you that nearly all of it is trivial.

The #include files are replaced by corresponding D imports, such as replacing #include <stdio.h> with import core.stdc.stdio;. Unfortunately, some of the #include files are specific to Digital Mars C, and D versions do not exist (I need to fix that). To not let that stop the project, I simply included the relevant declarations in lines 29 to 64. (See the documentation for the import declaration.)

#if _WIN32 is replaced with version (Windows). (See the documentation for the version condition and predefined versions.)

extern (C): marks the remainder of the declarations in the file as compatible with C. (See the documentation for the linkage attribute.)

A global search/replace changes uses of the debug1, debug2 and debug3 macros to debug printf. In general, #ifdef DEBUG preprocessor directives are replaced with debug conditional compilation. (See the documentation for the debug statement.)

/* Delete these old C macro definitions...
#ifdef DEBUG
#define debug1(a)       printf(a)
#define debug2(a,b)     printf(a,b)
#define debug3(a,b,c)   printf(a,b,c)
#else
#define debug1(a)
#define debug2(a,b)
#define debug3(a,b,c)
#endif
*/

// And replace their usage with the debug statement
// debug2("Returning x%lx\n",datetime);
debug printf("Returning x%lx\n",datetime);

The TRUE, FALSE and NULL macros are search/replaced with true, false, and null.

The ESC macro is replaced by a manifest constant. (See the documentation for manifest constants.)

// #define ESC     '!'
enum ESC =      '!';

The NEWOBJ macro is replaced with a template function.

// #define NEWOBJ(type)    ((type *) mem_calloc(sizeof(type)))
type* NEWOBJ(type)() { return cast(type*) mem_calloc(type.sizeof); }

The filenamecmp macro is replaced with a function.

Support for obsolete platforms is removed.

Global variables in D are placed by default into thread-local storage (TLS). But since make is a single-threaded program, they can be inserted into global storage with the __gshared storage class. (See the documentation for the __gshared attribute.)

// int CMDLINELEN;
__gshared int CMDLINELEN;

D doesn’t have a separate struct tag name space, so the typedefs are not necessary. An
alias can be used instead. (See the documentation for alias declarations.) Also, struct is omitted from variable declarations.

/*
typedef struct FILENODE
        {       char            *name,genext[EXTMAX+1];
                char            dblcln;
                char            expanding;
                time_t          time;
                filelist        *dep;
                struct RULE     *frule;
                struct FILENODE *next;
        } filenode;
*/
struct FILENODE
{
        char            *name;
        char[EXTMAX+1]  genext;
        char            dblcln;
        char            expanding;
        time_t          time;
        filelist        *dep;
        RULE            *frule;
        FILENODE        *next;
}

alias filenode = FILENODE;

macro is a keyword in D, so we’ll just use MACRO instead.

Grouping together multiple pointer declarations is not allowed in D; use this instead:

// char *name,*text;
// In D, the * is part of the type and 
// applies to each symbol in the declaration.
char* name, text;

C array declarations are transformed to D array declarations. (See the documentation for D’s declaration syntax.)

// char            *name,genext[EXTMAX+1];
char            *name;
char[EXTMAX+1]  genext;

static has no meaning at module scope in D. static globals in C are equivalent to private module-scope variables in D, but that doesn’t really matter when the module is never imported anywhere. They still need to be __gshared, and that can be applied to an entire block of declarations. (See the documentation for the static attribute.)

/*
static ignore_errors = FALSE;
static execute = TRUE;
static gag = FALSE;
static touchem = FALSE;
static debug = FALSE;
static list_lines = FALSE;
static usebuiltin = TRUE;
static print = FALSE;
...
*/

__gshared
{
    bool ignore_errors = false;
    bool execute = true;
    bool gag = false;
    bool touchem = false;
    bool xdebug = false;
    bool list_lines = false;
    bool usebuiltin = true;
    bool print = false;
    ...
}

Forward reference declarations for functions are not necessary in D. Functions defined in a module can be called at any point in the same module, before or after their definition.

Wildcard expansion doesn’t have much meaning to a make program.

Function parameters declared with array syntax are pointers in reality, and are declared as pointers in D.

// int cdecl main(int argc,char *argv[])
int main(int argc,char** argv)

mem_init() expands to nothing and we previously removed the macro.

C code can play fast and loose with arguments to functions; D demands that function prototypes be respected.

void cmderr(const char* format, const char* arg) {...}

// cmderr("can't expand response file\n");
cmderr("can't expand response file\n", null);

Global search/replace C’s arrow operator (->) with the dot operator (.), as member access in D is uniform.

Replace conditional compilation directives with D’s version.

/*
 #if TERMCODE
    ...
 #endif
*/
    version (TERMCODE)
    {
        ...
    }

The lack of function prototypes shows the age of this code. D requires proper prototypes.

// doswitch(p)
// char *p;
void doswitch(char* p)

debug is a D keyword. Rename it to xdebug.

The \n\ line endings for C multiline string literals are not necessary in D.

Comment out unused code using D’s /+ +/ nesting block comments. (See the documentation for line, block and nesting block comments.)

static if can replace many uses of #if. (See the documentation for the static if condition.)

Decay of arrays to pointers is not automatic in D; use .ptr.

// utime(name,timep);
utime(name,timep.ptr);

Use const for C-style strings derived from string literals in D, because D won’t allow taking mutable pointers to string literals. (See the documentation for const and immutable.)

// linelist **readmakefile(char *makefile,linelist **rl)
linelist **readmakefile(const char *makefile,linelist **rl)

void* cannot be implicitly converted to char*. Make the conversion explicit with a cast.

// buf = mem_realloc(buf,bufmax);
buf = cast(char*)mem_realloc(buf,bufmax);

Replace unsigned with uint.

inout can be used to transfer the “const-ness” of a function from its argument to its return value. If the parameter is const, so will be the return value. If the parameter is not const, neither will be the return value. (See the documentation for inout functions.)

// char *skipspace(p) {...}
inout(char) *skipspace(inout(char)* p) {...}

arraysize can be replaced with the .length property of arrays. (See the documentation for array properties.)

// useCOMMAND  |= inarray(p,builtin,arraysize(builtin));
useCOMMAND  |= inarray(p,builtin.ptr,builtin.length);

String literals are immutable, so it is necessary to replace mutable ones with a stack allocated array. (See the documentation for string literals.)

// static char envname[] = "@_CMDLINE";
char[10] envname = "@_CMDLINE";

.sizeof replaces C’s sizeof(). (See the documentation for the .sizeof property).

// q = (char *) mem_calloc(sizeof(envname) + len);
q = cast(char *) mem_calloc(envname.sizeof + len);

Don’t care about old versions of Windows.

Replace ancient C usage of char * with void*.

And that wraps up the changes! See, not so bad. I didn’t set a timer, but I doubt this took more than an hour, including debugging a couple errors I made in the process.

This leaves the file man.c, which is used to open the browser on the make manual page when the -man switch is given. Fortunately, this was already ported to D, so we can just copy that code.

Building make is so easy it doesn’t even need a makefile:

\dmd2.079\windows\bin\dmd make.d dman.d -O -release -betterC -I. -I\dmd2.079\src\druntime\import\ shell32.lib

Summary

We’ve stuck to the Evil Plan of translating a non-trivial old school C program to D, and thereby were able to do it quickly and get it working correctly. An equivalent executable was generated.

The issues encountered are typical and easily dealt with:

  • Replacement of #include with import
  • Lack of D versions of #include files
  • Global search/replace of things like ->
  • Replacement of preprocessor macros with:
    • manifest constants
    • simple templates
    • functions
    • version declarations
    • debug declarations
  • Handling identifiers that are D keywords
  • Replacement of C style declarations of pointers and arrays
  • Unnecessary forward references
  • More stringent typing enforcement
  • Array handling
  • Replacing C basic types with D types

None of the following was necessary:

  • Reorganizing the code
  • Changing data or control structures
  • Changing the flow of the program
  • Changing how the program works
  • Changing memory management

Future

Now that the code is in DasBetterC, lots of modern programming features become available to improve it.

Action

Let us know over at the D Forum how your DasBetterC project is coming along!

Vanquish Forever These Bugs That Blasted Your Kingdom

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century. This post is the second in a series about D’s BetterC mode.


Do you ever get tired of bugs that are easy to make, hard to check for, often don’t show up in testing, and blast your kingdom once they are widely deployed? They cost you time and money again and again. If you were only a better programmer, these things wouldn’t happen, right?

Maybe it’s not you at all. I’ll show how these bugs are not your fault – they’re the tools’ fault, and by improving the tools you’ll never have your kingdom blasted by them again.

And you won’t have to compromise, either.

Array Overflow

Consider this conventional program to calculate the sum of an array:

#include <stdio.h>

#define MAX 10

int sumArray(int* p) {
    int sum = 0;
    int i;
    for (i = 0; i <= MAX; ++i)
        sum += p[i];
    return sum;
}

int main() {
    static int values[MAX] = { 7,10,58,62,93,100,8,17,77,17 };
    printf("sum = %d\n", sumArray(values));
    return 0;
}

The program should print:

sum = 449

And indeed it does, on my Ubuntu Linux system, with both gcc and clang and -Wall. I’m sure you already know what the bug is:

for (i = 0; i <= MAX; ++i)
              ^^

This is the classic “fencepost problem”. It goes through the loop 11 times instead of 10. It should properly be:

for (i = 0; i < MAX; ++i)

Note that even with the bug, the program still produced the correct result! On my system, anyway. So I wouldn’t have detected it. On the customer’s system, well, then it mysteriously fails, and I have a remote heisenbug. I’m already tensing up anticipating the time and money this is going to cost me.

It’s such a rotten bug that over the years I have reprogrammed my brain to:

  1. Never, ever use “inclusive” upper bounds.
  2. Never, ever use <= in a for loop condition.

By making myself a better programmer, I have solved the problem! Or have I? Not really. Let’s look again at the code from the perspective of the poor schlub who has to review it. He wants to ensure that sumArray() is correct. He must:

  1. Look at all callers of sumArray() to see what kind of pointer is being passed.
  2. Verify that the pointer actually is pointing to an array.
  3. Verify that the size of the array is indeed MAX.

While this is trivial for the trivial program as presented here, it doesn’t really scale as the program complexity goes up. The more callers there are of sumArray, and the more indirect the data structures being passed to sumArray, the harder it is to do what amounts to data flow analysis in your head to ensure it is correct.

Even if you get it right, are you sure? What about when someone else checks in a change, is it still right? Do you want to do that analysis again? I’m sure you have better things to do. This is a tooling problem.

The fundamental issue with this particular problem is that a C array decays to a pointer when it’s an argument to a function, even if the function parameter is declared to be an array. There’s just no escaping it. There’s no detecting it, either. (At least gcc and clang don’t detect it; maybe someone has developed an analyzer that does.)

And so the tool to fix it is D as a BetterC compiler. D has the notion of a dynamic array, which is simply a fat pointer that is laid out like:

struct DynamicArray {
    T* ptr;
    size_t length;
}

It’s declared like:

int[] a;

and with that the example becomes:

import core.stdc.stdio;

extern (C):   // use C ABI for declarations

enum MAX = 10;

int sumArray(int[] a) {
    int sum = 0;
    for (int i = 0; i <= MAX; ++i)
        sum += a[i];
    return sum;
}

int main() {
    __gshared int[MAX] values = [ 7,10,58,62,93,100,8,17,77,17 ];
    printf("sum = %d\n", sumArray(values));
    return 0;
}

Compiling:

dmd -betterC sum.d

Running:

./sum
Assertion failure: 'array overflow' on line 11 in file 'sum.d'

That’s more like it. Replacing the <= with < we get:

./sum
sum = 449

What’s happening is the dynamic array a is carrying its dimension along with it and the compiler inserts an array bounds overflow check.

But wait, there’s more.

There’s that pesky MAX thing. Since the a is carrying its dimension, that can be used instead:

for (int i = 0; i < a.length; ++i)

This is such a common idiom, D has special syntax for it:

foreach (value; a)
    sum += value;

The whole function sumArray() now looks like:

int sumArray(int[] a) {
    int sum = 0;
    foreach (value; a)
        sum += value;
    return sum;
}

and now sumArray() can be reviewed in isolation from the rest of the program. You can get more done in less time with more reliability, and so can justify getting a pay raise. Or at least you won’t have to come in on weekends on an emergency call to fix the bug.

“Objection!” you say. “Passing a to sumArray() requires two pushes to the stack, and passing p is only one. You said no compromise, but I’m losing speed here.”

Indeed you are, in cases where MAX is a manifest constant, and not itself passed to the function, as in:

int sumArray(int *p, size_t length);

But let’s get back to “no compromise.” D allows parameters to be passed by reference,
and that includes arrays of fixed length. So:

int sumArray(ref int[MAX] a) {
    int sum = 0;
    foreach (value; a)
        sum += value;
    return sum;
}

What happens here is that a, being a ref parameter, is at runtime a mere pointer. It is typed, though, to be a pointer to an array of MAX elements, and so the accesses can be array bounds checked. You don’t need to go checking the callers, as the compiler’s type system will verify that, indeed, correctly sized arrays are being passed.

“Objection!” you say. “D supports pointers. Can’t I just write it the original way? What’s to stop that from happening? I thought you said this was a mechanical guarantee!”

Yes, you can write the code as:

import core.stdc.stdio;

extern (C):   // use C ABI for declarations

enum MAX = 10;

int sumArray(int* p) {
    int sum = 0;
    for (int i = 0; i <= MAX; ++i)
        sum += p[i];
    return sum;
}

int main() {
    __gshared int[MAX] values = [ 7,10,58,62,93,100,8,17,77,17 ];
    printf("sum = %d\n", sumArray(&values[0]));
    return 0;
}

It will compile without complaint, and the awful bug will still be there. Though this time I get:

sum = 39479

which looks suspicious, but it could have just as easily printed 449 and I’d be none the wiser.

How can this be guaranteed not to happen? By adding the attribute @safe to the code:

import core.stdc.stdio;

extern (C):   // use C ABI for declarations

enum MAX = 10;

@safe int sumArray(int* p) {
    int sum = 0;
    for (int i = 0; i <= MAX; ++i)
        sum += p[i];
    return sum;
}

int main() {
    __gshared int[MAX] values = [ 7,10,58,62,93,100,8,17,77,17 ];
    printf("sum = %d\n", sumArray(&values[0]));
    return 0;
}

Compiling it gives:

sum.d(10): Error: safe function 'sum.sumArray' cannot index pointer 'p'

Granted, a code review will need to include a grep to ensure @safe is being used, but that’s about it.

In summary, this bug is vanquished by preventing an array from decaying to a pointer when passed as an argument, and is vanquished forever by disallowing indirections after arithmetic is performed on a pointer. I’m sure a rare few of you have never been blasted by buffer overflow errors. Stay tuned for the next installment in this series. Maybe your moat got breached by the next bug! (Or maybe your tool doesn’t even have a moat.)

D as a Better C

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century. This post is the first in a series about D’s BetterC mode.


D was designed from the ground up to interface directly and easily to C, and to a lesser extent C++. This provides access to endless C libraries, the Standard C runtime library, and of course the operating system APIs, which are usually C APIs.

But there’s much more to C than that. There are large and immensely useful programs written in C, such as the Linux operating system and a very large chunk of the programs written for it. While D programs can interface with C libraries, the reverse isn’t true. C programs cannot interface with D ones. It’s not possible (at least not without considerable effort) to compile a couple of D files and link them in to a C program. The trouble is that compiled D files refer to things that only exist in the D runtime library, and linking that in (it’s a bit large) tends to be impractical.

D code also can’t exist in a program unless D controls the main() function, which is how the startup code in the D runtime library is managed. Hence D libraries remain inaccessible to C programs, and chimera programs (a mix of C and D) are not practical. One cannot pragmatically “try out” D by adding D modules to an existing C program.

That is, until Better C came along.

It’s been done before; it’s an old idea. Bjarne Stroustrup wrote a paper in 1988 entitled “A Better C”. His early C++ compiler was able to compile C code pretty much unchanged, and then one could start using C++ features here and there as they made sense, all without disturbing the existing investment in C. This was a brilliant strategy, and it drove the early success of C++.

A more modern example is Kotlin, which uses a different method. Kotlin syntax is not compatible with Java, but it is fully interoperable with Java, relies on the existing Java libraries, and allows a gradual migration of Java code to Kotlin. Kotlin is indeed a “Better Java”, and this shows in its success.

D as Better C

D takes a radically different approach to making a better C. It is not an extension of C, it is not a superset of C, and does not bring along C’s longstanding issues (such as the preprocessor, array overflows, etc.). D’s solution is to subset the D language, removing or altering features that require the D startup code and runtime library. This is, simply, the charter of the -betterC compiler switch.

Doesn’t removing things from D make it no longer D? That’s a hard question to answer, and it’s really a matter of individual preference. The vast bulk of the core language remains. Certainly the D characteristics that are analogous to C remain. The result is a language somewhere in between C and D, but that is fully upward compatible with D.

Removed Things

Most obviously, the garbage collector is removed, along with the features that depend on the garbage collector. Memory can still be allocated the same way as in C – using malloc() or some custom allocator.

Although C++ classes and COM classes will still work, D polymorphic classes will not, as they rely on the garbage collector.

Exceptions, typeid, static construction/destruction, RAII, and unittests are removed. But it is possible we can find ways to add them back in.

Asserts are altered to call the C runtime library assert fail functions rather than the D runtime library ones.

(This isn’t a complete list, for that see http://dlang.org/dmd-windows.html#switch-betterC.)

Retained Things

More importantly, what remains?

What may be initially most important to C programmers is memory safety in the form of array overflow checking, no more stray pointers into expired stack frames, and guaranteed initialization of locals. This is followed by what is expected in a modern language — modules, function overloading, constructors, member functions, Unicode, nested functions, dynamic closures, Compile Time Function Execution, automated documentation generation, highly advanced metaprogramming, and Design by Introspection.

Footprint

Consider a C program:

#include <stdio.h>

int main(int argc, char** argv) {
    printf("hello world\n");
    return 0;
}

It compiles to:

_main:
push EAX
mov [ESP],offset FLAT:_DATA
call near ptr _printf
xor EAX,EAX
pop ECX
ret

The executable size is 23,068 bytes.

Translate it to D:

import core.stdc.stdio;

extern (C) int main(int argc, char** argv) {
    printf("hello world\n");
    return 0;
}

The executable size is the same, 23,068 bytes. This is unsurprising because the C compiler and D compiler generate the same code, as they share the same code generator. (The equivalent full D program would clock in at 194Kb.) In other words, nothing extra is paid for using D rather than C for the same code.

The Hello World program is a little too trivial. Let’s step up in complexity to the infamous sieve benchmark program:

#include <stdio.h>

/* Eratosthenes Sieve prime number calculation. */

#define true    1
#define false   0
#define size    8190
#define sizepl  8191

char flags[sizepl];

int main() {
    int i, prime, k, count, iter;

    printf ("10 iterations\n");
    for (iter = 1; iter <= 10; iter++) {
        count = 0;
        for (i = 0; i <= size; i++)
            flags[i] = true;
        for (i = 0; i <= size; i++) {
            if (flags[i]) {
                prime = i + i + 3;
                k = i + prime;
                while (k <= size) {
                    flags[k] = false;
                    k += prime;
                }
                count += 1;
            }
        }
    }
    printf ("\n%d primes", count);
    return 0;
}

Rewriting it in Better C:

import core.stdc.stdio;

extern (C):

__gshared bool[8191] flags;

int main() {
    int count;

    printf("10 iterations\n");
    foreach (iter; 1 .. 11) {
        count = 0;
        flags[] = true;
        foreach (i; 0 .. flags.length) {
            if (flags[i]) {
                const prime = i + i + 3;
                auto k = i + prime;
                while (k < flags.length) {
                    flags[k] = false;
                    k += prime;
                }
                count += 1;
            }
        }
    }
    printf("%d primes\n", count);
    return 0;
}

It looks much the same, but some things are worthy of note:

  • extern (C): means use the C calling convention.
  • D normally puts static data into thread local storage. C sticks them in global storage. __gshared accomplishes that.
  • foreach is a simpler way of doing for loops over known endpoints.
  • flags[] = true; sets all the elements in flags to true in one go.
  • Using const tells the reader that prime never changes once it is initialized.
  • The types of iter, i, prime and k are inferred, preventing inadvertent type coercion errors.
  • The number of elements in flags is given by flags.length, not some independent variable.

And the last item leads to a very important hidden advantage: accesses to the flags array are bounds checked. No more overflow errors! We didn’t have to do anything
in particular to get that, either.

This is only the beginning of how D as Better C can improve the expressivity, readability, and safety of your existing C programs. For example, D has nested functions, which in my experience work very well at prying goto’s from my cold, dead fingers.

On a more personal note, ever since -betterC started working, I’ve been converting many of my old C programs still in use into D, one function at a time. Doing it one function at a time, and running the test suite after each change, keeps the program in a correctly working state at all times. If the program doesn’t work, I only have one function to look at to see where it went wrong. I don’t particularly care to maintain C programs anymore, and with -betterC there’s no longer any reason to.

The Better C ability of D is available in the 2.076.0 beta: download it and read the changelog.

Snowflake Strings

Walter Bright is the BDFL of the D Programming Language and founder of Digital Mars. He has decades of experience implementing compilers and interpreters for multiple languages, including Zortech C++, the first native C++ compiler. He also created Empire, the Wargame of the Century.


Consider the following D code in file.d:

int foo(int i) {
    assert(i < 3);
    return i;
}

This is equivalent to the C code:

#include <assert.h>

int foo(int i) {
    assert(i < 3);
    return i;
}

The assert() in the D code is “lowered” (i.e. rewritten by the compiler) to the following:

(i < 3 || _d_assertp("file.d", 2))

We’re interested in how the compiler writes that string literal, "file.d" to the generated object file. The most obvious implementation is to write the characters into the data section and push the address of that to call _d_assertp().

Indeed, that does work, and it’s tempting to stop there. But since we’re professional compiler nerds obsessed with performance, details, etc., there’s a lot more oil in that olive (to borrow one of Andrei’s favorite sayings). Let’s put it in a press and start turning the screw, because assert()s are used a lot.

First off, string literals are immutable (they were originally mutable in C and C++, but are no longer, and D tries to learn from such mistakes). This suggests the string can be put in read-only memory. What advantages does that deliver?

  • Read-only memory is, well, read only. Attempts to write to it are met with a seg fault exception courtesy of the CPU’s memory management logic. Seg faults are a bit graceless, like a cop pulling you over for driving on the wrong side of the road, but at least there wasn’t a head-on collision or corruption of the program’s memory.
  • Read-only pages are never swapped out by the virtual memory system; they never get marked as “dirty” because they are never written to. They may get discarded and reloaded, but that’s much less costly.
  • Read-only memory is safe from corruption by malware (unless the malware infects the MMU, sigh).
  • Read-only memory in a shared library is shared – copies do not have to be made for each user of the shared library.
  • Read-only memory does not need to be scanned for pointers to the heap by the garbage collection system (if the application does GC).

Essentially, shoving as much as possible into read-only memory is good for performance, size and safety. There’s the first drop of oil.

The next issue crops up as soon as there’s more than one assert:

int foo(int i, int j) {
    assert(i < 3);
    assert(j & 1);
    return i + j;
}

The compiler emits two copies of "file.d" into the object file. Since they’re identical, and read-only, it makes sense to only emit one copy:

string TMP = "file.d";
int foo(int i, int j) {
    (i < 3 || _d_assertp(TMP, 2))
    (j & 1 || _d_assertp(TMP, 3))
    return i + j;
}

This is called string pooling and is fairly simple to implement. The compiler maintains a hash table of string literals and their associated symbol names (TMP in this case).

So far, this is working reasonably well. But when asserts migrate into header files, macros, and templates, the same string can appear in lots of object files, since the compiler doesn’t know what is happening in other object files (the separate compilation model). Other string literals can exhibit this behavior, too, when generic coding practices are used. There needs to be some way to present these in the object file so the linker can pool identical strings.

The dmd D compiler currently supports four different object file formats on different platforms:

  • Elf, for Linux and FreeBSD
  • Mach-O, for OSX
  • MS-COFF, for Win64
  • OMF, for Win32

Each does it in a different way, with different tradeoffs. The methods tend to be woefully under documented, and figuring this stuff out is why I get paid the big bucks.

Elf

Elf turns out to have a magic section just for this purpose. It’s named .rodata.strN.M, where N is the number of bytes a character has and M is the alignment. For good old char strings, that would be .rodata.str1.1. The compiler just dumps the strings into that section, and the Elf linker looks through it, removing the redundant strings and adjusting the relocations accordingly. It’ll handle the usual string types – char, wchar, and dchar – with aplomb.

There are just a couple of flaws. The end of a string is determined by a nul character. This means that strings cannot have embedded nuls, or the linker will regard them as multiple strings and shuffle them about in unexpected ways. One cannot have relocations in those sections, either. This means it’s only good for C string literals, not other kinds of data.

This poses a problem for D, where the strings are length-delineated strings, not nul-terminated ones. Does this mean D is doomed to being unable to take advantage of the C-centric file formats and linker design? Not at all. The D compiler simply appends a nul when emitting string literals. If the string does have an embedded nul (allowed in D), it is not put in these special sections (and the benefit is lost, but such strings are thankfully rare).

Mach-O

Mach-O uses a variant of the Elf approach, a special section named __cstring. It’s more limited in that it only works with single-byte chars. No wchar or dchar strings for you! If there ever was confirmation that UTF-16 and UTF-32 are dead end string types, this should be it.

MS-COFF

Microsoft invented MS-COFF by extending the old Unix COFF format. It has many magic sections, but none specifically for strings. Instead, it uses what are called COMDAT sections, one for each string. COMDATs are sections tagged with a unique name, and when the linker is faced with multiple COMDATs with the same name, one is picked and all references to the other COMDATs are rewritten to refer to the Anointed One. COMDATs first appeared in object formats with the advent of C++ templates, since template code generation tends to generate the same code over and over in separate files.

(Isn’t it interesting how object file formats are driven by the needs of C and C++?)

The COMDAT for "hello" would look something like this:

??_C@_05CJBACGMB@hello?$AA@:
db 'h', 'e', 'l', 'l', 'o', 0

The tty noise there is the mangled name of the COMDAT, which is generated from the string literal’s contents. The algorithm must match across compilation units, as that is how the linker decides which ones are the same (experimenting with it will show that the substring CJBACGMB is some sort of hash). Microsoft’s algorithm for the mangling and hash is undocumented as far as I can determine, but it doesn’t matter anyway; all that’s needed is a 1:1 mapping between name and string literal. That screams “MD5 hash” to me, so that’s what dmd does. The name is an MD5 hash of the string literal contents, which also has the nice property that no matter how large the string gets, the identifier doesn’t grow.

COMDATs can have anything stuffed in them, so this is a feature that is usable for a lot more than just strings.

The downside of the COMDAT scheme is the space taken up by all those names, so shipping a program with the debug symbols in it could get much larger.

OMF

The caboose is OMF, an ancient format going back to the early 80’s. It was extended with a kludgy COMDAT system to support C++ just before most everyone abandoned it. DMD still emits it for Win32 programs. We’re stuck with it because it’s the only format the default linker (OPTLINK) understands, and so we find a way to press it into service.

Since it has COMDATs, that’s the mechanism used. The wrinkle is that COMDATs are code sections or data sections only; there are no other options. We want it to be read-only, so the string COMDATs are emitted as code sections (!). Hey, it works.

Conclusion

I don’t think we’ve pressed all the oil out of that olive yet. It may be like memcpy, where every new crop of programmers thinks of a way to speed it up.

I hope you’ve enjoyed our little tour of literals, and may all your string literals be unique snowflakes.

Thanks to Mike Parker for his help with this article.