Author Archives: Andrei Alexandrescu

Lomuto’s Comeback

The Continental Club in Austin, Texas, USA
Sunday, January 5, 1987

“Thank you for your kind invitation, Mr. Lomuto. I will soon return to England so this is quite timely.”

“And thanks for agreeing to meeting me, Mister… Sir… Charles… A.R… Hoare. It’s a great honor. I don’t even know how to address you. Were you knighted?”

“Call me Tony, and if it’s not too much imposition please allow me to call you Nico.”

On the surface, a banal scene—two men enjoying a whiskey. However, a closer look revealed a number of intriguing details. For starters, a tension you could cut with a knife.

Dressed in a perfectly tailored four-piece suit worn with the nonchalance only an Englishman could pull off, Tony Hoare was as British as a cup of tea. His resigned grimaces as he was sipping from his glass spoke volumes about his opinion of Bourbon versus Scotch. On the other side of the small table, Nico Lomuto couldn’t have been more different: a casually dressed coder enjoying his whiskey with Coca-Cola (a matter so outrageous that Tony had decided early on to studiously pretend not to notice, as he would when confronted with ripe body odor or an offensive tattoo), in a sort of relaxed awe at the sight of the Computer Science giant he had just met.

“Listen, Tony,” Nico said as the chit chat petered off, “about that partitioning algorithm. I never meant to publish or—”

“Oh? Yes, yes, the partitioning algorithm.” Tony’s eyebrows rose with feigned surprise, as if it had escaped his mind that every paper and book on quicksort in the past five years mentioned their names together. It was obviously the one thing connecting the two men and the motivation of the meeting, but Tony, the perfect gentleman, could talk about the weather for hours with a pink elephant in the room if his conversation partner didn’t bring it up.

“Yeah, that partitioning algorithm that keeps on getting mentioned together with yours,” Nico continued. “I’m not much of an algorithms theorist. I’m working on Ada, and this entire thing about my partition scheme is a distraction. The bothersome part about it”—Nico was speaking in the forthcoming tone of a man with nothing to hide—”is that it’s not even a better algorithm. My partitioning scheme will always do the same number of comparisons and at least as many swaps as yours. In the worst case, mine does n additional swaps—n! I can’t understand why they keep on mentioning the blessed thing. It’s out of my hands now. I can’t tell them what algorithms to teach and publish. It’s like bubblesort. Whenever anyone mentions quicksort, there’s some chowderhead—or should I say bubblehead—in the audience going, yes, I also heard of the bubblesort algorithm. Makes my blood curdle.”

Nico sighed. Tony nodded. Mutual values. Rapport filled the air in between as suddenly, quietly, and pleasantly as the smell of cookies out of the oven. A few seconds went by. Jack and Coke sip. On the other side of the table, Bourbon sip, resigned grimace.

Tony spoke with the carefully chosen words of a scientist who wants to leave no hypothesis unexplored. “I understand, Nico. Yet please consider the following. Your algorithm is simple and regular, moves in only one direction, and does at most one swap per step. That may be appropriate for some future machines that…”

“No matter the machine, more swaps can’t be better than fewer swaps. It’s common sense,” Nico said, peremptorily.

“I would not be so sure. Computers do not have common sense. Computers are surprising. It stands to reason they’ll continue to be. Well, how about we enjoy this evening. Nothing like a good conversation in a quiet club.”

“Yeah. Cheers. This is a fun place. I hear they’ll have live country music soon.”

“Charming.” Somewhat to his own surprise, Tony mustered a polite smile.

Chestnut Hill, Massachusetts, USA
Present Day

I’ve carried an unconfessed addiction to the sorting problem for many years. Wasn’t that difficult to hide—to a good extent, an obsessive inclination to studying sorting is a socially tolerated déformation professionnelle; no doubt many a programmer has spent a few late nights trying yet another sorting idea that’s going to be so much better than the others. So nobody raised an eyebrow when I wrote about sorting all the way back in 2002 (ever heard about “fit pivot?” Of course you didn’t). There was no intervention organized when I wrote D’s std.sort, which turned out to be sometimes quadratic (and has been thankfully fixed since). No scorn even when I wrote an academic paper on the selection problem (sort’s cousin) as an unaffiliated outsider, which even the conference organizers said was quite a trick. And no public outrage when I spoke about sorting at CppCon 2019. Coders understand.

So, I manage. You know what they say—one day at a time. Yet I did feel a tinge of excitement when I saw the title of a recent paper: “Branch Mispredictions Don’t Affect Mergesort.” Such an intriguing title. To start with, are branch mispredictions expected to affect mergesort? I don’t have much of an idea, mainly because everybody and their cat is using quicksort, not mergesort, so the latter hasn’t really been at the center of my focus. But hey, I don’t even need to worry about it because the title resolutely asserts that that problem I didn’t know I was supposed to worry about, I don’t need to worry about after all. So in a way the title cancels itself out. Yet I did read the paper (and recommend you do the same) and among many interesting insights, there was one that caught my attention: Lomuto’s partitioning scheme was discussed as a serious contender (against the universally-used Hoare partition) from an efficiency perspective. Efficiency!

It turns out modern computing architecture does, sometimes, violate common sense.

To Partition, Perchance to Sort

Let’s first recap the two partitioning schemes. Given an array and a pivot element, to partition the array means to arrange elements of the array such that all elements smaller than or equal to the pivot are on the left, and elements greater than or equal to the pivot are on the right. The final position of the pivot would be at the border. (If there are several equivalent pivot values that final pivot position may vary, with important practical consequences; for this discussion, however, we can assume that all array values are distinct.)

Lomuto’s partitioning scheme walks the array left to right maintaining a “read” position and a “write” position, both initially at 0. For each element read, if the value seen by the “read head” is greater than the pivot, it gets skipped (with the read head moving to the next position). Otherwise, the value at the read head is swapped with that at the write head, and both heads advance by one position. When the read head is done, the position of the write head defines the partition. Refer to the nice animation below (from Wikipedia user Mastremo, used unmodified under the CC-BY-SA 3.0 license).

The problem with Lomuto’s partition is that it may do unnecessary swaps. Consider the extreme case of an array with only the leftmost element greater than the pivot. That element will be awkwardly moved to the right one position per iteration step, in a manner not unlike, um, bubblesort.

Hoare’s partitioning scheme elegantly solves that issue by iterating concomitantly from both ends of the array with two “read/write heads”. They skip elements that are already appropriately placed (less than the pivot on the left, greater than the pivot on the right), and swap only one smaller element from the left with one greater element from the right. When the two heads meet, the array is partitioned around the meet point. The extreme case described above is handled with a single swap. Most contemporary implementations of quicksort use Hoare partition, for obvious reasons: it does as many comparisons as the Lomuto partition and fewer swaps.

Given that Hoare partition clearly does less work than Lomuto partition, the question would be why ever teach or use the latter at all. Alexander Stepanov, the creator of the STL, authored a great discussion about partitioning and makes a genericity argument: Lomuto partition only needs forward iterators, whereas Hoare partition requires bidirectional iterators. That’s a valuable insight, albeit of limited practical utility: yes, you could use Lomuto’s partition on singly-linked lists, but most of the time you partition for quicksort’s sake, and you don’t want to quicksort singly-linked lists; mergesort would be the algorithm of choice.

Yet a very practical—and very surprising—argument does exist, and is the punchline of this article: implemented in a branch-free manner, Lomuto partition is a lot faster than Hoare partition on random data. Given that quicksort spends most of its time partitioning, it follows that we are looking at a hefty improvement of quicksort (yes, I am talking about industrial strength implementations for C++ and D) by replacing its partitioning algorithm with one that literally does more work.

You read that right.

Time to Spin Some Code

To see how the cookie crumbles, let’s take a look at a careful implementation of Hoare partition. To eliminate all extraneous details, the code in this article is written for long as the element type and uses raw pointers. It compiles and runs the same with a C++ or D compiler. This article will carry along implementations of all routines in both languages because much research literature measures algorithm performance using C++’s std::sort as an important baseline.

/**
Partition using the minimum of the first and last element as pivot.
Returns: a pointer to the final position of the pivot.
*/
long* hoare_partition(long* first, long* last) {
    assert(first <= last);
    if (last - first < 2)
        return first; // nothing interesting to do
    --last;
    if (*first > *last)
        swap(*first, *last);
    auto pivot_pos = first;
    auto pivot = *pivot_pos;
    for (;;) {
        ++first;
        auto f = *first;
        while (f < pivot)
            f = *++first;
        auto l = *last;
        while (pivot < l)
            l = *--last;
        if (first >= last)
            break;
        *first = l;
        *last = f;
        --last;
    }
    --first;
    swap(*first, *pivot_pos);
    return first;
}

(You may find the choice of pivot a bit odd, but not to worry: usually it’s a more sophisticated scheme—such as median-of-3—but what’s important to the core loop is that the pivot is not the largest element of the array. That allows the core loop to omit a number of limit conditions without running off array bounds.)

There are a lot of good things to say about the efficiency of this implementation (which you’re likely to find, with minor details changed, in implementations of the C++ or D standard library). You could tell the code above was written by people who live trim lives. People who keep their nails clean, show up when they say they’ll show up, and call Mom regularly. They do a wushu routine every morning and don’t let computer cycles go to waste. That code has no slack in it. The generated Intel assembly is remarkably tight and virtually identical for C++ and D. It only varies across backends, with LLVM at a slight code size advantage (see clang and ldc) over gcc (see g++ and gdc).

The initial implementation of Lomuto’s partition shown below works well for exposition, but is sloppy from an efficiency perspective:

/**
Choose the pivot as the first element, then partition.
Returns: a pointer to the final position of the pivot. 
*/
long* lomuto_partition_naive(long* first, long* last) {
    assert(first <= last);
    if (last - first < 2)
        return first; // nothing interesting to do
    auto pivot_pos = first;
    auto pivot = *first;
    ++first;
    for (auto read = first; read < last; ++read) {
        if (*read < pivot) {
            swap(*read, *first);
            ++first;
        }
    }
    --first;
    swap(*first, *pivot_pos);
    return first;
}

For starters, the code above will do a lot of silly no-op swaps (array element with itself) if a bunch of elements on the left of the array are greater than the pivot. All that time first==write, so swapping *first with *write is unnecessary and wasteful. Let’s fix that with a pre-processing loop that skips the uninteresting initial portion:

/**
Partition using the minimum of the first and last element as pivot. 
Returns: a pointer to the final position of the pivot.
*/
long* lomuto_partition(long* first, long* last) {
    assert(first <= last);
    if (last - first < 2)
        return first; // nothing interesting to do
    --last;
    if (*first > *last)
        swap(*first, *last);
    auto pivot_pos = first;
    auto pivot = *first;
    // Prelude: position first (the write head) on the first element
    // larger than the pivot.
    do {
        ++first;
    } while (*first < pivot);
    assert(first <= last);
    // Main course.
    for (auto read = first + 1; read < last; ++read) {
        auto x = *read;
        if (x < pivot) {
            *read = *first;
            *first = x;
            ++first;
        }
    }
    // Put the pivot where it belongs.
    assert(*first >= pivot);
    --first;
    *pivot_pos = *first;
    *first = pivot;
    return first;
}

The function now chooses the pivot as the smallest of first and last element, just like hoare_partition. I also made another small change—instead of using the swap routine, let’s use explicit assignments. The optimizer takes care of that automatically (enregistering plus register allocation for the win), but expressing it in source helps us see the relatively expensive array reads and array writes. Now for the interesting part. Let’s focus on the core loop:

for (auto read = first + 1; read < last; ++read) {
    auto x = *read;
    if (x < pivot) {
        *read = *first;
        *first = x;
        ++first;
    }
}

Let’s think statistics. There are two conditionals in this loop: read < last and x < pivot. How predictable are they? Well, the first one is eminently predictable—you can reliably predict it will always be true, and you’ll only be wrong once no matter how large the array is. Compiler writers and hardware designers know this, and design the fastest path under the assumption loops will continue. (Gift idea for your Intel engineer friend: a doormat that reads “The Backward Branch Is Always Taken.”) The CPU will speculatively start executing the next iteration of the loop even before having decided whether the loop should continue. That work will be thrown away only once, at the end of the loop. That’s the magic of speculative execution.

Things are quite a bit less pleasant with the second test, x < pivot. If you assume random data and a randomly-chosen pivot, it could go either way with equal probability. That means speculative execution is not effective at all, which is very bad for efficiency. How bad? In a deeply pipelined architecture (as all are today), failed speculation means the work done by several pipeline stages needs to be thrown away, which in turn propagates a bubble of uselessness through the pipeline (think air bubbles in a garden hose). If these bubbles occur too frequently, the loop produces results at only a fraction of the attainable bandwidth. As the measurements section will show, that one wasted speculation takes away about 30% of the potential speed.

How to improve on this problem? Here’s an idea: instead of making decisions that control the flow of execution, we write the code in a straight-line manner and we incorporate the decisions as integers that guide the data flow by means of carefully chosen array indexing. Be prepared—this will force us to do silly things. For example, instead of doing two conditional writes per iteration, we’ll do exactly two writes per iteration no matter what. If the writes were not needed, we’ll overwrite words in memory with their own value. (Did I mention “silly things”?) To prepare the code for all that, let’s rewrite it as follows:

for (auto read = first + 1; read < last; ++read) {
    auto x = *read;
    if (x < pivot) {
        *read = *first;
        *first = x;
        first += 1; 
    } else {
        *read = x;
        *first = *first;
        first += 0; 
    }
}

Now the two branches of the loop are almost identical save for the data. The code is still correct (albeit odd) because on the else branch it needlessly writes *read over itself and *first over itself. How do we now unify the two branches? Doing so in an efficient manner takes a bit of pondering and experimentation. Conditionally incrementing first is easy because we can always write first += x < pivot. Piece of cake. The two memory writes are more difficult, but the basic idea is to take the difference between pointers and use indexing. Here’s the code. Take a minute to think it over:

for (; read < last; ++read) {
    auto x = *read;
    auto smaller = -int(x < pivot);
    auto delta = smaller & (read - first);
    first[delta] = *first;
    read[-delta] = x;
    first -= smaller;
}

To paraphrase a famous Latin aphorism, explanatio longa, codex brevis est. Short is the code, long is the ‘splanation. The initialization of smaller with -int(x < pivot) looks odd but has a good reason: smaller can serve as both an integral (0 or -1) used with the usual arithmetic and also as a mask that is 0 or 0xFFFFFFFF (i.e. bits set all to 0 or all to 1) used with bitwise operations. We will use that mask to allow or obliterate another integral in the next line that computes delta. If x < pivotsmaller is all ones and delta gets initialized to read - first. Subsequently, delta is used in first[delta] and read[-delta], which really are syntactic sugar for *(first + delta) and *(read - delta), respectively. If we substitute delta in those expressions, we obtain *(first + (read - first)) and *(read - (read - first)), respectively.

The last line, first -= smaller, is trivial: if x < pivot, subtract -1 from first, which is the same as incrementing first. Otherwise, subtract 0 from first, effectively leaving first alone. Nicely done.

With x < pivot substituted to 1, the calculation done in the body of the loop becomes:

auto x = *read;
int smaller = -1;
auto delta = -1 & (read - first);
*(first + (read - first)) = *first;
*(read - (read - first)) = x;
first -= -1;

Kind of magically the two pointer expressions simplify down to *read and *first, so the two assignments effect a swap (recall that x had been just initialized with *read). Exactly what we did in the true branch of the test in the initial version!

If x < pivot is false, delta gets initialized to zero and the loop body works as follows:

auto x = *read;
int smaller = 0;
auto delta = 0 & (read - first);
*(first + 0) = *first;
*(read - 0) = x;
first -= 0;

This time things are simpler: *first gets written over itself, *read also gets written over itself, and first is left alone. The code has no effect whatsoever, which is exactly what we wanted.

Let’s now take a look at the entire function:

long* lomuto_partition_branchfree(long* first, long* last) {
    assert(first <= last);
    if (last - first < 2)
        return first; // nothing interesting to do
    --last;
    if (*first > *last)
        swap(*first, *last);
    auto pivot_pos = first;
    auto pivot = *first;
    do {
        ++first;
        assert(first <= last);
    } while (*first < pivot);
    for (auto read = first + 1; read < last; ++read) {
        auto x = *read;
        auto smaller = -int(x < pivot);
        auto delta = smaller & (read - first);
        first[delta] = *first;
        read[-delta] = x;
        first -= smaller;
    }
    assert(*first >= pivot);
    --first;
    *pivot_pos = *first;
    *first = pivot;
    return first;
}

A beaut, isn’t she? Even more beautiful is the generated code—take a look at clang/ldc and g++/gdc. Again, there is a bit of variation across backends.

Experiments and Results

All code is available at https://github.com/andralex/lomuto.

To draw a fair comparison between the two partitioning schemes, I’ve put together a quicksort implementation. This is because most often a partition would be used during quicksort. For the sake of simplification, the test implementation omits a few details present in industrial quicksort implementations, which need to worry about a variety of data shapes (partially presorted ascending or descending, with local patterns, with many duplicates, etc). Library implementations choose the pivot carefully from a sample of usually 3-9 elements, possibly with randomization, and have means to detect and avoid pathological inputs, most often by using Introsort.

In our benchmark, for simplicity, we only test against random data, and the choice of pivot is simply the minimum of first and last element. This is without loss of generality; better pivot choices and adding introspection are done the same way regardless of the partitioning method. Here, the focus is to compare the performance of Hoare vs. Lomuto vs. branch-free Lomuto.

The charts below plot the time taken by one sorting operation depending on the input size. The machine used is an Intel i7-4790 at 3600 MHz with a 256KB/1MB/8MB cache hierarchy running Ubuntu 20.04. All builds were for maximum speed (-O3, no assertions, no boundcheck for the D language). The input is a pseudorandom permutation of longs with the same seed for all languages and platforms. To eliminate noise, the minimum is taken across several epochs.

The results for the D language are shown below, including the standard library’s std.sort as a baseline.

The results for C++ are shown in the plots below. Again the standard library implementation std::sort is included as a baseline.

One important measurement is the CPU utilization efficiency, shown by Intel VTune as “the micropipe”, a diagram illustrating inefficiencies in resource utilization. VTune’s reports are very detailed but the micropipe gives a quick notion of where the work goes. To interpret a micropipe, think of it as a funnel. The narrower the exit (on the right), the slower the actual execution as a fraction of potential speed.

The micropipes shown below correspond to the Hoare partition, Lomuto partition (in the traditional implementation), and branch-free Lomuto partition. The first two throw away about 30% of all work as bad speculation. In contrast, the Lomuto branchless partition wastes no work on speculation, which allows it a better efficiency in spite of more memory writes.

Intel VTune pipe efficiency diagram for the Hoare partition. A large percentage of work is wasted on failed speculation.

Intel VTune pipe efficiency diagram for the traditional “branchy” Lomuto partition, featuring about as much failed speculation as the Hoare partition.

Intel VTune pipe efficiency diagram for the Lomuto branch-free partition. Virtually no work is wasted on failed speculation, leading to a much better efficiency.

Discussion

The four versions (two languages times two backends) exhibit slight variations due to differences in standard library implementations and backend versions. It is not surprising that minute variations in generated code are liable to create measurable differences in execution speed.

As expected, the “branchy” Lomuto partition compares poorly with Hoare partition, especially for large input sizes. Both are within the same league as the standard library implementation of the sort routine. Sorting using the branchless Lomuto partition, however, is the clear winner by a large margin regardless of platform, backend, and input size.

It has become increasingly clear during the past few years that algorithm analysis—and proposals for improvements derived from it—cannot be done solely with pen and paper using stylized computer architectures with simplistic cost models. The efficiency of sorting is determined by a lot more than counting the comparisons and swaps—at least, it seems, the predictability of comparisons must be taken into account. In the future, I am hopeful that better models of computation will allow researchers to rein in the complexity. For the time being, it seems, algorithm optimization remains hopelessly experimental.

For sorting in particular, Lomuto is definitely back and should be considered by industrial implementations of quicksort on architectures with speculative execution.

Acknowledgments

Many thanks are due to Amr Elmasry, Jyrki Katajainen, and Max Stenmark for an inspirational paper. I haven’t yet been able to engineer a mergesort implementation (the main result of their paper) that beats the best quicksort described here, but I’m working on it. (Sorry, Sorters Anonymous… I’m still off the wagon.) I’d like to thank to Michael Parker and the commentators at the end of this post for fixing many of my non-native-speaker-isms. (Why do they say “pretend not to notice” and “pretend to not notice”? I never remember the right one.) Of course, most of the credit is due to Nico Lomuto, who defined an algorithm that hasn’t just stood the test of time—it passed it.

Introspection, Introspection Everywhere

Prelude: Orem, UT, May 29 2015

Just finished delivering my keynote. Exiting the character, I’m half dead. People say it needs to look easy. Yeah, just get up there and start saying things. Like it’s natural. Spontaneous. Not for me it’s not. Weeks before any public talk, I can only think of how to handle it. What to say. What angles come up. The questions. The little jokes made in real time. Being consistently spontaneous requires so much rehearsal. The irony.

So all I want now is sneak into my hotel room. Replenish the inner introvert after the ultimate act of extroversion. Lay on the bed thinking “What the heck was that?” for the rest of the day. Bit of slalom to get to the door. Almost there. In the hallway, an animated gentleman talks to a few others. He sports a shaved head and a pair of eyebrows that just won’t quit. Stands just by the door, notices me in the corner of his eye, and it’s instantly clear to both of us he’s waiting for me. Still, he delicately gives me the chance to pretend I didn’t notice and walk around him. “Hi, I’m Andrei.” “Liran, co-founder and CTO of Weka.IO. I’m leaving a bit early and before that I wanted to tell you—you should come visit us in Tel Aviv. We’ve been using D for a year now in a large system. I think you’ll like what you see. I might also arrange you a Google tech talk and visits at a couple of universities. Think it over.”

Tel Aviv, May 8 2017

Coming out of the hotel, heat hits like a punch. We’re talking 41 Celsius (before you pull that calculator: 106 Fahrenheit), if you’re lucky to be in the shade. Zohar, software engineer and factotum extraordinaire at Weka, is driving us on the busy streets of Tel Aviv to his employer’s headquarters. A fascinating exotic place, so far away from my neck of the woods.

First, Liran gives me an overview of their system—a large-scale distributed storage based on flash memory. Not my specialty, so I’m desperately searching my mind for trick questions—Information Theory—wait! peeps a lonely neuron who hasn’t fired since 1993—Reed-Solomon and friends, used to know about it. (Loved that class. The professor was quite a character. Wrote anticommunist samizdat poetry before it was cool. Or even legal. True story.) “How do you deal with data corruption when you have such a low write amplification?” “Glad you asked!” (It’s actually me who’s glad. Yay, I didn’t ask a stupid question.) “We use a redundant encoding with error correction properties; you get to choose the trade-off between redundancy and failure tolerance. At any rate, blind data duplication is mathematically a gross thing to do.” I ask a bunch more questions, and clearly these guys did their homework. The numbers look impressive, too—they beat specialized hardware at virtually all metrics. “What’s the trick?” I finally ask. Liran smiles. “I get that all the time. There’s not one trick. There’s a thousand carefully motivated, principled things we do, from the high level math down to the machine code in the drivers. The trick is to do everything right.”

We now need to hop to my first stop of the tour—Tel Aviv University. Liran accompanies me, partly to see his alma mater after twenty years. Small, intimate audience; it’s the regular meeting of their programming languages research group. A few graduate students, a postdoc, and a couple of professors. The talk goes over smoothly. We get to spend five whole minutes on an oddball—what’s the role of the two semicolons in here?

mixin(op ~ "payload;");

It’s tricky. mixin is a statement so it needs a terminator, hence the semicolon at the very end. In turn, mixin takes a string (the concatenation of variable op, which in this case happens to be either "++" or "--", and "payload;") and passes it to the compiler as a statement. Mix this string in the code, so to say. It follows that the string ultimately compiled is "++payload;" or "--payload;". The trick is the generated statement needs its own semicolon. However, within an expression context, mixin is an expression so no more need for the additional semicolon:

auto x = mixin(op ~ "payload"); // mixin is an expression here, no semicolon!

This seems to leave one researcher a bit unhappy, but I point out that all macro systems have their oddities. He agrees and the meeting ends on a cordial note.

The evening ends with dinner and beers with engineers at Weka. The place is called “Truck Deluxe” and it features an actual food truck parked inside the restaurant. We discuss a million things. I am so galvanized I can only sleep for four hours.

May 9 2017

Omg omg OMG. The alarm clock rings at 6:50 AM, and then again at 7 and 7:10, in ever more annoying tones. Just as I’d set it up anticipating the cunning ways of my consciousness to become alert just enough to carefully press the stop—careful, not the snooze—button, before slumbering again. Somewhat to my surprise I make it in time for meeting Liran to depart to Haifa. Technion University is next.

Small meeting again; we start with a handful of folks, but word goes on the grapevine and a few more join during the act. Nice! I pity Liran—he knows the talk by heart by now, including my jokes. I try to make a couple new ones to entertain him a bit, too. It all goes over nicely, and I get great questions. One that keeps on coming is: how do you debug all that compile-time code? To which I quip: “Ever heard of printf-based debugging? Yeah, I thought so. That’s pretty much what you get to do now, in the following form:”

pragma(msg, string_expression);

The expression is evaluated and printed if and only if the compiler actually “goes through” that line, i.e. you can use it to tell which branch in a static if was taken. It is becoming clear, however, that to scale up Design by Introspection more tooling support would be needed.

Back at Weka, as soon as I get parked by a visitor desk, folks simply start showing up to talk with me, one by one or in small groups. Their simple phonetic names—Eyal, Orem, Tomer, Shachar, Maory, Or, …—are a welcome cognitive offload for yours socially inept truly. They suggest improvements to the language, ask me about better ways to do this or that, and show me code that doesn’t work the way it should.

In this faraway place it’s only now, as soon as I see their code, that I feel at home. They get it! These people don’t use the D language like “whatevs”. They don’t even use it as “let’s make this more interesting.” They’re using it as strategic advantage to beat hardware storage companies at their own game, whilst moving unfairly faster than their software storage competitors. The code is as I’d envisioned the D language would be used at its best for high leverage: introspection, introspection everywhere. Compile time everything that can be done at compile time. It’s difficult to find ten lines of code without a static if in there—the magic fork in design space. I run wc -l and comment on the relatively compact code base. They nod approvingly. “We’ve added a bunch of features since last year, yet the code size has stayed within 5%.” I, too, had noticed that highly introspective code has an odd way of feeding upon itself. Ever more behaviors flow through the same lines.

Their questions and comments are intent, focused, so much unlike the stereotypical sterile forum debate. They have no interest to show off, prove me wrong, or win a theoretical argument; all they need is to do good work. It is obvious to all involved that a better language would help them in the way better materials can help architects and builders. Some template-based idioms are awfully slower to compile than others, how can we develop some tooling to inform us about that? Ideas get bounced. Plans emerge. I put a smudgy finger on the screen: “Hmm, I wonder how that’s done in other languages.” The folks around me chuckle. “We couldn’t have done that in any other language.” I decide to take it as a compliment.

A few of us dine at a fancy restaurant (got to sample the local beer—delicious!), where technical discussions go on unabated. These folks are sharp, full of ideas, and intensely enthusiastic. They fully realize my visit there offers the opportunity to shed collective months of toil from their lives and to propel their work faster. I literally haven’t had ten minutes with myself through the day. The only time I get to check email on my phone is literally when I lock myself into (pardon) a restroom stall. Tomorrow my plan is to sleep in, but there’s so much going on, and so much more yet to come, that again I can only sleep a couple of hours.

May 10 2017

Today’s the big day: the Google Campus Tel Aviv talk. We’re looking at over 160 attendees this evening. But before that there’s more talking to people at Weka. Attribute calculus comes on the table. So for example we have @safe, @trusted, and @system with a little algebra: @safe functions can call only @safe and @trusted functions. Any other function may call any function, and inference works on top of everything. That’s how you encapsulate unsafe code in a large system—all nice and dandy. But how about letting users define their own attributes and calculi? For example, a “may switch fibers” attribute such that functions that yield cannot be called naively. Or a “has acquired a lock” attribute. From here to symbolic computation and execution cost estimation the road is short! To be sure, that would complicate the language. But looking at their code it’s clear how it would be mightily helped by such stuff.

I got a lunchtime talk scheduled with all Weka employees in attendance, but I’m relaxed—by this point it’s just a family reunion. Better yet, one in which I get to be the offbeat uncle. My only concern is novelty—everything I preach, these folks have lived for two years already. Shortly before the talk Liran comes to me and asks “Do you think you have a bit more advanced material for us?” Sorry mate, I’m at capacity over here—can’t produce revolutionary material in the next five minutes. Fortunately the talk goes well in the sense that Design by Introspection formalizes and internalizes the many things they’ve already been doing. They appreciate that, and I get a warm reception and a great Q&A session.

As I get to the Google campus in Tel Aviv, late afternoon, the crowds start to gather. This is big! And I feel like I’d kill myself right now! I mentioned I take weeks to prepare a single public appearance, by the end of which I will definitely have burned a few neurons. And I’m fine with that—it’s the way it’s supposed to be. Problem is, this one week packs seven appearances. The hosts offer me coffee before, during, and after the talk. Coffee in Israel is delicious, but I’m getting past the point where caffeine may help.

Before the talk I get to chit chat incognito with a few attendees. One asks whether a celebrity will be speaking. “No, it’s just me.” He’s nice enough to not display disappointment.

There’s something in the air that tells you immediately whether an audience will be welcoming or not so much. Like a smell. And the scent in this room is fabulously festive. These folks are here to have a good time, and all I need to do to keep the party going is, in the words of John Lakos, “show up, babble, and drool.” The talk goes over so well, they aren’t bored even two hours later. Person in the front row seems continuously confused, asks a bunch of questions. Fortunately we get to an understanding before resorting to the dreaded “let’s take this offline.” Best questions come of course from soft-spoken dudes in the last row. As I poke fun at Mozilla’s CheckedInt, I realize Rust is also Mozilla’s and I fear malice by proxy will be alleged. Too late. (Late at night, I double checked. Mozilla’s CheckedInt is just as bad as I remembered. They do a division to test for multiplication overflow. Come on, put a line of assembler in there! Portability is worth a price, just not any price.) My talk ends just in time for me to not die. I’m happy. If you’re exhausted it means it went well.

May 11 2017

Again with the early wake-up, this time to catch a train to Beersheba. Zohar rides with me. The train is busy with folks from all walks of life. I trip over the barrel of a Galil. Its owner—a girl in uniform—exchanges smiles with me.

Two talks are on the roster today, college followed by Ben Gurion University. The first goes well except one detail. It’s so hot and so many people in the room that the AC decides—hey, y’know, I don’t care much for all that Design by Introspection. It’s so hot, folks don’t even bother to protest my increasingly controversial jokes about various languages. I take the risk to give them a coffee break in the middle thus giving them the opportunity to leave discreetly. To my pleasant surprise, they all return to the sauna for part deux.

The AC works great at Ben Gurion University, but here I’m facing a different problem: it’s the dreaded right-after-lunch spot and some people have difficulty staying awake. Somebody gives up and leaves after 20 minutes. Fortunately a handful of enthused (and probably hungry) students and one professor get into it. The professor really likes the possibilities opened by Design by Introspection and loves the whole macro expansion idea. Asks me a bazillion questions after the talk that make it clear he’s been hacking code generation engines for years. Hope to hear back from him.

Just when I think I’m done, there’s one more small event back at Weka. A few entrepreneurs and executives, friends of Weka’s founders, got wind of their successful use of the D language and were curious to have a sit down with me. One researcher, too. I’m well-prepared for a technical discussion: yes, we know how to do safety. We have a solution in the works for applications that don’t want the garbage collector. “So far so good,” told himself the lamb walking into the slaughterhouse.

To my surprise, the concerns these leaders have are entirely nontechnical. It’s all about community building, leadership, availability of libraries and expertise, website, the “first five minutes” experience, and such. These people have clearly have done their homework; they know the main contributors by name along with their roles, and are familiar with the trendy topics and challenges within the D language community.

“Your package distribution system needs ranking,” mentions a CTO to approving nods from the others. “Downloads, stars, activity—all criteria should be available for sorting and filtering. People say github is better than sourceforge because the latter has a bunch of crap. In fact I’m sure github has even more crap than sourceforge, it’s just that you don’t see it because of ranking.”

“Leadership could be better,” mentions a successful serial entrepreneur. “There’s no shortage of great ideas in the community, and engagement should be toward getting work done on those great ideas. Problem is, great ideas are easy to debate against because almost by definition what makes them great is also what takes them off the beaten path. They’re controversial. There’s risk to them. They may even seem impossible. So those get pecked to death. What gets implemented is good ideas, the less controversial ones. But you don’t want to go with the good ideas. You want the great ideas.”

And so it goes for over three hours. My head is buzzing with excitement as Zohar drives us back to the hotel. The ideas. The opportunities. The responsibility. These people at Weka have a lot riding on the D language. It’s not only money—”not that there’s anything wrong with that!”—but also their hopes, dreams, pride of workmanship. The prime of their careers. We’ve all got our work cut out for us.

I tell Zohar no need to wake up early again tomorrow, I’ll just get a cab to the airport. I reckon a pile of backlogged work is waiting for him. He’s relieved. “Man, I’m so glad you said that. I’m totally pooped after this week. I don’t know how you do it.”

Funny thing is, I don’t know either.