Project Highlight: The New CTFE Engine

Posted on

CTFE (Compile-Time Function Execution) is today a core feature of the D Programming Language. D creator Walter Bright first implemented it in DMD as an extension of the constant folding logic that was already there. Don Clugston (of FastDelegate fame) made a pass at improving it and, according to Walter, “took it much further“. Since that time, usage of CTFE has shown up in one D project after another, including in D’s standard library. For example, Dmitry Olshansky employed it in his overhaul of std.regex to great effect.

On the last day of DConf 2016, Stefan Koch gave a lightning talk on his thoughts about CTFE in D. At the end of the talk, in response to a question from Andrei Alexandrescu on how D’s implementation could be improved, he said the following:

CTFE is really a hack. You can see that it’s a hack. It’s implemented as a hack. It is the most useful hack that I’ve ever seen, and it is definitely a hacker’s tool to do stuff that are like magic. But to be fast, it would need to be heavily redesigned, reimplemented, possibly executed in multiple threads, because it is used for stuff that we could never have envisioned when it was invented.

Not long after that, Stefan opened a discussion on the fourms and took up the torch to improve the CTFE engine. As to why he got started on this journey in the first place, Stefan says, “I started work on the CTFE engine because I said so at DConf.” But, of course, there’s more to it than that.

I have pretty heavy-weight CTFE needs (I worked on a compile-time trans-compiler). Also my CTFE SQLite reader is failing if you want to read a database bigger then 2MB at ctfe.

His investigations into the performance of the CTFE interpreter shed light on its problems.

The current interpreter interprets every AST-Node it sees directly. This leaves very little space to collect information about the code that is being interpreted. It doesn’t know when something will be used as a reference, so it needs to copy every variable on every mutation. It has to do a deep-copy for this. That means it copies the whole chain of mutations every time.

To clarify, he offers the following example.

Imagine foreach(i;0 .. 10) { a = i; }. On the first iteration we save a` = 0 and set a`` to 1. On the second iteration we save a``` = 1 and a````= 0 and we set a````` to 2 , then a`````` = 1 and a``````` = 0 and so on. As you can see, the memory requirements just shoot up. It’s basically a factorial function with a very small coefficient. That is why for very small workloads this extreme overhead is not noticeable.

That flaw looked unfixable. Indeed the whole architecture in dinterpret.d is very convoluted and hard to understand. I did a few experiments on improving memory-management of the interpreter but it proved fruitless.

Once he realized there was going to be no quick fix, Stefan sat down and drew up a plan to avoid digging himself into the same hole the current interpreter was in. The result of his planning led him down a road he hadn’t expected to travel.

Direct Interpretation was out of the question since it would give the new engine too little time to analyze data-flow and decided whether a copy was really needed or not. I had to implement an Intermediate Representation. It had to be portable to different evaluation back-ends. I ended up with a solution, inspired by OpenGL, of defining my interface in the form of function calls an evaluation back end had to implement. That meant I would not be able to simply modify the current interpreter. This made the start very steep, but it is a decision I do not regret.

His implementation consists of a front end and a back end.

The front end walks the AST and issues calls to the back end. And the back end transforms those calls into actual bytecode. This bytecode is interperted by the back end as soon as the front end requires it.

In terms of functionality, he likens the current implementation to an immediate mode graphics API, and his revamp to retained mode. In this case, though, it’s the immediate mode that’s the memory hog.

You can read about his progress in the CTFE Status thread, where he has been posting frequent updates. His updates include problems he encounters, features he implements, and performance statistics. Eventually, every compiler that uses the DMD front end will benefit from his improvements.

Big Performance Improvement for std.regex

Posted on

Dmitry Olshansky has been a frequent contributor to the D programming language. Perhaps his best known work is his overhaul of the std.regex module, which he architected as part of Google Summer of Code 2011. In this post, he describes an algorithmic optimization he implemented this past summer that resulted in a big performance win.

Optimizing std.regex has been my favorite pastime, but it has gotten harder over the years. It eventually became clear that micro-optimizing the engine’s state copy routine, or trying to avoid that extra write, wasn’t going to cut it anymore. To move further, I needed a new algorithmic improvement. This is how the so-called “Bit-NFA” came to be implemented. Developed in May of this year, it has come a long way to land in the main repository.

Before going into details, a short overview of the engine is called for. A user-specified pattern given to the engine first goes through a compilation process, where it gets transformed into a bytecode program, along with a bunch of lookup tables and auxiliary data-structures. Bytecode implies a VM, not unlike, say, the Java VM, but far simpler and more specific. In fact, there are two of them in std.regex, one that evaluates execution of threads in a backtracking manner and another one which evaluates all threads in lock-step, resolving any duplicates along the way.

Now, running a full blown VM, even a tiny one, on each character of input doesn’t sound all that high-performance. That’s why there is an extra trick, a kickstart engine (it should probably be called a “sidekick engine”), which is a dumb approximation of the full engine. It is run over the input first. When it spots something that looks like a match, the full engine is run to check it. The only requirement is that it can have no false negatives. That is, it has to detect as positive all matches of the regex pattern. This kickstart engine is the central piece of today’s post.

Historically, I intended for there to be a lot of different kickstart engines, ranging from a simple ‘memchr the first byte of the pattern’ to a Boyer-Moore search on the prefix of a pattern. But during the gory days of GSOC 2011, a simple solution came first and out-shadowed all others: the Shift Or algorithm.

Basically, it is an NFA (Nondeterministic Finite Automation), where each state is a bit in a word. Shifting this word advances all the states. Masking removes those that don’t match the current character. Importantly, shifting also places 0 as the first bit, indicating the active state.

With these two insights, the whole process of searching for a string becomes shifting + OR-masking the bits. The last point is checking for the successful match – one of the bits is in the finish state.

Looking at this marvelous construction, it’s tempting to try and overcome its limitation – the straight-forward execution of states. So let’s introduce some control flow by denoting some bits as jumps. To carry out a jump, we just need to map every combination of jump bits to the mask of the resulting positions. A basic hashmap could serve us well in this regard. Then the cycle becomes:

  1. Shift the word
  2. Capture control flow bits
  3. Lookup control flow table
  4. Mask AND with control flow bits
  5. Check for finish state(s) bits
  6. Mask OR with match filter table

In the end, we execute the whole engine with nothing more than a hash-map lookup, a table lookup, and a bit of bitwise operations. This is the essence of what I call the Bit-NFA engine.

Of course, there are some tricky bits, such as properly mapping bytecodes to bits. Then there comes Unicode.. oh gosh. The trick to Unicode, though, is having a fast path for ASCII < 0x80 and the rest. For ASCII, we just go with a simple table. Unicode is a two-staged variation of it. Two-staging the table let’s us coalesce identical pages, saving space for the whole 21 bits of the code point range.

Overall, the picture really is worth a thousand words. Here is how the new kickstart engine stacks up.

This now only leaves us to optimize the VMs further. The proven technique is JIT-ing the bytecode and is what top engines are doing. Still, I’m glad there are notable tricks to speed up regex execution in general without pulling out this heavy handed weapon.