Faster Command Line Tools in D

Posted on

Jon Degenhardt is a member of eBay’s search team focusing on recall, ranking, and search engine design. He is also the author of eBay’s TSV utilities, an open source data mining toolkit for delimited files. The toolkit is written in D and is quite fast. Much of its performance derives from approaches like those described here.


This post will show how a few simple D programming constructs can turn an already fast command line tool into one that really screams, and in ways that retain the inherent simplicity of the original program. The techniques used are applicable to many programming problems, not just command line tools. This post describes how these methods work and why they are effective. A simple programming exercise is used to illustrate these optimizations. Applying the optimizations cuts the run-time by more than half.

Task: Aggregate values in a delimited file based on a key

It’s a common programming task: Take a data file with fields separated by a delimiter (comma, tab, etc), and run a mathematical calculation involving several of the fields. Often these programs are one-time use scripts, other times they have longer shelf life. Speed is of course appreciated when the program is used more than a few times on large files.

The specific exercise we’ll explore starts with files having keys in one field, integer values in another. The task is to sum the values for each key and print the key with the largest sum. For example:

A   4
B   5
B   8
C   9
A   6

With the first field as key, second field as value, the key with the max sum is B, with a total of 13.

Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments. Below is a Python program to do this:

max_column_sum_by_key.py

#!/usr/bin/env python

import argparse
import fileinput
import collections

def main():
    parser = argparse.ArgumentParser(description='Sum a column.')
    parser.add_argument('file', type=open)
    parser.add_argument('key_field_index', type=int)
    parser.add_argument('value_field_index', type=int)

    args = parser.parse_args()
    delim = '\t'

    max_field_index = max(args.key_field_index, args.value_field_index)
    sum_by_key = collections.Counter()

    for line in args.file:
        fields = line.rstrip('\n').split(delim)
        if max_field_index < len(fields):
            sum_by_key[fields[args.key_field_index]] += int(fields[args.value_field_index])

    max_entry = sum_by_key.most_common(1)
    if len(max_entry) == 0:
        print 'No entries'
    else:
        print 'max_key:', max_entry[0][0], 'sum:', max_entry[0][1]

if __name__ == '__main__':
    main()

(Note: For brevity, error handling is largely omitted from programs shown.)

The program follows a familiar paradigm. A dictionary (collections.Counter) holds the cumulative sum for each key. The file is read one line at a time, splitting each line into an array of fields. The key and value are extracted. The value field is converted to an integer and added to the cumulative sum for the key. After the program processes all of the lines, it extracts the entry with the largest value from the dictionary.

The D program, first try

It’s a common way to explore a new programming language: write one of these simple programs and see what happens. Here’s a D version of the program, using perhaps the most obvious approach:

max_column_sum_by_key_v1.d

int main(string[] args)
{
    import std.algorithm : max, maxElement;
    import std.array : split;
    import std.conv : to;
    import std.stdio;

    if (args.length < 4)
    {
        writefln ("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    char delim = '\t';

    long[string] sumByKey;

    foreach(line; filename.File.byLine)
    {
        auto fields = line.split(delim);
        if (maxFieldIndex < fields.length)
        {
            string key = fields[keyFieldIndex].to!string;
            sumByKey[key] += fields[valueFieldIndex].to!long;
        }
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}

Processing is basically the same as the Python program. An associative array (long[string] sumByKey) holds the cumulative sum for each key. Like the Python program, it splits each line into an array of fields, extracts the key and value fields, and updates the cumulative sum. Finally, it retrieves and prints the entry with the maximum value.

We will measure performance using an ngram file from the Google Books project: googlebooks-eng-all-1gram-20120701-0 (ngrams.tsv in these runs). This file is 10.5 million lines, 183 MB. Each line has four fields: the ngram, year, total occurrence count, and the number of books the ngram appeared in. Visit the ngram viewer dataset page for more information. The file chosen is for unigrams starting with the digit zero. Here are a few lines from the file:

0       1898    114067  6140
0       1906    208805  7933
0       1922    204995  9042
0.5     1986    143398  13938
0.5     1999    191449  19262

The year (second column) is used as the key, and the total occurrence count (third column) as the value. There are 414 distinct years in the data file.

The LDC compiler is used to build the D programs, as it generates fast code:

$ ldc2 -release -O max_column_sum_by_key_v1.d

Here are the commands to perform the task:

$ max_column_sum_by_key.py ngrams.tsv 1 2   # Python program
max_key: 2006 sum: 22569013

$ max_column_sum_by_key_v1 ngrams.tsv 1 2   # D program
max_key: 2006 sum: 22569013

(Note: These programs use field numbers starting at zero.)

The time command was used to measure performance. e.g. $ time max_column_sum_by_key.py ngrams.tsv 1 2. On the author’s MacBook Pro, the Python version takes 12.6 seconds, the D program takes 3.2 seconds. This makes sense as the D program is compiled to native code. But suppose we run the Python program with PyPy, a just-in-time Python compiler? This gives a result of 2.4 seconds, actually beating the D program, with no changes to the Python code. Kudos to PyPy, this is an impressive result. But we can still do better with our D program.

Second version: Using splitter

The first key to improved performance is to switch from using split to splitter. The split function is “eager”, in that it constructs and returns the entire array of fields. Eventually the storage for these fields needs to be deallocated. splitter is “lazy”. It operates by returning an input range that iterates over the fields one-at-a-time. We can take advantage of that by avoiding constructing the entire array, and instead keeping a single field at a time in a reused local variable. Here is an augmented program that does this, the main change being the introduction of an inner loop iterating over each field:

max_column_sum_by_key_v2.d

int main(string[] args)
{
    import std.algorithm : max, maxElement, splitter;
    import std.conv : to;
    import std.range : enumerate;
    import std.stdio;

    if (args.length < 4)
    {
        writefln ("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    string delim = "\t";

    long[string] sumByKey;

    foreach(line; filename.File.byLine)
    {
        string key;
        long value;
        bool allFound = false;

        foreach (i, field; line.splitter(delim).enumerate)
        {
            if (i == keyFieldIndex) key = field.to!string;
            if (i == valueFieldIndex) value = field.to!long;
            if (i == maxFieldIndex) allFound = true;
        }

        if (allFound) sumByKey[key] += value;
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}

The modified program is quite a bit faster, running in 1.8 seconds, a 44% improvement. Insight into what changed can be seen by using the --DRT-gcopt=profile:1 command line option. This turns on garbage collection profiling, shown below (output edited for brevity):

$ max_column_sum_by_key_v1 --DRT-gcopt=profile:1 ngrams.tsv 1 2
max_key: 2006 sum: 22569013
        Number of collections:  132
        Grand total GC time:  246 milliseconds
GC summary:   35 MB,  132 GC  246 ms

$ max_column_sum_by_key_v2 --DRT-gcopt=profile:1 ngrams.tsv 1 2
max_key: 2006 sum: 22569013
      Number of collections:  167
      Grand total GC time:  101 milliseconds
GC summary:    5 MB,  167 GC  101 ms

(Note: The --DRT-gcopt=profile:1 parameter is invisible to normal option processing.)

The reports show two key differences. One is the ‘max pool memory’, the first value shown on the “GC summary line”. The significantly lower value indicates less memory is being allocated. The other is the total time spent in collections. The improvement, 145ms, only accounts for a small portion of the 1.4 seconds that were shaved off by the second version. However, there are other costs associated with storage allocation. Note that allocating and reclaiming storage has a cost in any memory management system. This is not limited to systems using garbage collection.

Also worth mentioning is the role D’s slices play. When splitter returns the next field, it is not returning a copy of characters in the line. Instead, it is returning a “slice”. The data type is a char[], which is effectively a pointer to a location in the input line and a length. No characters have been copied. When the next field is fetched, the variable holding the slice is updated (pointer and length), a faster operation than copying a variable-length array of characters. This is a remarkably good fit for processing delimited files, as identifying the individual fields can be done without copying the input characters.

Third version: The splitter / Appender combo

Switching to splitter was a big speed win, but came with a less convenient programming model. Extracting specific fields while iterating over them is cumbersome, more so as additional fields are needed. Fortunately, the simplicity of random access arrays can be reclaimed by using an Appender. Here is a revised program:

max_column_sum_by_key_v3.d

int main(string[] args)
{
    import std.algorithm : max, maxElement, splitter;
    import std.array : appender;
    import std.conv : to;
    import std.stdio;

    if (args.length < 4)
    {
        writefln ("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    string delim = "\t";

    long[string] sumByKey;
    auto fields = appender!(char[][])();

    foreach(line; filename.File.byLine)
    {
        fields.clear;
        fields.put(line.splitter(delim));
        if (maxFieldIndex < fields.data.length)
        {
            string key = fields.data[keyFieldIndex].to!string;
            sumByKey[key] += fields.data[valueFieldIndex].to!long;
        }
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}

The Appender instance in this program works by keeping a growable array of char[] slices. The lines:

    fields.clear;
    fields.put(line.splitter(delim));

at the top of the foreach loop do the work. The statement fields.put(line.splitter(delim)) iterates over each field, one at a time, appending each slice to the array. This will allocate storage on the first input line. On subsequent lines, the fields.clear statement comes into play. It clears data from the underlying data store, but does not deallocate it. Appending starts again at position zero, but reusing the storage allocated on the first input line. This regains the simplicity of indexing a materialized array. GC profiling shows no change from the previous version of the program.

Copying additional slices does incur a performance penalty. The resulting program takes 2.0 seconds, versus 1.8 for the previous version. This is still a quite significant improvement over the original program (down from 3.2 seconds, 37% faster), and represents a good compromise for many programs.

Fourth version: Associative Array (AA) lookup optimization

The splitter / Appender combo gave significant performance improvement while retaining the simplicity of the original code. However, the program can still be faster. GC profiling indicates storage is still being allocated and reclaimed. The source of the allocations is the following two lines in the inner loop:

    string key = fields.data[keyFieldIndex].to!string;
    sumByKey[key] += fields.data[valueFieldIndex].to!long;

The first line converts fields.data.[keyFieldIndex], a char[], to a string. The string type is immutable, char[] is not, forcing the conversion to make a copy. This is both necessary and required by the associative array. The characters in the fields.data buffer are valid only while the current line is processed. They will be overwritten when the next line is read. Therefore, the characters forming the key need to be copied when added to the associative array. The associative array enforces this by requiring immutable keys.

While it is necessary to store the key as an immutable value, it is not necessary to use immutable values to retrieve existing entries. This creates the opportunity for an improvement: only copy the key when creating the initial entry. Here’s a change to the same lines that does this:

    char[] key = fields.data[keyFieldIndex];
    long fieldValue = fields.data[valueFieldIndex].to!long;

    if (auto sumValuePtr = key in sumByKey) *sumValuePtr += fieldValue;
    else sumByKey[key.to!string] = fieldValue;

The expression key in sumByKey returns a pointer to the value in the hash table, or null if the key was not found. If an entry was found, it is updated directly, without copying the key. Updating via the returned pointer avoids a second associative array lookup. A new string is allocated for a key only the first time it is seen.

The updated program runs in 1.4 seconds, an improvement of 0.6 seconds (30%). GC profiling reflects the change:

$ ./max_column_sum_by_key_v4 --DRT-gcopt=profile:1 ngrams.tsv 1 2
max_key: 2006 sum: 22569013
        Number of collections:  2
        Grand total GC time:  0 milliseconds
GC summary:    5 MB,    2 GC    0 ms

This indicates that unnecessary storage allocation has been eliminated from the main loop.

Note: The program will still allocate and reclaim storage as part of rehashing the associative array. This shows up on GC reports when the number of unique keys is larger.

Early termination of the field iteration loop

The optimizations described so far work by reducing unnecessary storage allocation. These are by far the most beneficial optimizations discussed in this document. Another small but obvious enhancement would be to break out of the field iteration loops after all needed fields have been processed. In version 2, using splitter, the inner loop becomes:

    foreach (i, field; line.splitter(delim).enumerate)
    {
        if (i == keyFieldIndex) key = field.to!string;
        if (i == valueFieldIndex) value = field.to!long;
        if (i == maxFieldIndex)
        {
            allFound = true;
            break;
        }
    }

This produced a 0.1 second improvement. A small gain, but will be larger in use cases excluding a larger number of fields.

The same optimization can be applied to the splitter / Appender combo. The D standard library provides a convenient way to do this: the take method. It returns an input range with at most N elements, effectively short circuiting traversal. The change is to the fields.put(line.splitter(delim)) line:

    import std.range : take;
    ...
    fields.put(line.splitter(delim).take(maxFieldIndex + 1));

Putting it all together

The final version of our program is below, adding take for early field iteration termination to version 4 (splitter, Appender, associative array optimization). For a bit more speed, drop Appender and use the manual field iteration loop shown in version two (version 5 in the results table at the end of this article).

max_column_sum_by_key_v4b.d

int main(string[] args)
{
    import std.algorithm : max, maxElement, splitter;
    import std.array : appender;
    import std.conv : to;
    import std.range : take;
    import std.stdio;

    if (args.length < 4)
    {
        writefln ("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    string delim = "\t";

    long[string] sumByKey;
    auto fields = appender!(char[][])();

    foreach(line; filename.File.byLine)
    {
        fields.clear;
        fields.put(line.splitter(delim).take(maxFieldIndex + 1));
        if (maxFieldIndex < fields.data.length)
        {
            char[] key = fields.data[keyFieldIndex];
            long fieldValue = fields.data[valueFieldIndex].to!long;

            if (auto sumValuePtr = key in sumByKey) *sumValuePtr += fieldValue;
            else sumByKey[key.to!string] = fieldValue;
        }
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}

Summary

This exercise demonstrates several straightforward ways to speed up command line programs. The common theme: avoid unnecessary storage allocation and data copies. The results are dramatic, more than doubling the speed of an already quick program. They are also a reminder of the crucial role memory plays in high performance applications.

Of course, these themes apply to many applications, not just command line tools. They are hardly specific to the D programming language. However, several of D’s features proved especially well suited to minimizing both storage allocation and data copies. This includes ranges, dynamic arrays, and slices, which are related concepts, and lazy algorithms, which operate on them. All were used in the programming exercise.

The table below compares the running times of each of the programs tested:

Program What Time(sec)
Python Program Run with Python2 12.6
Python Program Run with PyPy 2.4
D version 1 Using split 3.2
D version 2 Replace split with splitter 1.8
D version 3 splitter/Appender combo 2.0
D version 4 splitter/Appender, AA optimization 1.4
D version 4b Version 4 plus take 1.3
D version 5 splitter, AA optimization, loop exit 1.1

The author thanks Ali Çehreli, Steven Schveighoffer, and Steve Schneider for providing valuable input to this article.

Serialization in D

Posted on

Vladimir Panteleev has spent over a decade using and contributing to D. He is the creator and maintainer of DFeed, the software powering the D forums, has made numerous contributions to Phobos, DRuntime, DMD, and the D website, and has created several tools useful for maintaining D software (like Digger and Dustmite).


A few days ago, I saw this blog post by Justin Turpin on the front page of Hacker News:

The Grass is Always Greener – My Struggles with Rust

This was an interesting coincidence in that it occurred during DConf, where I had mentioned serialization in D a few times during my talk. Naturally, I was curious to see how D stands up to this challenge.

The Task

Justin’s blog starts off with the following Python code:

import configparser
config = ConfigParser()
config.read("config.conf")

This is actually very similar to a pattern I use in many of my D programs. For example, DFeed (the software behind forum.dlang.org), has this code for configuring its built-in web server:

struct ListenConfig
{
    string addr;
    ushort port = 80;
}

struct Config
{
    ListenConfig listen;
    string staticDomain = null;
    bool indexable = false;
}
const Config config;

import ae.utils.sini;
shared static this() { config = loadIni!Config("config/web.ini"); }

This is certainly more code than the Python example, but that’s only the case because I declare the configuration as a D type. The loadIni function then accepts the type as a template parameter and returns an instance of it. The strong typing makes it easier to catch typos and other mistakes in the configuration – an unknown field or a non-numeric value where a number is expected will immediately result in an error.

On the last line, the configuration is saved to a global by a static constructor (shared indicates it runs once during program initialization, instead of once per thread). Even though loadIni‘s return type is mutable, D allows the implicit conversion to const because, as it occurs in a static constructor, it is treated as an initialization.

Traits

The Rust code from Justin’s blog is as follows:

#[macro_use]
extern crate serde_derive;
extern crate toml;

#[derive(Deserialize)]
struct MyConfiguration {
  jenkins_host: String,
  jenkins_username: String,
  jenkins_token: String
}

fn gimme_config(some_filename: &str) -> MyConfiguration {
  let mut file = File::open(some_filename).unwrap();
  let mut s = String::new();
  file.read_to_string(&mut s).unwrap();
  let my_config: MyConfiguration = toml::from_str(s).unwrap();
  my_config
}

The first thing that jumps out to me is that the MyConfiguration struct is annotated with #[derive(Deserialize)]. It doesn’t seem optional, either – quoting Justin:

This was something that actually really discouraged me upon learning, but you cannot implement a trait for an object that you did not also create. That’s a significant limitation, and I thought that one of the main reason Rust decided to go with Traits and Structs instead of standard classes and inheritance was for this very reason. This limitation is also relevant when you’re trying to serialize and deserialize objects for external crates, like a MySQL row.

D allows introspecting the fields and methods of any type at compile-time, so serializing third-party types is not an issue. For example (and I’ll borrow a slide from my DConf talk), deserializing one struct field from JSON looks something like this:

string jsonField = parseJsonString(s);
enforce(s.skipOver(":"), ": expected");

bool found;
foreach (i, ref field; v.tupleof)
{
    enum name = __traits(identifier, v.tupleof[i]);
    if (name == jsonField)
    {
        field = jsonParse!(typeof(field))(s);
        found = true;
        break;
    }
}
enforce(found, "Unknown field " ~ jsonField);

Because the foreach aggregate is a tuple (v.tupleof is a tuple of v‘s fields), the loop will be unrolled at compile time. Then, all that’s left to do is compare each struct field with the field name we got from the JSON stream and, if it matches, read it in. This is a minimal example that can be improved e.g. by replacing the if statements with a switch, which allows the compiler to optimize the string comparisons to hash lookups.

That’s not to say D lacks means for adding functionality to existing types. Although D does not have struct inheritance like C++ or struct traits like Rust, it does have:

  • alias this, which makes wrapping types trivial;
  • opDispatch, allowing flexible customization of forwarding;
  • template mixins, which allow easily injecting functionality into your types;
  • finally, there is of course classic OOP inheritance if you use classes.

Ad-lib and Error Handling

It doesn’t always make sense to deserialize to a concrete type, such as when we only know or care about a small part of the schema. D’s standard JSON module, std.json, currently only allows deserializing to a tree of variant-like types (essentially a DOM). For example:

auto config = readText("config.json").parseJSON;
string jenkinsServer = config["jenkins_server"].str;

The code above is the D equivalent of the code erickt posted on Hacker News:

let config: Value = serde::from_reader(file)
    .expect("config has invalid json");

let jenkins_server = config.get("jenkins_server")
    .expect("jenkins_server key not in config")
    .as_str()
    .expect("jenkins_server key is not a string");

As D generally uses exceptions for error handling, the checks that must be done explicitly in the Rust example are taken care of by the JSON library.

Final thoughts

In the discussion thread for Justin’s post, Reddit user SilverWingedSeraph writes:

You’re comparing a systems language to a scripting language. Things are harder in systems programming because you have more control over, in this case, the memory representation of data. This means there is more friction because you have to specify that information.

This struck me as a false dichotomy. There is no reason why a programming language which has the necessary traits to be classifiable as a system programming language can not also provide the convenience of scripting languages to the extent that it makes sense to do so. For example, D provides type inference and variant types for when you don’t care about strong typing, and garbage collection for when you don’t care about object lifetime, but also provides the tools to get down to the bare metal in the parts of the code where performance matters.

For my personal projects, I’ve greatly enjoyed D’s capability of allowing rapidly prototyping a design, then optimizing the performance-critical parts as needed without having to use a different language to do so.

See also