Faster Command Line Tools in D


Jon Degenhardt is a member of eBay’s search team focusing on recall, ranking, and search engine design. He is also the author of eBay’s TSV utilities, an open source data mining toolkit for delimited files. The toolkit is written in D and is quite fast. Much of its performance derives from approaches like those described here.

This post will show how a few simple D programming constructs can turn an already fast command line tool into one that really screams, and in ways that retain the inherent simplicity of the original program. The techniques used are applicable to many programming problems, not just command line tools. This post describes how these methods work and why they are effective. A simple programming exercise is used to illustrate these optimizations. Applying the optimizations cuts the run-time by more than half.

Task: Aggregate values in a delimited file based on a key

It’s a common programming task: take a data file with fields separated by a delimiter (comma, tab, etc.), and run a mathematical calculation involving several of the fields. Often these programs are one-time-use scripts; other times they have a longer shelf life. Speed is, of course, appreciated when the program is used more than a few times on large files.

The specific exercise we’ll explore starts with files having keys in one field, integer values in another. The task is to sum the values for each key and print the key with the largest sum. For example:

```
A   4
B   5
B   8
C   9
A   6
```

With the first field as key, second field as value, the key with the max sum is `B`, with a total of 13.

Fields are delimited by a TAB, and there may be any number of fields on a line. The file name and field numbers of the key and value are passed as command line arguments. Below is a Python program to do this:

max_column_sum_by_key.py

```
#!/usr/bin/env python

import argparse
import collections

def main():
    parser = argparse.ArgumentParser(description='Sum a column.')
    parser.add_argument('file', type=argparse.FileType('r'))
    parser.add_argument('key_field_index', type=int)
    parser.add_argument('value_field_index', type=int)

    args = parser.parse_args()
    delim = '\t'

    max_field_index = max(args.key_field_index, args.value_field_index)
    sum_by_key = collections.Counter()

    for line in args.file:
        fields = line.rstrip('\n').split(delim)
        if max_field_index < len(fields):
            sum_by_key[fields[args.key_field_index]] += int(fields[args.value_field_index])

    max_entry = sum_by_key.most_common(1)
    if len(max_entry) == 0:
        print 'No entries'
    else:
        print 'max_key:', max_entry[0][0], 'sum:', max_entry[0][1]

if __name__ == '__main__':
    main()
```

(Note: For brevity, error handling is largely omitted from programs shown.)

The program follows a familiar paradigm. A dictionary (`collections.Counter`) holds the cumulative sum for each key. The file is read one line at a time, splitting each line into an array of fields. The key and value are extracted. The value field is converted to an integer and added to the cumulative sum for the key. After the program processes all of the lines, it extracts the entry with the largest value from the dictionary.
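As a minimal standalone sketch of this pattern (using the example data from the task description, not the full program):

```python
import collections

# Accumulate sums per key, as the program does with each parsed line.
sum_by_key = collections.Counter()
for key, value in [("A", 4), ("B", 5), ("B", 8), ("C", 9), ("A", 6)]:
    sum_by_key[key] += value

# most_common(1) returns a list holding the entry with the largest sum.
assert sum_by_key.most_common(1) == [("B", 13)]
```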

The D program, first try

It’s a common way to explore a new programming language: write one of these simple programs and see what happens. Here’s a D version of the program, using perhaps the most obvious approach:

max_column_sum_by_key_v1.d

```
int main(string[] args)
{
    import std.algorithm : max, maxElement;
    import std.array : split;
    import std.conv : to;
    import std.stdio;

    if (args.length < 4)
    {
        writefln("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    char delim = '\t';

    long[string] sumByKey;

    foreach (line; filename.File.byLine)
    {
        auto fields = line.split(delim);
        if (maxFieldIndex < fields.length)
        {
            string key = fields[keyFieldIndex].to!string;
            sumByKey[key] += fields[valueFieldIndex].to!long;
        }
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}
```

Processing is basically the same as the Python program. An associative array (`long[string] sumByKey`) holds the cumulative sum for each key. Like the Python program, it splits each line into an array of fields, extracts the key and value fields, and updates the cumulative sum. Finally, it retrieves and prints the entry with the maximum value.

We will measure performance using an ngram file from the Google Books project: googlebooks-eng-all-1gram-20120701-0 (`ngrams.tsv` in these runs). This file is 10.5 million lines, 183 MB. Each line has four fields: the ngram, year, total occurrence count, and the number of books the ngram appeared in. Visit the ngram viewer dataset page for more information. The file chosen is for unigrams starting with the digit zero. Here are a few lines from the file:

```
0       1898    114067  6140
0       1906    208805  7933
0       1922    204995  9042
0.5     1986    143398  13938
0.5     1999    191449  19262
```

The `year` (second column) is used as the key, and the `total occurrence count` (third column) as the value. There are 414 distinct years in the data file.

The LDC compiler is used to build the D programs, as it generates fast code:

`$ ldc2 -release -O max_column_sum_by_key_v1.d`

Here are the commands to perform the task:

```
$ max_column_sum_by_key.py ngrams.tsv 1 2   # Python program
max_key: 2006 sum: 22569013

$ max_column_sum_by_key_v1 ngrams.tsv 1 2   # D program
max_key: 2006 sum: 22569013
```

(Note: These programs use field numbers starting at zero.)

The `time` command was used to measure performance, e.g. `$ time max_column_sum_by_key.py ngrams.tsv 1 2`. On the author’s MacBook Pro, the Python version takes 12.6 seconds; the D program takes 3.2 seconds. This makes sense, as the D program is compiled to native code. But suppose we run the Python program with PyPy, a just-in-time Python compiler? This gives a result of 2.4 seconds, actually beating the D program, with no changes to the Python code. Kudos to PyPy, this is an impressive result. But we can still do better with our D program.

Second version: Using `splitter`

The first key to improved performance is to switch from `split` to `splitter`. The `split` function is “eager”: it constructs and returns the entire array of fields, and the storage for that array eventually needs to be deallocated. `splitter` is “lazy”: it returns an input range that iterates over the fields one at a time. We can take advantage of this by not constructing the array at all, instead holding a single field at a time in a reused local variable. Here is an augmented program that does this, the main change being the introduction of an inner loop iterating over each field:
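The eager/lazy difference can be seen in a small standalone example (not part of the original programs): `split` materializes an array, while `splitter` yields one field at a time.

```d
void main()
{
    import std.algorithm : splitter;
    import std.array : split;

    char[] line = "A\t4\t9".dup;

    auto eager = line.split('\t');      // allocates a new char[][] array
    assert(eager.length == 3);

    auto fields = line.splitter('\t');  // lazy input range; no array is built
    assert(fields.front == "A");        // fields are produced one at a time
    fields.popFront();
    assert(fields.front == "4");
}
```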

max_column_sum_by_key_v2.d

```
int main(string[] args)
{
    import std.algorithm : max, maxElement, splitter;
    import std.conv : to;
    import std.range : enumerate;
    import std.stdio;

    if (args.length < 4)
    {
        writefln("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    string delim = "\t";

    long[string] sumByKey;

    foreach (line; filename.File.byLine)
    {
        string key;
        long value;
        bool allFound = false;

        foreach (i, field; line.splitter(delim).enumerate)
        {
            if (i == keyFieldIndex) key = field.to!string;
            if (i == valueFieldIndex) value = field.to!long;
            if (i == maxFieldIndex) allFound = true;
        }

        if (allFound) sumByKey[key] += value;
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}
```

The modified program is quite a bit faster, running in 1.8 seconds, a 44% improvement. Insight into what changed can be seen by using the `--DRT-gcopt=profile:1` command line option. This turns on garbage collection profiling, shown below (output edited for brevity):

```
$ max_column_sum_by_key_v1 --DRT-gcopt=profile:1 ngrams.tsv 1 2
max_key: 2006 sum: 22569013
Number of collections:  132
Grand total GC time:  246 milliseconds
GC summary:   35 MB,  132 GC  246 ms

$ max_column_sum_by_key_v2 --DRT-gcopt=profile:1 ngrams.tsv 1 2
max_key: 2006 sum: 22569013
Number of collections:  167
Grand total GC time:  101 milliseconds
GC summary:    5 MB,  167 GC  101 ms
```

(Note: The `--DRT-gcopt=profile:1` parameter is invisible to normal option processing.)

The reports show two key differences. One is the maximum pool memory, the first value shown on the “GC summary” line. The significantly lower value indicates less memory is being allocated. The other is the total time spent in collections. The 145ms improvement accounts for only a small portion of the 1.4 seconds shaved off by the second version; however, there are other costs associated with storage allocation beyond collection time. Allocating and reclaiming storage has a cost in any memory management system, not just those using garbage collection.

Also worth mentioning is the role D’s slices play. When `splitter` returns the next field, it is not returning a copy of characters in the line. Instead, it is returning a “slice”. The data type is a `char[]`, which is effectively a pointer to a location in the input line and a length. No characters have been copied. When the next field is fetched, the variable holding the slice is updated (pointer and length), a faster operation than copying a variable-length array of characters. This is a remarkably good fit for processing delimited files, as identifying the individual fields can be done without copying the input characters.
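A tiny standalone example (not from the original post) makes the aliasing behavior of slices concrete:

```d
void main()
{
    char[] line = "2006\t22569013".dup;
    char[] field = line[0 .. 4];     // a slice: pointer + length, no copy
    assert(field == "2006");
    assert(field.ptr == line.ptr);   // the slice aliases the line's buffer
    line[0] = 'X';
    assert(field == "X006");         // writes to the line show through
}
```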

Third version: The `splitter` / `Appender` combo

Switching to `splitter` was a big speed win, but came with a less convenient programming model. Extracting specific fields while iterating over them is cumbersome, more so as additional fields are needed. Fortunately, the simplicity of random access arrays can be reclaimed by using an `Appender`. Here is a revised program:

max_column_sum_by_key_v3.d

```
int main(string[] args)
{
    import std.algorithm : max, maxElement, splitter;
    import std.array : appender;
    import std.conv : to;
    import std.stdio;

    if (args.length < 4)
    {
        writefln("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    string delim = "\t";

    long[string] sumByKey;
    auto fields = appender!(char[][])();

    foreach (line; filename.File.byLine)
    {
        fields.clear;
        fields.put(line.splitter(delim));
        if (maxFieldIndex < fields.data.length)
        {
            string key = fields.data[keyFieldIndex].to!string;
            sumByKey[key] += fields.data[valueFieldIndex].to!long;
        }
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}
```

The `Appender` instance in this program works by keeping a growable array of `char[]` slices. The lines:

```
fields.clear;
fields.put(line.splitter(delim));
```

at the top of the `foreach` loop do the work. The statement `fields.put(line.splitter(delim))` iterates over the fields one at a time, appending each slice to the array. Storage is allocated on the first input line. On subsequent lines, `fields.clear` comes into play: it resets the underlying store’s length to zero without deallocating it, so appending starts again at position zero, reusing the storage allocated for earlier lines. This regains the simplicity of indexing a materialized array. GC profiling shows no change from the previous version of the program.
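The clear-then-reuse cycle can be sketched in isolation (a minimal example, not from the original program):

```d
void main()
{
    import std.array : appender;

    auto fields = appender!(int[])();
    fields.put([10, 20, 30]);        // first use: allocates the backing store
    assert(fields.data == [10, 20, 30]);

    fields.clear;                    // resets length; keeps the allocation
    fields.put([40, 50]);            // appends from position zero again
    assert(fields.data == [40, 50]);
}
```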

Copying the additional slices does incur a performance penalty. The resulting program takes 2.0 seconds, versus 1.8 for the previous version. This is still a significant improvement over the original program (down from 3.2 seconds, 37% faster), and represents a good compromise for many programs.

Fourth version: Associative Array (AA) lookup optimization

The `splitter` / `Appender` combo gave significant performance improvement while retaining the simplicity of the original code. However, the program can still be faster. GC profiling indicates storage is still being allocated and reclaimed. The source of the allocations is the following two lines in the inner loop:

```
string key = fields.data[keyFieldIndex].to!string;
sumByKey[key] += fields.data[valueFieldIndex].to!long;
```

The first line converts `fields.data[keyFieldIndex]`, a `char[]`, to a `string`. The `string` type is immutable, `char[]` is not, so the conversion must make a copy. The copy is necessary: the characters in the `fields.data` buffer are valid only while the current line is being processed, and will be overwritten when the next line is read. Therefore, the characters forming the key must be copied when the key is added to the associative array, and the associative array enforces this by requiring immutable keys.

While it is necessary to store the key as an immutable value, it is not necessary to use immutable values to retrieve existing entries. This creates the opportunity for an improvement: only copy the key when creating the initial entry. Here’s a change to the same lines that does this:

```
char[] key = fields.data[keyFieldIndex];
long fieldValue = fields.data[valueFieldIndex].to!long;

if (auto sumValuePtr = key in sumByKey) *sumValuePtr += fieldValue;
else sumByKey[key.to!string] = fieldValue;
```

The expression `key in sumByKey` returns a pointer to the value in the hash table, or `null` if the key was not found. If an entry was found, it is updated directly, without copying the key. Updating via the returned pointer avoids a second associative array lookup. A new string is allocated for a key only the first time it is seen.
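As a minimal standalone sketch of the idiom (the `addValue` helper is hypothetical, introduced only for illustration):

```d
void main()
{
    long[string] sumByKey;

    void addValue(char[] key, long value)
    {
        // 'in' yields a pointer to the stored value, or null if absent.
        if (auto sumValuePtr = key in sumByKey) *sumValuePtr += value;
        else sumByKey[key.idup] = value;   // copy the key only the first time
    }

    addValue("2006".dup, 5);
    addValue("2006".dup, 7);
    assert(sumByKey.length == 1);
    assert(sumByKey["2006"] == 12);
}
```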

The updated program runs in 1.4 seconds, an improvement of 0.6 seconds (30%). GC profiling reflects the change:

```
$ ./max_column_sum_by_key_v4 --DRT-gcopt=profile:1 ngrams.tsv 1 2
max_key: 2006 sum: 22569013
Number of collections:  2
Grand total GC time:  0 milliseconds
GC summary:    5 MB,    2 GC    0 ms
```

This indicates that unnecessary storage allocation has been eliminated from the main loop.

Note: The program will still allocate and reclaim storage as part of rehashing the associative array. This shows up on GC reports when the number of unique keys is larger.

Early termination of the field iteration loop

The optimizations described so far work by reducing unnecessary storage allocation, and they are by far the most beneficial optimizations discussed in this document. Another small but obvious enhancement is to break out of the field iteration loop once all needed fields have been processed. In version 2, using `splitter`, the inner loop becomes:

```
foreach (i, field; line.splitter(delim).enumerate)
{
    if (i == keyFieldIndex) key = field.to!string;
    if (i == valueFieldIndex) value = field.to!long;
    if (i == maxFieldIndex)
    {
        allFound = true;
        break;
    }
}
```

This produced a 0.1 second improvement. A small gain here, but the gain will be larger in use cases where a larger number of trailing fields are excluded.

The same optimization can be applied to the `splitter`/`Appender` combo. The D standard library provides a convenient way to do this: the `take` function. It returns an input range with at most N elements, effectively short-circuiting the traversal. The change is to the `fields.put(line.splitter(delim))` line:

```
import std.range : take;
...
fields.put(line.splitter(delim).take(maxFieldIndex + 1));
```
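A short standalone check (not from the original program) of how `take` caps the traversal:

```d
void main()
{
    import std.algorithm : equal, splitter;
    import std.range : take;

    // With take(2), fields past the second are never visited.
    auto fields = "year\tcount\textra1\textra2".splitter('\t').take(2);
    assert(fields.equal(["year", "count"]));
}
```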

Putting it all together

The final version of our program is below, adding `take` for early field iteration termination to version 4 (`splitter`, `Appender`, associative array optimization). For a bit more speed, drop `Appender` and use the manual field iteration loop shown in version two (version 5 in the results table at the end of this article).

max_column_sum_by_key_v4b.d

```
int main(string[] args)
{
    import std.algorithm : max, maxElement, splitter;
    import std.array : appender;
    import std.conv : to;
    import std.range : take;
    import std.stdio;

    if (args.length < 4)
    {
        writefln("synopsis: %s filename keyfield valuefield", args[0]);
        return 1;
    }

    string filename = args[1];
    size_t keyFieldIndex = args[2].to!size_t;
    size_t valueFieldIndex = args[3].to!size_t;
    size_t maxFieldIndex = max(keyFieldIndex, valueFieldIndex);
    string delim = "\t";

    long[string] sumByKey;
    auto fields = appender!(char[][])();

    foreach (line; filename.File.byLine)
    {
        fields.clear;
        fields.put(line.splitter(delim).take(maxFieldIndex + 1));
        if (maxFieldIndex < fields.data.length)
        {
            char[] key = fields.data[keyFieldIndex];
            long fieldValue = fields.data[valueFieldIndex].to!long;

            if (auto sumValuePtr = key in sumByKey) *sumValuePtr += fieldValue;
            else sumByKey[key.to!string] = fieldValue;
        }
    }

    if (sumByKey.length == 0) writeln("No entries");
    else
    {
        auto maxEntry = sumByKey.byKeyValue.maxElement!"a.value";
        writeln("max_key: ", maxEntry.key, " sum: ", maxEntry.value);
    }
    return 0;
}
```

Summary

This exercise demonstrates several straightforward ways to speed up command line programs. The common theme: avoid unnecessary storage allocation and data copies. The results are dramatic, more than doubling the speed of an already quick program. They are also a reminder of the crucial role memory plays in high performance applications.

Of course, these themes apply to many applications, not just command line tools. They are hardly specific to the D programming language. However, several of D’s features proved especially well suited to minimizing both storage allocation and data copies. This includes ranges, dynamic arrays, and slices, which are related concepts, and lazy algorithms, which operate on them. All were used in the programming exercise.

The table below compares the running times of each of the programs tested:

| Program | What | Time (sec) |
| --- | --- | --- |
| Python program | Run with Python 2 | 12.6 |
| Python program | Run with PyPy | 2.4 |
| D version 1 | Using `split` | 3.2 |
| D version 2 | Replace `split` with `splitter` | 1.8 |
| D version 3 | `splitter`/`Appender` combo | 2.0 |
| D version 4 | `splitter`/`Appender`, AA optimization | 1.4 |
| D version 4b | Version 4 plus `take` | 1.3 |
| D version 5 | `splitter`, AA optimization, loop exit | 1.1 |

The author thanks Ali Çehreli, Steven Schveighoffer, and Steve Schneider for providing valuable input to this article.

21 thoughts on “Faster Command Line Tools in D”

1. time (zcat googlebooks-eng-all-1gram-20120701-0.gz | awk '{ sum[$2]+=$3 } END { for(x in sum) { print x, sum[x] } }' | sort -n -k 2 | tail -n 1)
2006 22569013

real 0m4.320s
user 0m5.476s
sys 0m0.156s

1. Martin Cohen says:

My thoughts exactly!

You could also use the gawk builtin sort functions, although I always have to use the man page when I use them.

Just awesome.

2. Andrea says:

I think you have a spurious semicolon in line 24 of the Python example.

1. Jon Degenhardt says:

Good catch, thanks Andrea!

3. David says:

very nice 🙂 !
Just a question of a newbie: Lets assume I have the code of max_column_sum_by_key_v1.d at hand and the compiler option –DRT-gcopt=profile:1 indicates that the GC is “heavily” used. However, I do not know in which parts or even lines the GC is used. So I am wondering if there is a (profiling) tool to figure out which part/lines of the code trigger the GC?

1. Michael Parker says:

You can pass -vgc to get a list of where potential GC allocations may occur. This is shown in the second post of the GC series on this blog.

1. Jon Degenhardt says:

Thanks David!

Yes, the -vgc approach Michael suggests is what I typically do. Then checking documentation of the specific library functions involved is usually enough to identify causes.

–Jon

4. David says:

Many thanks Michael and Jon for your fast and helpful replies! -vgc is handy indeed 🙂 – nice stuff!

1. Jon Degenhardt says:

Hi Denis,

Thanks for the reading. The code you posted is addressing a more narrow set of use cases than intended in the blog post. The more narrow case is certainly legitimate and interesting, but it is different and not comparable. In particular, the blog post assumes arbitrary strings as keys, as illustrated in the example file shown in the task description. Your code makes two assumptions that are more narrow. First, that keys are integers, and second, that the range of integer values is known in advance and is small enough to fit into an array. These are very significant differences.

–Jon

1. Denis says:

I was misguided, sorry.
Here is the new “full” version: https://github.com/sirkon/snippets/blob/master/main.cpp

A couple of third-party libraries are required now:
2) boost::program_options

I am using ubuntu and it is easy here:

sudo apt install libsparsehash-dev libboost-program-options-dev

Can be a little more troubling on mac and even more so on windows.

Compilation:
g++ -Wall -std=c++1z -O3 main.cpp -lboost_program_options

It takes about 0.3s to complete on my machine.

1. Denis says:

Parallelization at the level of a single log barely makes sense:
1) You can’t split a compressed stream as easily.
2) You would rather parallelize by parsing different files in parallel than by trying to parse a single one in parallel.

In other words, all the techniques used here are from real-world practice. Parallelizing the parsing of a single source is totally artificial.

2. Jon Degenhardt says:

Hi Mike,

It’s really quite cool that my post inspired all the follow-ups. I hadn’t seen the Kotlin post, thanks for bringing it to my attention. The Kotlin post is rather nice in that it shows a step-by-step approach to analyzing performance bottlenecks and optimizing accordingly, something that can benefit programmers beyond this one example.

I don’t plan to update the blog post to introduce new techniques. There are of course ways to speed up the program further. Indeed, D makes it easy to write the same low-level code one can write in C, with equivalent performance. However, employing such low-level techniques would not fit the objective of the post, which was to gain efficiency while retaining simplicity.

Of course, parallelization is an important technique for high level programs also. I can understand your suggestion to add it to the post. However, I believe there are much better use cases for illustrating the techniques involved. My experience is that parallelization is often not a good choice for command line tools. An example where the benefits don’t have such qualifications would be better.

–Jon

1. Denis says:

I thought one would use ClickHouse/Vertica/Spark for simplicity in this kind of task.

Specifically in this case, the upload into ClickHouse took about 4 seconds (about 2.5 million inserts per second), where I mapped the columns as

1st -> name
2nd -> event
3rd -> count
4th -> weight

and then simple query

SELECT argMax(event, val), max(val) FROM (
SELECT event, sum(count) AS val FROM google_tsv GROUP BY event
);

completed in 0.09 seconds.
