Report a bug
If you spot a problem with this page, click here to create a Bugzilla issue.
Improve this page
Quickly fork, edit online, and submit a pull request for this page. Requires a signed-in GitHub account. This works well for small changes. If you'd like to make larger changes you may want to consider using a local clone.

Regular Expressions

by Dmitry Olshansky, the author of std.regex

Introduction

String processing is a daily routine that most applications have to deal with in a one way or another. It should come as no surprise that many programming languages have standard libraries equipped with a variety of specialized functions for common string manipulation needs. The D programming language standard library among others offers a nice assortment in std.string, as well as generic functions from std.algorithm that work with strings. Still no amount of fixed functionality could cover all needs, as naturally flexible text data needs flexible solutions.

This is where regular expressions, often succinctly called regexes, come in handy. Regexes are simple yet powerful language for defining patterns for sets of strings. Combined with pattern matching, data extraction and substitution, they form a Swiss Army knife of text processing. They are considered so important that a number of programming languages provide built-in support for regular expressions. Being built-in however does not necessary imply faster processing or having more features. It's just a matter of providing convenient and friendly syntax for typical operations, and integrating it well.

The D programming language provides a standard library module std.regex. Being a highly expressive systems language, D allows regexes to be implemented efficiently within the language itself, yet have good level of readability and usability. And there a few things a pure D implementation brings to the table that are completely unbelievable in a traditional compiled language, more on that at the end of article.

By the end of article you'll have a good understanding of regular expression capabilities in this library, and how to utilize its API in a most straightforward and efficient way. Examples in this article assume that the reader has fair understanding of regex elements, but it's not required.

A warm up

How do you check if something is a phone number by looking at it?

Yes, it's something with numbers, and there may be a country code in front of that... Sticking to an international format should make it more strict. As this is the first time, let's put together a full program:

+1 555 123 4567
import std.stdio, std.regex;

void main()
{
    string phone = readln();  // assume phone is passed as the first line of stdin
    if (matchFirst(phone, r"^\+[1-9][0-9]* [0-9 ]*$"))
        writeln("It looks like a phone number.");
    else
        writeln("Nope, it's not a phone number.");
}

And that's it! Let us however keep in mind the boundaries of regular expressions power - to truly establish a validness of a phone number, one has to try dialing it or contact the authority.

Let's drill down into this tiny example because it actually showcases a lot of interesting things:

Continuing with the phone number example, it would be useful to get the exact value of the country code, as well as the whole number. For the sake of experiment let's also explicitly obtain compiled regex pattern via regex to see how it works.

import std.regex;

string phone = "+31 650 903 7158";  // fictional, any coincidence is just that
auto phoneReg = regex(r"^\+([1-9][0-9]*) [0-9 ]*$");
auto m = matchFirst(phone, phoneReg);
assert(m); // also boolean context - test for non-empty
assert(!m.empty); // same as the line above
assert(m[0] == "+31 650 903 7158");
assert(m[1] == "31");
// you shouldn't need the regex object type all too often
// but here it is for the curious
static assert(is(typeof(phoneReg) : Regex!char));

Search and replace

While getting first match is a common theme in string validation, another frequent need is to extract all matches found in a piece of text. Picking an easy task, let's see how to filter out all white space-only lines. There is no special routine for looping over input like search() or similar as found in some libraries. Instead std.regex provides a natural syntax for looping via plain foreach.

auto buffer = std.file.readText("regex.d");
foreach (m; matchAll(buffer, regex(r"^.*[^\p{WhiteSpace}]+.*$","m")))
{
    writeln(m.hit);  // hit is an alias for m[0]
}

It may look and feel like a built-in but it just follows the common conventions to do that. In this case matchAll returns an object that follows the right "protocol" of an input range simply by having the right set of methods. An input range is a lot like an iterator found in other languages. Likewise the result of matchFirst and each element of matchAll is a random access range, a thing that behaves like a "view" of an array.

import std.regex;

auto m = matchAll("Ranges are hot!", r"(\w)\w+(\w)");  // at least 3 "word" symbols
assert(m.front[0] == "Ranges"); // front - first of input range
// m.captures is a historical alias for the first element of match range (.front).
assert(m.captures[1] == m.front[1]);
auto sub = m.front;
assert(sub[2] == "s");
foreach (item; sub)
    writeln(item);  // will show lines: Ranges, R, s

By playing by the rules std.regex gets some nice benefits in interaction with other modules e.g. this is how one could count non-empty lines in a text buffer:

import std.algorithm, std.file, std.regex;

auto buffer = std.file.readText(r"std\typecons.d");
int count = count(matchAll(buffer, regex(r"^.*\P{WhiteSpace}+.*$", "m")));

A seasoned regex user catches instantly that Unicode properties are supported with perl-style \p{xxx}, to spice that all of Scripts and Blocks are supported as well. Let us duly note that \P{xxx} means not having an xxx property, i.e. here not a white space character. Unicode is a vital subject to know, and it won't suffice to try to cover it here. For details see the accessible std.uni documentation and level 1 of conformance as per Unicode standard UTS 18.

Another thing of importance is the option string - "m", where m stands for multi-line mode. Historically utilities that supported regex patterns (unix grep, sed, etc.) processed text line by line. At that time anchors like ^, $ meant the start of the whole input buffer that has been same as that of the line. As regular expressions got more ubiquitous the need to recognize multiple lines in a chunk of text became apparent. In such a mode with anchors ^ & $ were defined to match before and after line break literally. For the curious, modern (Unicode) line break is defined as (\n | \v | \r | \f | \u0085 | \u2028 | \u2029 | \r\n). Needless to say, one need not turn on multi-line mode if not using any of ^, $.

Replacing

Now that search was covered, let's do some replacements. For instance, to replace all dates with "MM/dd/YYYY" format in a text to a sortable format of "YYYY-MM-dd":

import std.regex;

auto text = "2/31/1998";
auto replaced = replaceAll(text, r"([0-9]{1,2})/([0-9]{1,2})/([0-9]{4})".regex, "$3-$1-$2");
assert(replaced == "1998-2-31");

r"pattern".regex is just another notation of writing regex("pattern") called UFCS that some may find more slick. As can be seen the replacement is controlled by a format string not unlike one in C's famous printf. The $1, $2, $3 are substituted with the content of sub-expressions. Aside from referencing sub-matches, one can include the whole part of input preceding the match via $` and $' for the content following right after the match.

Now let's aim for something bigger, this time to show that std.regex can do things that are unattainable by classic textual substitution alone. Imagine you want to translate a web shop catalog so that it displays prices in your currency. Yes, one can use calculator or estimate it in his/her head, given the current ratio. Being programmers we can do better, so let's wrap up a simple program that converts text to use correct prices everywhere. For example, UK pounds to US dollars.

Bottles for £1, or £3 for 4.
import std.conv, std.regex, std.range, std.stdio;
import std.string : format;

void main()
{
    immutable ratio = 1.5824;  // UK pounds to US dollar as of this writing
    auto toDollars(Captures!string price)
    {
        real value = to!real(price["integer"]);
        if (!price["fraction"].empty)
            value += 0.01 * to!real(price["fraction"]);
        return format("$%.2f", value * ratio);
    }
    string text = readln();
    auto converted = replaceAll!toDollars(text,
        regex(r"£\s*(?P<integer>[0-9]+)(\.(?P<fraction>[0-9]{2}))?", "g"));
    write(converted);
}

Getting current conversion rates and supporting more currencies is left as an exercise for the reader. What's at work here is called replace with delegate, which is analogous to a callout ability found in other languages and regex libraries. The magic is simple: whenever replaceAll finds a match it calls a user supplied callback toDollars on the captured piece, then it uses the return value to make the replacement.

And I just couldn't resist spicing that example up with yet another feature - named groups. The syntax for a named group is (?P<name>regex). Names work just like aliases for numbers of captured subexpressions, meaning that with the exact same regular expression one could just as well change the affected lines to:

real value = to!real(price[1]);
if (!price[3].empty)
    value += 0.01*to!real(price[3]);

Though that lacks readability and is not as future proof.

Also note that optional captures are still represented, it's just they can be an empty string if not matched.

Split it up

Now core functionality has been presented, let's move on to some extras. Sometimes it's useful to do almost the opposite of searching - split up input using a regex as separator. The following sample outputs text by sentences:

This is sample text! It has several sentences. Did it work?!
import std.regex, std.stdio;

auto text = readln('\0'); // read all stdin
foreach (sentence; splitter(text, regex(r"(?<=[.?!]+(?![?!]))\s*")))
    writeln(sentence);

Again the type splitter returns is a range, thus allowing foreach iteration. Notice the usage of lookaround in the regex, it's a neat trick here as stripping off final punctuation is not our intention. (?=regex) is lookahead and (?<=regex) is lookbehind.

Breaking down this example, the (?<=[.?!]) part looks behind for first ., ? or !. This gets us halfway to our goal because \s* also matches between elements of punctuation like "?!", so a negative lookahead (?!regex) is introduced inside a lookbehind to make sure there are no more punctuation marks.

Admittedly, the barrage of ? and ! makes this regex rather obscure, more then it actually is. Observe that there are no restrictions on the contents of lookaround expressions, one can go for lookahead inside lookbehind and so on. However in general it's recommended to use them sparingly, keeping them as the weapon of last resort.

Static regex

Let's stop going for features and start thinking performance. D has something to offer here. For one, there is an ability to precompile a constant regex at compile-time:

import std.regex;

static r = regex("Boo-hoo");
assert(match("What was that? Boo-hoo?", r));

Importantly it's the exact same Regex object that works through all of the API we've seen so far. It takes next to no time to initialize, just copy over a ready-made structure from the data segment - roughly 1 μs of run-time to initialize on my machine. Conversely, compiling the regex at run-time took around 10-20 μs - and keep in mind that it was a trivial pattern.

Now stepping further in this direction there is an ability to construct specialized D code to match a given regex instead of using the default run-time engine. Isn't it so often the case that code starts with regular expressions only to be later re-written to heaps of scary-looking code to match speed requirements? No need to rewrite - we got you covered.

Recalling the phone number example:

import std.regex;

string phone = "+31 650 903 7158";
// It's only a 5 characters difference!
auto phoneReg = ctRegex!r"^\+([1-9][0-9]*) [0-9 ]*$";
auto m = match(phone, phoneReg);
assert(m);
assert(m.captures[0] == "+31 650 903 7158");
assert(m.captures[1] == "31");

Interestingly, the code looks almost exactly the same (a total of 5 letters changed), yet it does all of the hard work, generating equivalent D code for the regex pattern. ctRegex produces a StaticRegex object which supports the same API as the runtime Regex. This is the key point - the API stays consistent, yet gets us the significant speed up we sought after. This fosters quick development speed using the regex engine, and then, if desired, a decent runtime speed by switching to ctRegex in the release build (at the cost of slower builds).

In this particular case it matched roughly 50% faster for me though I haven't done comprehensive analysis of this case. That being said, there is no doubt ctRegex is going to improve immensely over time. We only scratched the surface of exciting new possibilities. More data on real-world patterns, performance of CT-regex and other benchmarks is here.

Conclusion

The article represents a walk-through of std.regex focused on showcasing the API. By following a series of easy yet meaningful tasks, its features were exposed in combination, that underline the elegance and flexibility of this library solution. The good thing is not only that the API is natural, but it also follows established standards and integrates well with the rest of Phobos. Putting its major features together, std.regex is:

The format of this article is intentionally more of an overview, it doesn't stop to talk in-depth about certain capabilities like case-insensitive matching (simple casefolding rules of Unicode), backreferences, lazy quantifiers. And even more features are coming to add more expressive power and reach greater performance.