
std.uni

Overview

The std.uni module provides an implementation of fundamental Unicode algorithms and data structures. It doesn't include UTF encoding and decoding primitives; for those, see std.utf.decode and std.utf.encode in std.utf.

All primitives listed operate on Unicode characters and sets of characters. For functions which operate on ASCII characters and ignore Unicode characters, see std.ascii. For definitions of Unicode character, code point and other terms used throughout this module see the terminology section below.
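As a small illustration of the difference between the two modules, the sketch below calls the same classification and case-conversion functions from std.ascii and from std.uni. The renamed imports (ascii, uni) are only there to keep the calls apart.

```d
import ascii = std.ascii;
import uni = std.uni;

void main()
{
    // ASCII letters are recognized by both modules
    assert(ascii.isAlpha('q'));
    assert(uni.isAlpha('q'));

    // Cyrillic small letter ya is a letter only as far as std.uni is concerned
    assert(!ascii.isAlpha('я'));
    assert(uni.isAlpha('я'));

    // case conversion follows the same split:
    // std.ascii leaves non-ASCII input untouched
    assert(ascii.toUpper('я') == 'я');
    assert(uni.toUpper('я') == 'Я');
}
```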

The focus of this module is the core needs of developing Unicode-aware applications. To that effect it provides the following optimized primitives:

It's recognized that an application may need further enhancements and extensions, such as less commonly known algorithms, or tailoring existing ones for region specific needs. To help users with building any extra functionality beyond the core primitives, the module provides:

Synopsis

import std.algorithm : find;
import std.uni;
void main()
{
    // initialize code point sets using script/block or property name
    // now 'set' contains code points from both scripts.
    auto set = unicode("Cyrillic") | unicode("Armenian");
    // same thing but simpler and checked at compile-time
    auto ascii = unicode.ASCII;
    auto currency = unicode.Currency_Symbol;

    // easy set ops
    auto a = set & ascii;
    assert(a.empty); // as it has no intersection with ascii
    a = set | ascii;
    auto b = currency - a; // subtract all ASCII, Cyrillic and Armenian

    // some properties of code point sets
    assert(b.length > 45); // 46 items in Unicode 6.1, even more in 6.2
    // testing presence of a code point in a set
    // is just fine, it is O(logN)
    assert(!b['$']);
    assert(!b['\u058F']); // Armenian dram sign
    assert(b['¥']);

    // building fast lookup tables, these guarantee O(1) complexity
    // 1-level Trie lookup table essentially a huge bit-set ~262Kb
    auto oneTrie = toTrie!1(b);
    // 2-level far more compact but typically slightly slower
    auto twoTrie = toTrie!2(b);
    // 3-level even smaller, and a bit slower yet
    auto threeTrie = toTrie!3(b);
    assert(oneTrie['£']);
    assert(twoTrie['£']);
    assert(threeTrie['£']);

    // build the trie with the most sensible trie level
    // and bind it as a functor
    auto cyrillicOrArmenian = toDelegate(set);
    auto balance = find!(cyrillicOrArmenian)("Hello ընկեր!");
    assert(balance == "ընկեր!");
    // compatible with bool delegate(dchar)
    bool delegate(dchar) bindIt = cyrillicOrArmenian;

    // Normalization
    string s = "Plain ascii (and not only), is always normalized!";
    assert(s is normalize(s));// is the same string

    string nonS = "A\u0308ffin"; // 'A' followed by a combining diaeresis
    auto nS = normalize(nonS); // to NFC, the W3C endorsed standard
    assert(nS == "Äffin");
    assert(nS != nonS);
    string composed = "Äffin";

    assert(normalize!NFD(composed) == "A\u0308ffin");
    // to NFKD, compatibility decomposition useful for fuzzy matching/searching
    assert(normalize!NFKD("2¹⁰") == "210");
}

Terminology

The following is a list of important Unicode notions and definitions. Any conventions used specifically in this module alone are marked as such. The descriptions are based on the formal definition as found in chapter three of The Unicode Standard Core Specification.

Abstract character
A unit of information used for the organization, control, or representation of textual data. Note that:

Canonical decomposition
The decomposition of a character or character sequence that results from recursively applying the canonical mappings found in the Unicode Character Database and those described in Conjoining Jamo Behavior (section 12 of Unicode Conformance).

Canonical composition
The precise definition of the Canonical composition is the algorithm as specified in Unicode Conformance section 11. Informally it's the process that does the reverse of the canonical decomposition with the addition of certain rules that e.g. prevent legacy characters from appearing in the composed result.

Canonical equivalent
Two character sequences are said to be canonical equivalents if their full canonical decompositions are identical.

Character
Typically differs by context. For the purpose of this documentation the term character implies encoded character, that is, a code point having an assigned abstract character (a symbolic meaning).

Code point
Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF (hex). Not all code points are assigned to encoded characters.

Code unit
The minimal bit combination that can represent a unit of encoded text for processing or interchange. Depending on the encoding this could be: 8-bit code units in the UTF-8 (char), 16-bit code units in the UTF-16 (wchar), and 32-bit code units in the UTF-32 (dchar). Note that in UTF-32, a code unit is a code point and is represented by the D dchar type.

Combining character
A character with the General Category of Combining Mark(M).

Combining class
A numerical value used by the Unicode Canonical Ordering Algorithm to determine which sequences of combining marks are to be considered canonically equivalent and which are not.

Compatibility decomposition
The decomposition of a character or character sequence that results from recursively applying both the compatibility mappings and the canonical mappings found in the Unicode Character Database, and those described in Conjoining Jamo Behavior, until no characters can be further decomposed.

Compatibility equivalent
Two character sequences are said to be compatibility equivalents if their full compatibility decompositions are identical.

Encoded character
An association (or mapping) between an abstract character and a code point.

Glyph
The actual, concrete image of a glyph representation having been rasterized or otherwise imaged onto some display surface.

Grapheme base
A character with the property Grapheme_Base, or any standard Korean syllable block.

Grapheme cluster
Defined as the text between grapheme boundaries as specified by Unicode Standard Annex #29, Unicode text segmentation. Important general properties of a grapheme:

This module defines a number of primitives that work with graphemes: Grapheme, decodeGrapheme and graphemeStride. All of them use extended grapheme boundaries as defined in the aforementioned standard annex.

Nonspacing mark
A combining character with the General Category of Nonspacing Mark (Mn) or Enclosing Mark (Me).

Spacing mark
A combining character that is not a nonspacing mark.
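To make the distinction between code units, code points and grapheme clusters concrete, here is a minimal sketch that measures the same text in all three ways. It relies on std.utf.count and std.range.walkLength in addition to byGrapheme from this module.

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : count;

void main()
{
    // 'A' + combining ring above, then Gothic letter ahsa (outside the BMP)
    string  s8  = "A\u030A\U00010330";
    wstring s16 = "A\u030A\U00010330"w;
    dstring s32 = "A\u030A\U00010330"d;

    // the number of code units depends on the encoding
    assert(s8.length == 7);  // 1 + 2 + 4 UTF-8 code units
    assert(s16.length == 4); // 1 + 1 + 2 UTF-16 code units
    assert(s32.length == 3); // one UTF-32 code unit per code point

    // the number of code points does not
    assert(count(s8) == 3);

    // 'A' and the combining mark form a single grapheme cluster
    assert(s8.byGrapheme.walkLength == 2);
}
```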

Normalization

The concepts of canonical equivalent or compatibility equivalent characters in the Unicode Standard make it necessary to have a full, formal definition of equivalence for Unicode strings. String equivalence is determined by a process called normalization, whereby strings are converted into forms which are compared directly for identity. This is the primary goal of the normalization process; see the function normalize to convert into any of the four defined forms.
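The following sketch shows what comparing "directly for identity" means in practice. The sample strings are chosen for illustration: a precomposed letter versus its decomposed spelling, two marks with different combining classes in either order, and the "fi" ligature (U+FB01) versus plain letters.

```d
import std.uni;

void main()
{
    // canonical equivalents: different code points, same decomposition
    assert("\u00C5" != "A\u030A");
    assert(normalize!NFD("\u00C5") == normalize!NFD("A\u030A"));

    // marks with different combining classes may come in any order
    assert(combiningClass('\u0308') == 230); // combining diaeresis
    assert(combiningClass('\u0323') == 220); // combining dot below
    assert(normalize!NFD("a\u0308\u0323") == normalize!NFD("a\u0323\u0308"));

    // compatibility equivalents are only unified by the "K" forms
    assert(normalize!NFD("\uFB01") != normalize!NFD("fi"));
    assert(normalize!NFKD("\uFB01") == normalize!NFKD("fi"));
}
```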

A very important attribute of the Unicode Normalization Forms is that they must remain stable between versions of the Unicode Standard. A Unicode string normalized to a particular Unicode Normalization Form in one version of the standard is guaranteed to remain in that Normalization Form for implementations of future versions of the standard.

The Unicode Standard specifies four normalization forms. Informally, two of these forms are defined by maximal decomposition of equivalent sequences, and two of these forms are defined by maximal composition of equivalent sequences.

The choice of the normalization form depends on the particular use case. NFC is the best form for general text, since it's more compatible with strings converted from legacy encodings. NFKC is the preferred form for identifiers, especially where there are security concerns. NFD and NFKD are the most useful for internal processing.
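For a side-by-side view of the four forms, the sketch below normalizes one short string every way. The input is made up for the example: 'A' with a combining ring above, followed by the "fi" ligature (U+FB01), which has only a compatibility decomposition.

```d
import std.uni;

void main()
{
    string s = "A\u030A\uFB01";

    assert(normalize!NFC(s)  == "\u00C5\uFB01");  // composed, ligature kept
    assert(normalize!NFD(s)  == "A\u030A\uFB01"); // decomposed, ligature kept
    assert(normalize!NFKC(s) == "\u00C5fi");      // composed, ligature expanded
    assert(normalize!NFKD(s) == "A\u030Afi");     // decomposed, ligature expanded
}
```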

Construction of lookup tables

The Unicode standard describes a set of algorithms that depend on having the ability to quickly look up various properties of a code point. Given the codespace of about 1 million code points, it is not a trivial task to provide a space-efficient solution for the multitude of properties.

Common approaches such as hash-tables or binary search over sorted code point intervals (as in InversionList) are insufficient. Hash-tables have enormous memory footprint and binary search over intervals is not fast enough for some heavy-duty algorithms.

The recommended solution (see Unicode Implementation Guidelines) is using multi-stage tables that are an implementation of the Trie data structure with integer keys and a fixed number of stages. For the remainder of the section this will be called a fixed trie. The following describes a particular implementation that is aimed at speed of access at the expense of ideal size savings.

Taking a 2-level Trie as an example, the principle of operation is as follows. Split the number of bits in a key (code point, 21 bits) into 2 components (e.g. 13 and 8). The first is the number of bits in the index of the trie and the other is the number of bits in each page of the trie. The layout of the trie is then an array of size 2^^bits-of-index followed by an array of memory chunks of size 2^^bits-of-page/bits-per-element.

The number of pages is variable (but not less than 1), unlike the number of entries in the index. Every slot of the index has to contain the number of a page that is present. The lookup is then just a couple of operations: slice the upper bits, look up an index for these, take the page at this index and use the lower bits as an offset within this page. Assuming that pages are laid out consecutively in one array, pages, the pseudo-code is:

auto elemsPerPage = (2 ^^ bits_per_page) / Value.sizeOfInBits;
pages[index[n >> bits_per_page]][n & (elemsPerPage - 1)];

Where if elemsPerPage is a power of 2 the whole process is a handful of simple instructions and 2 array reads. Subsequent levels of the trie are introduced by recursing on this notion - the index array is treated as values. The number of bits in index is then again split into 2 parts, with pages over 'current-index' and the new 'upper-index'.

For completeness a level 1 trie is simply an array. The current implementation takes advantage of bit-packing values when the range is known to be limited in advance (such as bool). See also BitPacked for enforcing it manually. The major size advantage however comes from the fact that multiple identical pages on every level are merged by construction.
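The scheme can be sketched in a few dozen lines. The code below is not the implementation used by this module: TwoLevel and build are hypothetical names, the split is fixed at 13 index bits plus 8 page bits (21 in total), and values are stored as one bool per element without bit-packing. It only illustrates the lookup and the merging of identical pages.

```d
// Hypothetical two-level table over 21-bit keys (not part of std.uni).
struct TwoLevel
{
    enum pageBits = 8;
    enum pageSize = 1 << pageBits;

    ushort[] index; // 2^^13 slots, each holding a page number
    bool[] pages;   // unique pages laid out consecutively

    // valid for keys below 2^^21
    bool opIndex(uint key) const
    {
        immutable page = index[key >> pageBits];
        return pages[page * pageSize + (key & (pageSize - 1))];
    }
}

// Fills the table by evaluating pred for every key in the 21-bit range.
TwoLevel build(alias pred)()
{
    TwoLevel t;
    t.index = new ushort[1 << 13];
    foreach (i; 0 .. t.index.length)
    {
        bool[TwoLevel.pageSize] page;
        foreach (j; 0 .. TwoLevel.pageSize)
            page[j] = pred(cast(uint)(i * TwoLevel.pageSize + j));

        // reuse an identical page if one has already been stored
        immutable stored = t.pages.length / TwoLevel.pageSize;
        size_t slot = 0;
        while (slot < stored &&
               t.pages[slot * TwoLevel.pageSize .. (slot + 1) * TwoLevel.pageSize] != page[])
            slot++;
        if (slot == stored)
            t.pages ~= page[];
        t.index[i] = cast(ushort) slot;
    }
    return t;
}

void main()
{
    auto lower = build!((uint cp) => cp >= 'a' && cp <= 'z')();
    assert(lower['a'] && lower['z']);
    assert(!lower['A'] && !lower[0x10FFFF]);
    // only two distinct pages exist: the one with a-z and the all-false one
    assert(lower.pages.length == 2 * TwoLevel.pageSize);
}
```

Merging pages is where the space saving comes from: 8192 index slots end up sharing just two pages here.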

The process of constructing a trie is more involved and is hidden from the user in the form of the convenience functions codepointTrie, codepointSetTrie and the even more convenient toTrie. In general a set or built-in AA with dchar type can be turned into a trie. The trie object in this module is read-only (immutable); it's effectively frozen after construction.

Unicode properties

This is a full list of Unicode properties accessible through unicode with specific helpers per category nested within. Consult the CLDR utility when in doubt about the contents of a particular set.

General category sets listed below are only accessible with the unicode shorthand accessor.

General category
Abb.  Long form         Abb.  Long form              Abb.  Long form
L     Letter            Cn    Unassigned             Po    Other_Punctuation
Ll    Lowercase_Letter  Co    Private_Use            Ps    Open_Punctuation
Lm    Modifier_Letter   Cs    Surrogate              S     Symbol
Lo    Other_Letter      N     Number                 Sc    Currency_Symbol
Lt    Titlecase_Letter  Nd    Decimal_Number         Sk    Modifier_Symbol
Lu    Uppercase_Letter  Nl    Letter_Number          Sm    Math_Symbol
M     Mark              No    Other_Number           So    Other_Symbol
Mc    Spacing_Mark      P     Punctuation            Z     Separator
Me    Enclosing_Mark    Pc    Connector_Punctuation  Zl    Line_Separator
Mn    Nonspacing_Mark   Pd    Dash_Punctuation       Zp    Paragraph_Separator
C     Other             Pe    Close_Punctuation      Zs    Space_Separator
Cc    Control           Pf    Final_Punctuation      -     Any
Cf    Format            Pi    Initial_Punctuation    -     ASCII

Sets for other commonly useful properties that are accessible with unicode:

Common binary properties
Name Name Name
Alphabetic Ideographic Other_Uppercase
ASCII_Hex_Digit IDS_Binary_Operator Pattern_Syntax
Bidi_Control ID_Start Pattern_White_Space
Cased IDS_Trinary_Operator Quotation_Mark
Case_Ignorable Join_Control Radical
Dash Logical_Order_Exception Soft_Dotted
Default_Ignorable_Code_Point Lowercase STerm
Deprecated Math Terminal_Punctuation
Diacritic Noncharacter_Code_Point Unified_Ideograph
Extender Other_Alphabetic Uppercase
Grapheme_Base Other_Default_Ignorable_Code_Point Variation_Selector
Grapheme_Extend Other_Grapheme_Extend White_Space
Grapheme_Link Other_ID_Continue XID_Continue
Hex_Digit Other_ID_Start XID_Start
Hyphen Other_Lowercase
ID_Continue Other_Math

Below is the table with block names accepted by unicode.block. Note that the shorthand version unicode requires "In" to be prepended to the names of blocks so as to disambiguate scripts and blocks.

Blocks
Aegean Numbers Ethiopic Extended Mongolian
Alchemical Symbols Ethiopic Extended-A Musical Symbols
Alphabetic Presentation Forms Ethiopic Supplement Myanmar
Ancient Greek Musical Notation General Punctuation Myanmar Extended-A
Ancient Greek Numbers Geometric Shapes New Tai Lue
Ancient Symbols Georgian NKo
Arabic Georgian Supplement Number Forms
Arabic Extended-A Glagolitic Ogham
Arabic Mathematical Alphabetic Symbols Gothic Ol Chiki
Arabic Presentation Forms-A Greek and Coptic Old Italic
Arabic Presentation Forms-B Greek Extended Old Persian
Arabic Supplement Gujarati Old South Arabian
Armenian Gurmukhi Old Turkic
Arrows Halfwidth and Fullwidth Forms Optical Character Recognition
Avestan Hangul Compatibility Jamo Oriya
Balinese Hangul Jamo Osmanya
Bamum Hangul Jamo Extended-A Phags-pa
Bamum Supplement Hangul Jamo Extended-B Phaistos Disc
Basic Latin Hangul Syllables Phoenician
Batak Hanunoo Phonetic Extensions
Bengali Hebrew Phonetic Extensions Supplement
Block Elements High Private Use Surrogates Playing Cards
Bopomofo High Surrogates Private Use Area
Bopomofo Extended Hiragana Rejang
Box Drawing Ideographic Description Characters Rumi Numeral Symbols
Brahmi Imperial Aramaic Runic
Braille Patterns Inscriptional Pahlavi Samaritan
Buginese Inscriptional Parthian Saurashtra
Buhid IPA Extensions Sharada
Byzantine Musical Symbols Javanese Shavian
Carian Kaithi Sinhala
Chakma Kana Supplement Small Form Variants
Cham Kanbun Sora Sompeng
Cherokee Kangxi Radicals Spacing Modifier Letters
CJK Compatibility Kannada Specials
CJK Compatibility Forms Katakana Sundanese
CJK Compatibility Ideographs Katakana Phonetic Extensions Sundanese Supplement
CJK Compatibility Ideographs Supplement Kayah Li Superscripts and Subscripts
CJK Radicals Supplement Kharoshthi Supplemental Arrows-A
CJK Strokes Khmer Supplemental Arrows-B
CJK Symbols and Punctuation Khmer Symbols Supplemental Mathematical Operators
CJK Unified Ideographs Lao Supplemental Punctuation
CJK Unified Ideographs Extension A Latin-1 Supplement Supplementary Private Use Area-A
CJK Unified Ideographs Extension B Latin Extended-A Supplementary Private Use Area-B
CJK Unified Ideographs Extension C Latin Extended Additional Syloti Nagri
CJK Unified Ideographs Extension D Latin Extended-B Syriac
Combining Diacritical Marks Latin Extended-C Tagalog
Combining Diacritical Marks for Symbols Latin Extended-D Tagbanwa
Combining Diacritical Marks Supplement Lepcha Tags
Combining Half Marks Letterlike Symbols Tai Le
Common Indic Number Forms Limbu Tai Tham
Control Pictures Linear B Ideograms Tai Viet
Coptic Linear B Syllabary Tai Xuan Jing Symbols
Counting Rod Numerals Lisu Takri
Cuneiform Low Surrogates Tamil
Cuneiform Numbers and Punctuation Lycian Telugu
Currency Symbols Lydian Thaana
Cypriot Syllabary Mahjong Tiles Thai
Cyrillic Malayalam Tibetan
Cyrillic Extended-A Mandaic Tifinagh
Cyrillic Extended-B Mathematical Alphanumeric Symbols Transport And Map Symbols
Cyrillic Supplement Mathematical Operators Ugaritic
Deseret Meetei Mayek Unified Canadian Aboriginal Syllabics
Devanagari Meetei Mayek Extensions Unified Canadian Aboriginal Syllabics Extended
Devanagari Extended Meroitic Cursive Vai
Dingbats Meroitic Hieroglyphs Variation Selectors
Domino Tiles Miao Variation Selectors Supplement
Egyptian Hieroglyphs Miscellaneous Mathematical Symbols-A Vedic Extensions
Emoticons Miscellaneous Mathematical Symbols-B Vertical Forms
Enclosed Alphanumerics Miscellaneous Symbols Yijing Hexagram Symbols
Enclosed Alphanumeric Supplement Miscellaneous Symbols and Arrows Yi Radicals
Enclosed CJK Letters and Months Miscellaneous Symbols And Pictographs Yi Syllables
Enclosed Ideographic Supplement Miscellaneous Technical
Ethiopic Modifier Tone Letters

Below is the table with script names accepted by unicode.script and by the shorthand version unicode:

Scripts
Arabic Hanunoo Old_Italic
Armenian Hebrew Old_Persian
Avestan Hiragana Old_South_Arabian
Balinese Imperial_Aramaic Old_Turkic
Bamum Inherited Oriya
Batak Inscriptional_Pahlavi Osmanya
Bengali Inscriptional_Parthian Phags_Pa
Bopomofo Javanese Phoenician
Brahmi Kaithi Rejang
Braille Kannada Runic
Buginese Katakana Samaritan
Buhid Kayah_Li Saurashtra
Canadian_Aboriginal Kharoshthi Sharada
Carian Khmer Shavian
Chakma Lao Sinhala
Cham Latin Sora_Sompeng
Cherokee Lepcha Sundanese
Common Limbu Syloti_Nagri
Coptic Linear_B Syriac
Cuneiform Lisu Tagalog
Cypriot Lycian Tagbanwa
Cyrillic Lydian Tai_Le
Deseret Malayalam Tai_Tham
Devanagari Mandaic Tai_Viet
Egyptian_Hieroglyphs Meetei_Mayek Takri
Ethiopic Meroitic_Cursive Tamil
Georgian Meroitic_Hieroglyphs Telugu
Glagolitic Miao Thaana
Gothic Mongolian Thai
Greek Myanmar Tibetan
Gujarati New_Tai_Lue Tifinagh
Gurmukhi Nko Ugaritic
Han Ogham Vai
Hangul Ol_Chiki Yi

Below is the table of names accepted by unicode.hangulSyllableType.

Hangul syllable type
Abb. Long form
L Leading_Jamo
LV LV_Syllable
LVT LVT_Syllable
T Trailing_Jamo
V Vowel_Jamo

References:
ASCII Table, Wikipedia, The Unicode Consortium, Unicode normalization forms, Unicode text segmentation, Unicode Implementation Guidelines, Unicode Conformance

Trademarks:
Unicode(tm) is a trademark of Unicode, Inc.

License:
Boost License 1.0.

Authors:
Dmitry Olshansky

Source:
std/uni.d

Standards:
Unicode v6.2

dchar lineSep;
Constant code point (0x2028) - line separator.

dchar paraSep;
Constant code point (0x2029) - paragraph separator.

template isCodepointSet(T)
Tests if T is some kind of set of code points. Intended for template constraints.
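A minimal sketch of the intended use in a template constraint follows. The helper countIn is hypothetical and exists only for this example.

```d
import std.uni;

// Hypothetical helper: counts the code points of text that belong to set.
size_t countIn(Set)(Set set, string text)
if (isCodepointSet!Set)
{
    size_t n;
    foreach (dchar ch; text)
        if (set[ch])
            n++;
    return n;
}

void main()
{
    static assert(isCodepointSet!CodepointSet);
    static assert(!isCodepointSet!(uint[]));

    // six Cyrillic letters, the rest is punctuation, space and Latin
    assert(countIn(unicode.Cyrillic, "Привет, world") == 6);
}
```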

template isIntegralPair(T, V = uint)
Tests if T is a pair of integers that implicitly convert to V. The following code must compile for any pair T:
(T x){ V a = x[0]; V b = x[1];}
The following must not compile:
(T x){ V c = x[2];}
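Applied to a few std.typecons.Tuple instantiations, the two conditions work out as in the sketch below. The checks are compile-time only, so main is empty.

```d
import std.typecons : Tuple;
import std.uni : isIntegralPair;

// exactly two elements, both implicitly convertible to uint
static assert(isIntegralPair!(Tuple!(uint, uint)));
static assert(isIntegralPair!(Tuple!(dchar, dchar)));

// a third element is accessible, so this is not a pair
static assert(!isIntegralPair!(Tuple!(uint, uint, uint)));

// long does not implicitly convert to uint
static assert(!isIntegralPair!(Tuple!(long, long)));

void main() {}
```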

alias CodepointSet = InversionList!(GcPolicy).InversionList;
The recommended default type for set of code points. For details, see the current implementation: InversionList.

struct CodepointInterval;
The recommended type of std.typecons.Tuple to represent [a, b) intervals of code points. As used in InversionList. Any interval type should pass isIntegralPair trait.

struct InversionList(SP = GcPolicy);

InversionList is a set of code points represented as an array of open-right [a, b) intervals (see CodepointInterval above). The name comes from the way the representation reads left to right. For instance a set of all values [10, 50), [80, 90), plus a singular value 60 looks like this:

10, 50, 60, 61, 80, 90

The way to read this is: start with negative meaning that all numbers smaller than the next one are not present in this set (and positive - the contrary). Then switch positive/negative after each number passed from left to right.

This way negative spans until 10, then positive until 50, then negative until 60, then positive until 61, and so on. As seen, this provides space-efficient storage of highly redundant data that comes in long runs, a description that Unicode character properties fit nicely. The technique itself can be seen as a variation on run-length encoding (RLE).
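The reading rule translates directly into a membership test: count how many boundaries are less than or equal to the value, and the value is in the set exactly when that count is odd. The sketch below uses a plain sorted array and std.range.assumeSorted for the binary search. The function contains is a hypothetical helper, not the code used by InversionList.

```d
import std.range : assumeSorted;

// Hypothetical helper (not part of std.uni).
bool contains(const(uint)[] bounds, uint x)
{
    // number of boundaries passed when walking left to right up to x
    immutable passed = bounds.assumeSorted.lowerBound(x + 1).length;
    // each boundary flips negative/positive, starting from negative
    return passed % 2 == 1;
}

void main()
{
    // [10, 50), the single value 60, and [80, 90)
    uint[] bounds = [10, 50, 60, 61, 80, 90];

    assert(!contains(bounds, 9));
    assert(contains(bounds, 10));
    assert(contains(bounds, 49));
    assert(!contains(bounds, 50));
    assert(contains(bounds, 60));
    assert(!contains(bounds, 61));
    assert(contains(bounds, 89));
    assert(!contains(bounds, 90));
}
```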

Sets are value types (just like int is) thus they are never aliased.

Example:
auto a = CodepointSet('a', 'z'+1);
auto b = CodepointSet('A', 'Z'+1);
auto c = a;
a = a | b;
assert(a == CodepointSet('A', 'Z'+1, 'a', 'z'+1));
assert(a != c);

See also unicode for simpler construction of sets from predefined ones.

Memory usage is 6 bytes for each contiguous interval in a set. The value semantics are achieved by using the COW (copy-on-write) technique, and thus it's not safe to cast this type to shared.

Note:

It's not recommended to rely on the template parameters or the exact type of a current code point set in std.uni. The type and parameters may change when the standard allocators design is finalized. Use isCodepointSet with templates or just stick with the default alias CodepointSet throughout the whole code base.

this(Set)(Set set) if (isCodepointSet!Set);
Construct from another code point set of any type.

this(Range)(Range intervals) if (isForwardRange!Range && isIntegralPair!(ElementType!Range));
Construct a set from a forward range of code point intervals.

this()(uint[] intervals...);
Construct a set from plain values of code point intervals.

Example:
import std.algorithm;
auto set = CodepointSet('a', 'z'+1, 'а', 'я'+1);
foreach(v; 'a'..'z'+1)
    assert(set[v]);
// Cyrillic lowercase interval
foreach(v; 'а'..'я'+1)
    assert(set[v]);
//specific order is not required, intervals may intersect
auto set2 = CodepointSet('а', 'я'+1, 'a', 'd', 'b', 'z'+1);
//the same end result
assert(set2.byInterval.equal(set.byInterval));

@property auto byInterval();
Get range that spans all of the code point intervals in this InversionList.

Example:
import std.algorithm, std.typecons;
auto set = CodepointSet('A', 'D'+1, 'a', 'd'+1);
assert(set.byInterval.equal([tuple('A', 'E'), tuple('a', 'e')]));

const bool opIndex(uint val);
Tests the presence of code point val in this set.

Example:
auto gothic = unicode.Gothic;
// Gothic letter ahsa
assert(gothic['\U00010330']);
// no ascii in Gothic obviously
assert(!gothic['$']);

size_t length();
Number of code points in this set.

This opBinary(string op, U)(U rhs) if (isCodepointSet!U || is(U : dchar));

Sets support natural syntax for set algebra, namely:

Operator Math notation Description
& a ∩ b intersection
| a ∪ b union
- a ∖ b subtraction
~ a ~ b symmetric set difference i.e. (a ∪ b) \ (a ∩ b)

Example:
import std.algorithm : equal;
import std.range : iota;

auto lower = unicode.LowerCase;
auto upper = unicode.UpperCase;
auto ascii = unicode.ASCII;

assert((lower & upper).empty); // no intersection
auto lowerASCII = lower & ascii;
assert(lowerASCII.byCodepoint.equal(iota('a', 'z'+1)));
// throw away all of the lowercase ASCII
assert((ascii - lower).length == 128 - 26);

auto onlyOneOf = lower ~ ascii;
assert(!onlyOneOf['Δ']); // not ASCII and not lowercase
assert(onlyOneOf['$']); // ASCII and not lowercase
assert(!onlyOneOf['a']); // ASCII and lowercase
assert(onlyOneOf['я']); // not ASCII but lowercase

// throw away all cased letters from ASCII
auto noLetters = ascii - (lower | upper);
assert(noLetters.length == 128 - 26*2);

This opOpAssign(string op, U)(U rhs) if (isCodepointSet!U || is(U : dchar));
The 'op=' versions of the above overloaded operators.
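A short sketch of the in-place forms, building up and then trimming a set of hexadecimal digits. The set contents are picked for the example only.

```d
import std.uni;

void main()
{
    auto hex = CodepointSet('0', '9' + 1);
    hex |= CodepointSet('a', 'f' + 1); // union in place
    hex |= CodepointSet('A', 'F' + 1);
    assert(hex.length == 22);

    hex -= CodepointSet('a', 'f' + 1); // subtraction in place
    assert(hex.length == 16);

    hex &= CodepointSet('0', '9' + 1); // intersection in place
    assert(hex.length == 10);
    assert(hex['7'] && !hex['A']);
}
```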

const bool opBinaryRight(string op : "in", U)(U ch) if (is(U : dchar));
Tests the presence of codepoint ch in this set, the same as opIndex.

Examples:
assert('я' in unicode.Cyrillic);
assert(!('z' in unicode.Cyrillic));

auto opUnary(string op : "!")();
Obtains a set that is the inversion of this set. See also inverted.

@property auto byCodepoint();
A range that spans each code point in this set.

Example:
import std.algorithm : equal;
import std.range : iota;
auto set = unicode.ASCII;
assert(set.byCodepoint.equal(iota(0, 0x80)));

void toString(scope void delegate(const(char)[]) sink);

Obtain a textual representation of this set in the form of open-right intervals and feed it to sink.

Used by various standard formatting facilities such as std.format.formattedWrite, std.stdio.write, std.stdio.writef, std.conv.to and others.

Example:
import std.conv;
assert(unicode.ASCII.to!string == "[0..128)");

ref auto add()(uint a, uint b);
Add an interval [a, b) to this set.

Example:
CodepointSet someSet;
someSet.add('0', '5').add('A','Z'+1);
someSet.add('5', '9'+1);
assert(someSet['0']);
assert(someSet['5']);
assert(someSet['9']);
assert(someSet['Z']);

@property auto inverted();
Obtains a set that is the inversion of this set.

See the '!' opUnary for the same but using operators.

Example:
auto set = unicode.ASCII;
// union with the inverse gets all of the code points in the Unicode
assert((set | set.inverted).length == 0x110000);
// no intersection with the inverse
assert((set & set.inverted).empty);

string toSourceCode(string funcName = "");
Generates a string with the D source code of a unary function named funcName that takes a single dchar argument. If funcName is empty, the code is adjusted to be a lambda function.

The function generated tests if the code point passed belongs to this set or not. The result is to be used with string mixin. The intended usage area is aggressive optimization via meta programming in parser generators and the like.

Note:
Use with care for relatively small or regular sets. It could end up being slower than just using multi-staged tables.

Example:
import std.stdio;

// construct set directly from [a, b) intervals
auto set = CodepointSet(10, 12, 45, 65, 100, 200);
writeln(set);
writeln(set.toSourceCode("func"));

The above outputs something along the lines of:
bool func(dchar ch)
{
    if(ch < 45)
    {
        if(ch == 10 || ch == 11) return true;
        return false;
    }
    else if (ch < 65) return true;
    else
    {
        if(ch < 100) return false;
        if(ch < 200) return true;
        return false;
    }
}

const bool empty();
True if this set doesn't contain any code points.

Example:
CodepointSet emptySet;
assert(emptySet.length == 0);
assert(emptySet.empty);

template codepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
A shorthand for creating a custom multi-level fixed Trie from a CodepointSet. sizes are numbers of bits per level, with the most significant bits used first.

Note:
The sum of sizes must equal 21.

See Also:
toTrie, which is even simpler.

Example:
{
    import std.stdio;
    auto set = unicode("Number");
    auto trie = codepointSetTrie!(8, 5, 8)(set);
    writeln("Input code points to test:");
    foreach(line; stdin.byLine)
    {
        int count=0;
        foreach(dchar ch; line)
            if(trie[ch])// is number
                count++;
        writefln("Contains %d number code points.", count);
    }
}

template CodepointSetTrie(sizes...) if (sumOfIntegerTuple!sizes == 21)
Type of Trie generated by codepointSetTrie function.

template codepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)
A slightly more general tool for building fixed Trie for the Unicode data.

Specifically, unlike codepointSetTrie, it allows creating mappings of dchar to an arbitrary type T.

Note:
Overload taking CodepointSets will naturally convert only to bool mapping Tries.

Example:
import std.algorithm : count, max;

// pick characters from the Greek script
auto set = unicode.Greek;

// a user-defined property (or an expensive function)
// that we want to look up
static uint luckFactor(dchar ch)
{
    // here we consider a character lucky
    // if its code point has a lot of identical hex-digits
    // e.g. arabic letter DDAL (\u0688, i.e. 000688) has a "luck factor" of 3
    ubyte[6] nibbles; // 6 4-bit chunks of code point
    uint value = ch;
    foreach(i; 0..6)
    {
        nibbles[i] = value & 0xF;
        value >>= 4;
    }
    uint luck;
    foreach(n; nibbles)
        luck = cast(uint)max(luck, count(nibbles[], n));
    return luck;
}

// only unsigned built-ins are supported at the moment
alias LuckFactor = BitPacked!(uint, 3);

// create a temporary associative array (AA)
LuckFactor[dchar] map;
foreach(ch; set.byCodepoint)
    map[ch] = luckFactor(ch);

// bits per stage are chosen randomly, feel free to optimize
auto trie = codepointTrie!(LuckFactor, 8, 5, 8)(map);

// from now on the AA is not needed
foreach(ch; set.byCodepoint)
    assert(trie[ch] == luckFactor(ch)); // verify
// CJK is not Greek, thus it has the default value
assert(trie['\u4444'] == 0);
// and here is a couple of quite lucky Greek characters:
// Greek small letter epsilon with dasia
assert(trie['\u1F11'] == 3);
// Ancient Greek metretes sign
assert(trie['\U00010181'] == 3);

template CodepointTrie(T, sizes...) if (sumOfIntegerTuple!sizes == 21)
Type of Trie as generated by codepointTrie function.

auto toTrie(size_t level, Set)(Set set) if (isCodepointSet!Set);
Convenience function to construct optimal configurations for packed Trie from any set of code points.

The parameter level indicates the number of trie levels to use, allowed values are: 1, 2, 3 or 4. Levels represent different trade-offs speed-size wise.

Level 1 is fastest and the most memory hungry (a bit array).

Level 4 is the slowest and has the smallest footprint.

See the Synopsis section for example.

Note:
Level 4 stays very practical (being faster and more predictable) compared to using direct lookup on the set itself.
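Whatever the level, the resulting trie answers the same queries as the set it was built from. A minimal check, using the Nd (Decimal_Number) general category as sample data:

```d
import std.uni;

void main()
{
    auto digits = unicode.Nd;
    auto fast  = toTrie!1(digits); // largest, a plain bit array
    auto small = toTrie!4(digits); // most compact

    // ASCII digit, ASCII letter, Arabic-Indic digit three,
    // Devanagari digit one, space
    foreach (dchar ch; "7x\u0663\u0967 ")
    {
        assert(fast[ch] == digits[ch]);
        assert(small[ch] == digits[ch]);
    }
    assert(fast['7'] && small['\u0663']);
    assert(!fast['x'] && !small['x']);
}
```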

auto toDelegate(Set)(Set set) if (isCodepointSet!Set);

Builds a Trie with typically optimal speed-size trade-off and wraps it into a delegate of the following type: bool delegate(dchar ch).

Effectively this creates a 'tester' lambda suitable for algorithms like std.algorithm.find that take unary predicates.

See the Synopsis section for example.
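Besides find, the delegate plugs into other std.algorithm functions that accept a unary predicate. A brief sketch with count and filter; the sample strings are made up for the example:

```d
import std.algorithm : count, equal, filter;
import std.uni;

void main()
{
    auto isCyrillic = toDelegate(unicode.Cyrillic);

    // six Cyrillic letters, the space and digits don't match
    assert(count!isCyrillic("Москва 2024") == 6);

    // keep only the Cyrillic part of a mixed string
    assert("abc где".filter!isCyrillic.equal("где"));
}
```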

struct unicode;
A single entry point to lookup Unicode code point sets by name or alias of a block, script or general category.

It uses well defined standard rules of property name lookup. This includes fuzzy matching of names, so that 'White_Space', 'white-SpAce' and 'whitespace' are all considered equal and yield the same set of white space characters.
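The name matching can be checked directly. The sketch below compares the three spellings mentioned above through the run-time lookup and shows that an unknown name results in an exception. NoSuchSetName is a deliberately invalid name.

```d
import std.exception : collectException;
import std.uni;

void main()
{
    auto ws = unicode.White_Space;        // name checked at compile time
    assert(unicode("White_Space") == ws); // same set, looked up at run time
    assert(unicode("white-SpAce") == ws);
    assert(unicode("whitespace") == ws);

    assert(ws[' ']);
    assert(ws['\u2028']); // line separator
    assert(!ws['_']);

    // the run-time lookup throws if nothing matches
    assert(collectException(unicode("NoSuchSetName")) !is null);
}
```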

@property auto opDispatch(string name)();
Performs the lookup of set of code points with compile-time correctness checking. This short-cut version combines 3 searches: across blocks, scripts, and common binary properties.

Note that since scripts and blocks overlap the usual trick to disambiguate is used - to get a block use unicode.InBlockName, to search a script use unicode.ScriptName.

See also block, script and (not included in this search) hangulSyllableType.

Example:
auto ascii = unicode.ASCII;
assert(ascii['A']);
assert(ascii['~']);
assert(!ascii['\u00e0']);
// matching is case-insensitive
assert(ascii == unicode.ascII);
assert(!ascii['à']);
// underscores, '-' and whitespace in names are ignored too
auto latin = unicode.in_latin1_Supplement;
assert(latin['à']);
assert(!latin['$']);
// BTW Latin 1 Supplement is a block, hence "In" prefix
assert(latin == unicode("In Latin 1 Supplement"));
import std.exception;
// run-time look up throws if no such set is found
assert(collectException(unicode("InCyrilliac")));

auto opCall(C)(in C[] name) if (is(C : dchar));
The same lookup across blocks, scripts, or binary properties, but performed at run-time. This version is provided for cases where name is not known beforehand; otherwise compile-time checked opDispatch is typically a better choice.

See the table of properties for available sets.

struct block;
Narrows down the search for sets of code points to all Unicode blocks.

Note:
Here block names are unambiguous as no scripts are searched, so simply use the unicode.block.BlockName notation.

See table of properties for available sets.

Example:
// use .block for explicitness
assert(unicode.block.Greek_and_Coptic == unicode.InGreek_and_Coptic);

struct script;
Narrows down the search for sets of code points to all Unicode scripts.

See the table of properties for available sets.

Example:
auto arabicScript = unicode.script.arabic;
auto arabicBlock = unicode.block.arabic;
// there is an intersection between script and block
assert(arabicBlock['؁']);
assert(arabicScript['؁']);
// but they are different
assert(arabicBlock != arabicScript);
assert(arabicBlock == unicode.inArabic);
assert(arabicScript == unicode.arabic);

struct hangulSyllableType;
Fetch a set of code points that have the given hangul syllable type.

Other non-binary properties (once supported) follow the same notation - unicode.propertyName.propertyValue for compile-time checked access and unicode.propertyName(propertyValue) for run-time checked one.

See the table of properties for available sets.

Example:
// L here is syllable type not Letter as in unicode.L short-cut
auto leadingConsonant = unicode.hangulSyllableType("L");
// check that some leading consonants are present
foreach(jamo; '\u1110'..'\u115F')
    assert(leadingConsonant[jamo]);
assert(leadingConsonant == unicode.hangulSyllableType.L);

size_t graphemeStride(C)(in C[] input, size_t index) if (is(C : dchar));
Returns the length of the grapheme cluster starting at index. Both the resulting length and the index are measured in code units.

Example:
// ASCII as usual is 1 code unit, 1 code point etc.
assert(graphemeStride("  ", 1) == 1);
// A + combining ring above
string city = "A\u030Arhus";
size_t first = graphemeStride(city, 0);
assert(first == 3); //\u030A has 2 UTF-8 code units
assert(city[0..first] == "A\u030A");
assert(city[first..$] == "rhus");
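
Because the stride is measured in code units, it can be added directly to an index to hop from one cluster boundary to the next. The following is a minimal sketch of that idea; countGraphemes is a hypothetical helper, not part of the module.

```d
import std.uni;

// Hypothetical helper: count grapheme clusters by advancing
// the index by one stride at a time.
size_t countGraphemes(string s)
{
    size_t count = 0;
    for (size_t i = 0; i < s.length; i += graphemeStride(s, i))
        ++count;
    return count;
}

void main()
{
    assert(countGraphemes("") == 0);
    assert(countGraphemes("abc") == 3);
    // A + combining ring above counts as a single cluster
    assert(countGraphemes("A\u030Arhus") == 5);
}
```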

Grapheme decodeGrapheme(Input)(ref Input inp) if (isInputRange!Input && is(Unqual!(ElementType!Input) == dchar));
Reads one full grapheme cluster from an input range of dchar inp.

For examples see the Grapheme below.

Note:
This function modifies inp and thus inp must be an L-value.
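
A short sketch of the side effect described in the note: after the call, the input has been advanced past the decoded cluster. The variable names are illustrative only.

```d
import std.uni;

void main()
{
    string s = "e\u0301a"; // 'e' + combining acute accent, then 'a'
    auto first = decodeGrapheme(s); // s must be an lvalue, it is advanced
    assert(first.length == 2);
    assert(first[0] == 'e' && first[1] == '\u0301');
    // only the second cluster is left in the input
    assert(s == "a");
}
```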

auto byGrapheme(Range)(Range range) if (isInputRange!Range && is(Unqual!(ElementType!Range) == dchar));

Iterate a string by grapheme.

Useful for doing string manipulation that needs to be aware of graphemes.

See Also:
byCodePoint

Examples:
import std.algorithm : equal;
import std.range : drop, take, walkLength;

auto text = "noe\u0308l"; // noël using e + combining diaeresis
assert(text.walkLength == 5); // 5 code points

auto gText = text.byGrapheme;
assert(gText.walkLength == 4); // 4 graphemes

assert(gText.take(3).equal("noe\u0308".byGrapheme));
assert(gText.drop(3).equal("l".byGrapheme));

auto byCodePoint(Range)(Range range) if (isInputRange!Range && is(Unqual!(ElementType!Range) == Grapheme));
Range byCodePoint(Range)(Range range) if (isInputRange!Range && is(Unqual!(ElementType!Range) == dchar));

Lazily transform a range of Graphemes to a range of code points.

Useful for converting the result to a string after doing operations on graphemes.

Acts as the identity function when given a range of code points.

Examples:
import std.array : array;
import std.conv : text;
import std.range : retro;

string s = "noe\u0308l"; // noël

// reverse it and convert the result to a string
string reverse = s.byGrapheme
    .array
    .retro
    .byCodePoint
    .text;

assert(reverse == "le\u0308on"); // lëon

struct Grapheme;

A structure designed to effectively pack characters of a grapheme cluster.

Grapheme has value semantics so 2 copies of a Grapheme always refer to distinct objects. In most actual scenarios a Grapheme fits on the stack and avoids memory allocation overhead for all but quite long clusters.

Example:
import std.algorithm;
string bold = "ku\u0308hn";

// note that decodeGrapheme takes parameter by ref
// slicing a grapheme yields a range of dchar
assert(decodeGrapheme(bold)[].equal("k"));

// the next grapheme is 2 characters long
auto wideOne = decodeGrapheme(bold);
assert(wideOne.length == 2);
assert(wideOne[].equal("u\u0308"));

// the usual range manipulation is possible
assert(wideOne[].filter!isMark.equal("\u0308"));

See also decodeGrapheme, graphemeStride.

const pure nothrow @trusted dchar opIndex(size_t index);
Gets a code point at the given index in this cluster.

pure nothrow @trusted void opIndexAssign(dchar ch, size_t index);
Writes a code point ch at given index in this cluster.

Warning:
Use of this facility may invalidate grapheme cluster, see also Grapheme.valid.

Example:
auto g = Grapheme("A\u0302");
assert(g[0] == 'A');
assert(g.valid);
g[1] = '~'; // ASCII tilde is not a combining mark
assert(g[1] == '~');
assert(!g.valid);

pure nothrow @trusted auto opSlice(size_t a, size_t b);
pure nothrow @trusted auto opSlice();
Random-access range over Grapheme's characters.

Warning:
The range is invalidated when this Grapheme leaves its scope; any attempt to use it afterwards would lead to memory corruption.

const pure nothrow @property @trusted size_t length();
Grapheme cluster length in code points.

ref auto opOpAssign(string op)(dchar ch);
Append character ch to this grapheme.

Warning:
Use of this facility may invalidate grapheme cluster, see also valid.

Example:
auto g = Grapheme("A");
assert(g.valid);
g ~= '\u0301';
assert(g[].equal("A\u0301"));
assert(g.valid);
g ~= "B";
// not a valid grapheme cluster anymore
assert(!g.valid);
// still could be useful though
assert(g[].equal("A\u0301B"));
See also Grapheme.valid below.

ref auto opOpAssign(string op, Input)(Input inp) if (isInputRange!Input && is(ElementType!Input : dchar));
Append all characters from the input range inp to this Grapheme.
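
A minimal sketch of the range overload, appending two combining marks in one step. It assumes the same Grapheme constructor from a string that the examples above use.

```d
import std.algorithm : equal;
import std.uni;

void main()
{
    auto g = Grapheme("a");
    // append combining acute accent and combining dot below at once
    g ~= "\u0301\u0323";
    assert(g.length == 3);
    assert(g[].equal("a\u0301\u0323"));
    // a base character followed by combining marks is still one cluster
    assert(g.valid);
}
```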

bool valid()();
True if this object contains a valid extended grapheme cluster. Decoding primitives of this module always return a valid Grapheme.

Appending to and direct manipulation of a grapheme's characters may render it no longer valid. Certain applications may choose to use Grapheme as a "small string" of any code points and ignore this property entirely.

int sicmp(S1, S2)(S1 str1, S2 str2) if (isForwardRange!S1 && is(Unqual!(ElementType!S1) == dchar) && isForwardRange!S2 && is(Unqual!(ElementType!S2) == dchar));

Does basic case-insensitive comparison of strings str1 and str2. This function uses a simpler comparison rule, thus achieving better performance than icmp. However, keep in mind the warning below.

Warning:
This function only handles 1:1 code point mapping and thus is not sufficient for certain alphabets like German, Greek and a few others.

Example:
assert(sicmp("Август", "авгусТ") == 0);
// Greek also works as long as there is no 1:M mapping in sight
assert(sicmp("ΌΎ", "όύ") == 0);
// things like the following won't get matched as equal
// Greek small letter iota with dialytika and tonos
assert(sicmp("ΐ", "\u03B9\u0308\u0301") != 0);

// while icmp has no problem with that
assert(icmp("ΐ", "\u03B9\u0308\u0301") == 0);
assert(icmp("ΌΎ", "όύ") == 0);

int icmp(S1, S2)(S1 str1, S2 str2) if (isForwardRange!S1 && is(Unqual!(ElementType!S1) == dchar) && isForwardRange!S2 && is(Unqual!(ElementType!S2) == dchar));

Does case-insensitive comparison of str1 and str2. Follows the rules of full case-folding mapping. Unlike sicmp, this includes matching German ß with "ss" as equal, along with other 1:M code point mappings. The cost of icmp being pedantically correct is slightly worse performance.

Example:
assert(icmp("Rußland", "Russland") == 0);
assert(icmp("ᾩ -> \u1F70\u03B9", "\u1F61\u03B9 -> ᾲ") == 0);

@trusted ubyte combiningClass(dchar ch);

Returns the combining class of ch.

Example:
// shorten the code
alias CC = combiningClass;

// combining tilde
assert(CC('\u0303') == 230);
// combining ring below
assert(CC('\u0325') == 220);
// the simple consequence is that "tilde" should be
// placed after a "ring below" in a sequence

enum UnicodeDecomposition: int;
Unicode character decomposition type.

Canonical
Canonical decomposition. The result is a canonically equivalent sequence.

Compatibility
Compatibility decomposition. The result is a compatibility equivalent sequence.

Note:
Compatibility decomposition is a lossy conversion, typically suitable only for fuzzy matching and internal processing.
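
To make the difference between the two types concrete, here is a small sketch using decompose (documented below). The ligature U+FB01 was picked for illustration because it has only a compatibility decomposition, so the canonical form leaves it untouched while the compatibility form loses the fact that it was a ligature.

```d
import std.algorithm : equal;
import std.uni;

void main()
{
    // U+FB01 LATIN SMALL LIGATURE FI: no canonical decomposition
    assert(decompose!Canonical('\uFB01')[].equal("\uFB01"));
    // the compatibility decomposition yields plain "fi" (lossy)
    assert(decompose!Compatibility('\uFB01')[].equal("fi"));
    // U+00C5 (A with ring above) decomposes canonically
    assert(decompose('\u00C5')[].equal("A\u030A"));
}
```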

@trusted dchar compose(dchar first, dchar second);
Try to canonically compose 2 characters. Returns the composed character if they do compose and dchar.init otherwise.

The assumption is that first comes before second in the original text, usually meaning that the first is a starter.

Note:
Hangul syllables are not covered by this function. See composeJamo below.

Example:
assert(compose('A','\u0308') == '\u00C4');
assert(compose('A', 'B') == dchar.init);
assert(compose('C', '\u0301') == '\u0106');
// note that the starter is the first one
// thus the following doesn't compose
assert(compose('\u0308', 'A') == dchar.init);

Grapheme decompose(UnicodeDecomposition decompType = Canonical)(dchar ch);
Returns a full Canonical (by default) or Compatibility decomposition of character ch. If no decomposition is available returns a Grapheme with the ch itself.

Note:
This function also decomposes hangul syllables as prescribed by the standard. See also decomposeHangul for a restricted version that takes into account only hangul syllables but no other decompositions.

Example:
import std.algorithm;
assert(decompose('Ĉ')[].equal("C\u0302"));
assert(decompose('D')[].equal("D"));
assert(decompose('\uD4DC')[].equal("\u1111\u1171\u11B7"));
assert(decompose!Compatibility('¹')[].equal("1"));

@trusted Grapheme decomposeHangul(dchar ch);
Decomposes a Hangul syllable. If ch is not a composed syllable then this function returns a Grapheme containing only ch as is.

Example:
import std.algorithm;
assert(decomposeHangul('\uD4DB')[].equal("\u1111\u1171\u11B6"));

@trusted dchar composeJamo(dchar lead, dchar vowel, dchar trailing = (dchar).init);
Try to compose hangul syllable out of a leading consonant (lead), a vowel and optional trailing consonant jamos.

On success returns the composed LV or LVT hangul syllable.

If either lead or vowel is not a valid hangul jamo of the respective character class, returns dchar.init.

Example:
assert(composeJamo('\u1111', '\u1171', '\u11B6') == '\uD4DB');
// leaving out the trailing consonant (T), or passing any code point
// that is not a trailing consonant, composes an LV-syllable
assert(composeJamo('\u1111', '\u1171') == '\uD4CC');
assert(composeJamo('\u1111', '\u1171', ' ') == '\uD4CC');
assert(composeJamo('\u1111', 'A') == dchar.init);
assert(composeJamo('A', '\u1171') == dchar.init);

enum NormalizationForm: int;
Enumeration type for normalization forms, passed as template parameter for functions like normalize.

NFC
NFD
NFKC
NFKD
Shorthand aliases for values indicating normalization forms.

inout(C)[] normalize(NormalizationForm norm = NFC, C)(inout(C)[] input);
Returns input string normalized to the chosen form. Form C is used by default.

For more information on normalization forms see the normalization section.

Note:
In cases where the string in question is already normalized, it is returned unmodified and no memory allocation happens.

Example:
// any encoding works
wstring greet = "Hello world";
assert(normalize(greet) is greet); // the same exact slice

// An example of a character with all 4 forms being different:
// Greek upsilon with acute and hook symbol (code point 0x03D3)
assert(normalize!NFC("ϓ") == "\u03D3");
assert(normalize!NFD("ϓ") == "\u03D2\u0301");
assert(normalize!NFKC("ϓ") == "\u038E");
assert(normalize!NFKD("ϓ") == "\u03A5\u0301");

bool allowedIn(NormalizationForm norm)(dchar ch);
Tests if dchar ch is always allowed (Quick_Check=YES) in normalization form norm.

Example:
// e.g. Cyrillic is always allowed, so is ASCII
assert(allowedIn!NFC('я'));
assert(allowedIn!NFD('я'));
assert(allowedIn!NFKC('я'));
assert(allowedIn!NFKD('я'));
assert(allowedIn!NFC('Z'));

pure nothrow @safe bool isWhite(dchar c);
Whether or not c is a Unicode whitespace character (general Unicode category: part of C0 (tab, vertical tab, form feed, carriage return, and linefeed characters), Zs, Zl, Zp, and NEL (U+0085)).

pure nothrow @safe bool isLower(dchar c);
Return whether c is a Unicode lowercase character.

pure nothrow @safe bool isUpper(dchar c);
Return whether c is a Unicode uppercase character.
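
A brief sketch of how the two predicates behave, including on code points that have no case at all; the sample characters are arbitrary.

```d
import std.uni;

void main()
{
    assert(isLower('я') && !isUpper('я'));
    assert(isUpper('Я') && !isLower('Я'));
    // digits and caseless scripts are neither lowercase nor uppercase
    assert(!isLower('5') && !isUpper('5'));
    assert(!isLower('\u4E2D') && !isUpper('\u4E2D')); // CJK ideograph
}
```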

pure nothrow @safe dchar toLower(dchar c);
If c is a Unicode uppercase character, then its lowercase equivalent is returned. Otherwise c is returned.

Warning:
Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toLower which takes a full string instead.

pure @trusted void toLowerInPlace(C)(ref C[] s) if (is(C == char) || is(C == wchar) || is(C == dchar));
Converts s to lowercase (by performing Unicode lowercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any uppercase characters, then s is unaltered.

pure @trusted void toUpperInPlace(C)(ref C[] s) if (is(C == char) || is(C == wchar) || is(C == dchar));
Converts s to uppercase (by performing Unicode uppercase mapping) in place. For a few characters the string length may increase after the transformation; in such a case the function reallocates exactly once. If s does not have any lowercase characters, then s is unaltered.
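
A minimal sketch of the in-place variants on a mutable buffer. Since both take the array by ref, the argument has to be an lvalue; the sample text is arbitrary.

```d
import std.uni;

void main()
{
    // .dup gives a mutable copy of the string literal
    char[] s = "Hello Мир".dup;
    toLowerInPlace(s);
    assert(s == "hello мир");
    toUpperInPlace(s);
    assert(s == "HELLO МИР");
}
```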

pure @trusted S toLower(S)(S s) if (isSomeString!S);
Returns a string which is identical to s except that all of its characters are converted to lowercase (by performing Unicode lowercase mapping). If none of the characters of s were affected, then s itself is returned.

pure nothrow @safe dchar toUpper(dchar c);
If c is a Unicode lowercase character, then its uppercase equivalent is returned. Otherwise c is returned.

Warning:
Certain alphabets like German and Greek have no 1:1 upper-lower mapping. Use the overload of toUpper which takes a full string instead.

pure @trusted S toUpper(S)(S s) if (isSomeString!S);
Returns a string which is identical to s except that all of its characters are converted to uppercase (by performing Unicode uppercase mapping). If none of the characters of s were affected, then s itself is returned.
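
The sketch below contrasts the single code point overloads with the string overloads. It assumes a Phobos release in which the string overloads apply the full (1:M) mapping, as the warnings above suggest; on a release limited to 1:1 mappings the second assertion would not hold.

```d
import std.uni;

void main()
{
    // ß has no single code point uppercase form, so it is returned as is
    assert(toUpper('ß') == 'ß');
    // the string overload can expand it (assumes full case mapping)
    assert(toUpper("straße") == "STRASSE");
    assert(toLower("ÀÉÎ") == "àéî");
    // unaffected input is returned as the very same slice
    string s = "already lower";
    assert(toLower(s) is s);
}
```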

pure nothrow @safe bool isAlpha(dchar c);
Returns whether c is a Unicode alphabetic character (Unicode property: Alphabetic).

pure nothrow @safe bool isMark(dchar c);
Returns whether c is a Unicode mark (general Unicode category: Mn, Me, Mc).

pure nothrow @safe bool isNumber(dchar c);
Returns whether c is a Unicode numerical character (general Unicode category: Nd, Nl, No).

pure nothrow @safe bool isPunctuation(dchar c);
Returns whether c is a Unicode punctuation character (general Unicode category: Pd, Ps, Pe, Pc, Po, Pi, Pf).

pure nothrow @safe bool isSymbol(dchar c);
Returns whether c is a Unicode symbol character (general Unicode category: Sm, Sc, Sk, So).
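
As a quick orientation, the sketch below runs the classification predicates listed so far on one sample code point per category. The samples were chosen for illustration only.

```d
import std.uni;

void main()
{
    assert(isAlpha('ж'));            // Cyrillic letter
    assert(isMark('\u0301'));        // combining acute accent, Mn
    assert(isNumber('7'));           // Nd
    assert(isNumber('\u2167'));      // ROMAN NUMERAL EIGHT, Nl
    assert(isPunctuation('\u00AB')); // left-pointing guillemet, Pi
    assert(isSymbol('\u20AC'));      // euro sign, Sc
    // '$' is a currency symbol, not punctuation
    assert(isSymbol('$') && !isPunctuation('$'));
    assert(!isAlpha('7'));
}
```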

pure nothrow @safe bool isSpace(dchar c);
Returns whether c is a Unicode space character (general Unicode category: Zs)

Note:
This doesn't include '\n', '\r', '\t' and other non-space characters. For commonly used less strict semantics see isWhite.
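
The difference between isSpace and isWhite can be sketched as follows; the code points are arbitrary representatives of each group.

```d
import std.uni;

void main()
{
    assert(isSpace(' '));
    assert(isSpace('\u00A0')); // NO-BREAK SPACE, category Zs
    // line feed and tab are control characters (Cc), not Zs
    assert(!isSpace('\n'));
    assert(!isSpace('\t'));
    // isWhite accepts all of them
    assert(isWhite(' ') && isWhite('\u00A0'));
    assert(isWhite('\n') && isWhite('\t'));
}
```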

pure nothrow @safe bool isGraphical(dchar c);
Returns whether c is a Unicode graphical character (general Unicode category: L, M, N, P, S, Zs).

pure nothrow @safe bool isControl(dchar c);
Returns whether c is a Unicode control character (general Unicode category: Cc).

pure nothrow @safe bool isFormat(dchar c);
Returns whether c is a Unicode formatting character (general Unicode category: Cf).

pure nothrow @safe bool isPrivateUse(dchar c);
Returns whether c is a Unicode Private Use code point (general Unicode category: Co).

pure nothrow @safe bool isSurrogate(dchar c);
Returns whether c is a Unicode surrogate code point (general Unicode category: Cs).

pure nothrow @safe bool isSurrogateHi(dchar c);
Returns whether c is a Unicode high surrogate (lead surrogate).

pure nothrow @safe bool isSurrogateLo(dchar c);
Returns whether c is a Unicode low surrogate (trail surrogate).
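
A short sketch of the three surrogate predicates. The values 0xD83D and 0xDE00 are just one example of a lead/trail pair; they are built with a cast on the assumption that the compiler rejects surrogates written as character literals.

```d
import std.uni;

void main()
{
    // build the surrogate code points from their numeric values
    dchar hi = cast(dchar) 0xD83D;
    dchar lo = cast(dchar) 0xDE00;
    assert(isSurrogate(hi) && isSurrogateHi(hi) && !isSurrogateLo(hi));
    assert(isSurrogate(lo) && isSurrogateLo(lo) && !isSurrogateHi(lo));
    assert(!isSurrogate('A'));
}
```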

pure nothrow @safe bool isNonCharacter(dchar c);
Returns whether c is a Unicode non-character i.e. a code point with no assigned abstract character. (general Unicode category: Cn)
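
For the remaining category predicates, a minimal sketch with one sample code point each. The non-character U+FFFE is built with a cast, assuming it may not be accepted as a character literal.

```d
import std.uni;

void main()
{
    assert(isControl('\n'));        // Cc
    assert(isFormat('\u200D'));     // ZERO WIDTH JOINER, Cf
    assert(isPrivateUse('\uE000')); // first code point of the BMP private use area, Co
    assert(isGraphical('a') && !isGraphical('\n'));
    assert(isNonCharacter(cast(dchar) 0xFFFE));
    assert(!isNonCharacter('a'));
}
```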