There are many ways to spend ones free time. I like solving crossword puzzles, and recently I’ve been playing around with Bison and Flex for generating parsers. The former lead me to think of dictionaries and tries, and the latter to interesting applications of code generation. In this article I explain where my weekend went, and describe a wonderful Frankensteinian mix of C switch statements and tries.

As a bit of background, a trie or prefix tree is a data structure often used to store spelling dictionaries. It can be seen as a state machine where each state is a prefix and transitions encode the set of recognized characters after the current prefix. Tries are usually very compact as all words with a common prefix share the same data, and search is fast as it only has to look at a number of nodes roughly proportional to the length of the search key. An “accept state” represents a recognized word in the dictionary.

Illustration of a trie

The illustration above is created by various users on Wikimedia Commons. The numbered nodes are accept states.

Another way to think about tries is a series of nested switch statements in C. Given inputs as null-terminated strings of bytes, states are switch statements and transitions are case labels. Accept states are case labels matching NUL. Here the data is directly embedded as code.

/* Return 1 if key is a greeting, or 0 otherwise. */
int trie(const char *key)
{
    switch (key[0]) {
    case 'h':
        switch (key[1]) {
        case 'i':
            switch (key[2]) {
            case '\0':
                return 1;
            }
            break;
        case 'e':
            switch (key[2]) {
            case 'y':
                switch (key[3]) {
                case '\0':
                    return 1;
                }
                break;
            }
            break;
        }
        break;
    case 'y':
        switch (key[1]) {
        case 'o':
            switch (key[2]) {
            case '\0':
                return 1;
            }
            break;
        }
        break;
    }

    return 0;
}

Any time there is no matching case label, the default behavior is to skip the switch entirely and fall through to return 0. (Edit: Thanks to Reddit user kmdreko for noting that breaks are required here. It was done by the script, but not in this example code.)

So how do switch statements fare with a larger dictionary? I wrote a Python script to generate the C code and ran it on the Swedish language wordlist ss100.txt (converted to UTF-8).

import sys
import string
import itertools

def generate_switches(alphabet, words, prefix=b""):
    level = len(prefix)
    indent = (level + 1) * "  "
    print(f"{indent}switch (w[{level}]) {{")

    # Horribly inefficient, but orders of magnitute faster than compiling the
    # mess it makes
    for symbol in alphabet:
        next_prefix = bytes(itertools.chain(prefix, symbol.to_bytes(1, sys.byteorder)))
        accept = False
        next_words = []
        for word in words:
            if word == next_prefix:
                accept = True
            if word.startswith(next_prefix):
                next_words.append(word)

        if next_words or accept:
            print(f"{indent}case {hex(symbol)}:")
            generate_switches(alphabet, next_words, next_prefix)
            print(f"{indent}break;")

    if prefix in words:
        print(f"{indent}case 0x0:")
        print(f"{indent}return 1;")

    print(f"{indent}}}")

def wrap_in_function(gen_func, name):
    print("/* Generated by triegen.py */")
    print(f"int {name}(const unsigned char *w)")
    print("{")
    gen_func()
    print("  return 0;")
    print("}")

def main():
    words = []
    alphabet = set()
    for line in sys.stdin:
        line = line.strip()
        if line:
            line = bytes(line, "utf-8")
            words.append(line)
            for symbol in line:
                alphabet.add(symbol)

    wrap_in_function(lambda: generate_switches(alphabet, words), "gt_test")

if __name__ == "__main__":
    main()

Running the script on my wordlist yielded a 1.9 million line C file with a single function. Compiling it with Clang at the default optimization level took half a minute and resulted in a 19 MB binary. To contrast, my Vim binary is about a tenth of that size.

Why is it so large? The wordlist used is only 2.5 MB in size! To ensure this was not unique to a particular wordlist, I also tested with the notorious enable1.txt. It yielded similar results:

Wordlist Language Bytes Words States Transitions
ss100.utf8 Swedish 2.5 M 222 K 372 K 593 K
enable1.txt English 1.8 M 173 K 388 K 561 K

The first reason for the huge binary is that by default, Clang generates a lot of unnecessary code. An example that caught my eye is jmp 0, a relative jump with delta 0, i.e. a no-op. It is encoded as five bytes and occurred 663,975 times, and thus accounts for 17% of the file!

Compiling with -O3 reduced the size to 4.1 MB, and with -Oz I managed 2.5 MB. Clang took hours to optimize the code in each case, and no wonder; compilers are not made for million line functions. The code will still kill the instruction cache though, as no line is executed twice.

In the end, I learned some things about compilers, tries and code generation. I didn’t succeed in compacting the wordlist, but at least it’s faster than grepping.