There are many ways to spend one’s free time. I like solving crossword puzzles, and recently I’ve been playing around with Bison and Flex for generating parsers. The former led me to think of dictionaries and tries, and the latter to interesting applications of code generation. In this article I explain where my weekend went, and describe a wonderful Frankensteinian mix of C switch statements and tries.
As a bit of background, a trie or prefix tree is a data structure often used to store spelling dictionaries. It can be seen as a state machine where each state is a prefix and the transitions encode the set of recognized characters after that prefix. Tries are usually very compact, as all words with a common prefix share the same nodes, and search is fast, as it only has to look at a number of nodes roughly proportional to the length of the search key. An “accept state” represents a recognized word in the dictionary.
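To make this concrete, here is a minimal sketch of a conventional pointer-based trie in C, assuming one child slot per possible byte; the type and field names are illustrative, not taken from the article:

```c
#include <stdbool.h>

/* One trie node: a flag marking an accept state plus one child per byte value. */
struct trie_node {
    bool is_word;                  /* true if some word ends at this node */
    struct trie_node *child[256];  /* transitions, indexed by the next byte */
};

/* Walk one child pointer per input byte; missing children mean rejection. */
static bool trie_lookup(const struct trie_node *root, const char *key)
{
    const struct trie_node *node = root;
    for (const unsigned char *p = (const unsigned char *)key; node && *p; p++)
        node = node->child[*p];
    return node && node->is_word;
}
```

A lookup visits one node per input byte, which is the proportional-to-key-length behavior described above.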
The illustration above was created by various users on Wikimedia Commons. The numbered nodes are accept states.
Another way to think about tries is as a series of nested switch statements in C. Given inputs as null-terminated strings of bytes, states are switch statements and transitions are case labels. Accept states are case labels matching NUL. Here the data is directly embedded as code. Any time there is no matching case label, the default behavior is to skip the switch entirely and fall through to return 0. (Edit: Thanks to Reddit user kmdreko for noting that breaks are required here. The script emitted them, but they were missing from this example code.)
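As a rough sketch of what such nested switches look like, here is a hand-written version for the tiny dictionary {"to", "tea", "ten"}; the function name and exact layout are illustrative, not the output of the script described below:

```c
#include <stdio.h>

/* Each switch is a trie node; each case label is a transition on the next byte.
 * A case matching '\0' is an accept state. Anything that falls off the
 * recognized transitions skips the remaining switches and reaches return 0. */
static int in_dict(const char *s)
{
    switch (s[0]) {
    case 't':
        switch (s[1]) {
        case 'o':
            switch (s[2]) {
            case '\0': return 1;   /* accept "to" */
            }
            break;
        case 'e':
            switch (s[2]) {
            case 'a':
                switch (s[3]) {
                case '\0': return 1;   /* accept "tea" */
                }
                break;
            case 'n':
                switch (s[3]) {
                case '\0': return 1;   /* accept "ten" */
                }
                break;
            }
            break;
        }
        break;
    }
    return 0;   /* no path matched: reject */
}

int main(void)
{
    printf("%d %d %d\n", in_dict("tea"), in_dict("te"), in_dict("toast"));  /* prints: 1 0 0 */
    return 0;
}
```

Note the break after each inner switch: without it, control would fall through into the sibling case labels of the enclosing switch, which is exactly the issue the edit above mentions.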
So how do switch statements fare with a larger dictionary? I wrote a Python script to generate the C code and ran it on the Swedish language wordlist ss100.txt (converted to UTF-8).
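For context, this is roughly how the generated code might be exercised, assuming the script emits a single lookup function with the hypothetical signature int in_dict(const char *s); the driver below is my own sketch, not part of the article’s script:

```c
#include <stdio.h>
#include <string.h>

int in_dict(const char *s);   /* provided by the generated .c file (assumed name) */

/* Read one word per line from stdin and report dictionary membership. */
int main(void)
{
    char line[256];
    while (fgets(line, sizeof line, stdin)) {
        line[strcspn(line, "\n")] = '\0';   /* strip the trailing newline */
        printf("%s: %s\n", line, in_dict(line) ? "yes" : "no");
    }
    return 0;
}
```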
Running the script on my wordlist yielded a 1.9 million-line C file containing a single function. Compiling it with Clang at the default optimization level took half a minute and resulted in a 19 MB binary. By contrast, my Vim binary is about a tenth of that size.
Why is it so large? The wordlist used is only 2.5 MB in size! To ensure this was not unique to a particular wordlist, I also tested with the notorious enable1.txt. It yielded similar results:
| Wordlist | Language | Size | Words | Nodes | Cases |
| --- | --- | --- | --- | --- | --- |
| ss100.utf8 | Swedish | 2.5 M | 222 K | 372 K | 593 K |
| enable1.txt | English | 1.8 M | 173 K | 388 K | 561 K |
The first reason for the huge binary is that, by default, Clang generates a lot of unnecessary code. An example that caught my eye is a relative jump with delta 0, i.e. a no-op. It is encoded as five bytes and occurred 663,975 times, so those jumps alone account for roughly 3.3 MB, or about 17% of the file!
-O3 reduced the size to 4.1 MB, and with -Oz I managed 2.5 MB. Clang took hours to optimize the code in each case, and no wonder; compilers are not made for million-line functions. The code will still kill the instruction cache though, as no line is executed twice.
In the end, I learned some things about compilers, tries and code generation. I didn’t succeed in compacting the wordlist, but at least it’s faster than grepping.