Ideal finger 001
Finite state automaton index for /usr/share/dict/words
.
A finite set of known items. Represent them as byte sequences, hereafter words. Recognize them in insipid soups of bytes. Use a device, a finite state automaton. Now build it, naïvely, in OCaml. Hold on this is only a first step.
Finite state automata are directed graphs whose nodes are states and edges labelled arcs. A label is the byte consumed from the input word when the arc is followed. We totally order labels, it is always useful.
module Label =
type t = int (* byte, really *)
let compare : t -> t -> int = Pervasives.compare
end
States can be accepting, whenever they recognize a word. They are determined by their set of outgoing arcs. This set of arcs is just a partial map from labels to target states. We also add a mysterious integer tag to states, it will be useful later, ignore it for now.
module State =
module Lmap = Map.Make (Label)
type t = { accepts : bool; arcs : arcs; tag : int }
and arcs = t Lmap.t
To construct states, we start from a distinguished empty state which has no arcs and does not accept.
val empty : t
let empty = { accepts = false; arcs = Lmap.empty; tag = 0 }
Starting from this input black hole we construct new states, by making them accepting or by repeatedly adding arcs leading to other states.
val accepting : t -> bool -> t
let accepting s accepts = { s with accepts }
val add_arc : t -> Label.t -> t -> t
let add_arc s l t = { s with arcs = Lmap.add l t s.arcs }
For recognizing words we only need two operators on states. The first one to test if a state accepts, in case we end up there with the input exhausted. The second one to determine, given a state and a label, in which target state we end up, if there is one.
val accepts : t -> bool
let accepts s = s.accepts
val find : t -> Label.t -> t option
let find s l = try Some (Lmap.find l s.arcs) with
| Not_found -> None
Like labels we want to be able to compare states and especially test them for equality. We ignore tags. When are two states equal ? When they recognize the same set of words. How do we know ? When they are both either accepting or non accepting and have exactly the same set of arcs leading to states that recognizes the same set of words. Recurse.
val compare : t -> t -> int
let rec compare s0 s1 =
(* Assumes an acyclic graph of states *)
let c = Pervasives.compare s0.accepts s1.accepts in
if c <> 0 then c else
Lmap.compare compare s0.arcs s1.arcs
Finally the following functions allow to fold over the arcs of a state and access its tag or change it.
val fold_arcs : (Label.t -> t -> 'a -> 'a) -> t -> 'a -> 'a
let fold_arcs f s acc = Lmap.fold f s.arcs acc
val tag : t -> int
let tag s = s.tag
val retag : t -> int -> t
let retag s tag = { s with tag }
end
We have states. By definition they completely define the reachable subgraph of the larger automaton they are part of. Hence with this representation a finite state automaton is simply defined by its starting state.
module Fsa = struct
type t = State.t
Recognizing a word is straightforward. We start from the initial state and follow the arcs according to the input byte sequence and see if we end up in a final state.
val accepts : t -> string -> bool
let accepts s w =
let max = String.length w - 1 in
let rec loop k s =
if k > max then State.accepts s else
match State.find s (Char.code w.[k]) with
| None -> false
| Some t -> loop (k + 1) t
in
loop 0 s
Our goal is to build automata for recognizing finite sets of items. For that we start from a distinguished empty automaton which does not accept any byte sequence. Remember the black hole state.
val empty : t
let empty = State.empty
To this empty automaton we want to repeateadly add words to
recognize. To do this, given a new word, we first want to see which
prefix of the word is already recognized by the automaton.
For example if we have an automaton recognizing the word "words"
and we try to add the word "worlds"
we would get to the gray
state below before being stuck, lacking an arc for label l
.
The function below finds that stuck state and returns the reverse list of states we traversed to get to it. The first item of the list is the stuck state itself. The integer returned is the start position of the word's suffix that could not be recognized. If this value equals the length of the word, all needed states are present in the automaton; but we may still need to change the "stuck" (not really in that case) state to accept.
val find_prefix : t -> string -> int * State.t list
let find_prefix s w =
let max = String.length w - 1 in
let rec loop k rev_path =
if k > max then k, rev_path else
let s = List.hd rev_path in
match State.find s (Char.code w.[k]) with
| None -> k, rev_path
| Some t -> loop (k + 1) (t :: rev_path)
in
loop 0 [s]
Now that we know where we are stuck we can just fork a new branch of states from there to recognize the missing suffix. Applying this to our example results in the automaton below.
This is done with two functions. The first one adds the branch to the stuck state. It does this by constructing the branch backwards, from a newly created accepting state to the stuck state. The second one rebuilds the automaton we deconstructed from the stuck state back to the start state.
val add_branch : State.t -> string -> pos:int -> State.t
let add_branch stuck w ~pos =
let max = String.length w - 1 in
let rec loop k next =
let l = Char.code w.[k] in
if k = pos then (State.add_arc stuck l next) else
loop (k - 1) (State.add_arc State.empty l next)
in
if pos > max then State.accepting stuck true else
loop max (State.accepting State.empty true)
val rebuild : string -> pos:int -> State.t list -> t =
let rec rebuild w ~pos rev_path = match rev_path with
| [start] -> start
| t :: s :: rev_path ->
let s = State.add_arc s (Char.code w.[pos]) t in
rebuild w ~pos:(pos - 1) (s :: rev_path)
| [] -> assert false
With these functions it now becomes easy to add new words to an automaton.
val add_word : t -> string -> t
let add_word s w =
let pos, rev_path = find_prefix s w in
let stuck = List.hd rev_path in
let unstuck = add_branch stuck w ~pos in
rebuild w ~pos:(pos - 1) (unstuck :: List.tl rev_path)
However if go back to the automaton with words and worlds, we see
that creating new branches creates states that are equal, i.e. that
recognize the same set of words. In this example, the target states of
d
-labelled arcs all recognize the same set of words, namely
{s}. In fact this algorithm builds a trie where each final state
is unique to each word. It is easy to see that this is very wasteful
and that common word suffixes will never share the same
representation with this algorithm. One way to side-step the issue is
to merge equivalent states after the fact to get to a minimal
automaton like the following one.
Given that our automaton is acyclic by construction, minimizing is not too hard. Just perform a post-order enumeration of the directed acyclic graph and with each state look into a set of states to see if there is not already an equivalent one that exists. If that is the case use that one instead.
module Sset = Set.Make (State)
val minimize : t -> t
let minimize s =
(* Not t.r. depth is bounded by max added word length. *)
let rec minimize exists s =
try exists, Sset.find s exists (* replace *) with
| Not_found ->
let explore_arc l t (exists, s) =
let exists, t = minimize exists t in
exists, State.add_arc s l t
in
let empty = State.(accepting empty (accepts s)) in
let init = exists, empty in
let exists, s = State.fold_arcs explore_arc s init in
Sset.add s exists, s
in
snd (minimize Sset.empty s)
That's it, the first step is over. We can build minimal recognizers for finite sets of items.
Recognizing items is not enough. We want to associate data to items so that they can be queried for that information. Items are also ordered by their lexicographical byte order. We want to be able to perform range queries returning the data for all the items existing between two given items.
Considering these wishes, the unminimized trie structure produced by
the add_word
function retrospectively feels handy: each word ends
with its own accepting state. We could simply store an index value for
the word in the tag of that state and use this to index a table of
associated data. Ranges seem tricky though and tries are a waste of
memory anyways.
Let us go back to our minimal automaton. The problem is that many words end up being recognized with the same final state, so the tag cannot directly be used as an index value. However it turns out there exists a beautiful state tagging scheme applicable to any automaton and hence to the minimal one aswell, that will allow us to recover a unique index value during word recognition.
It is very simple. Given a state, tag it with the number of words that can be recognized by starting from this state. Remember, the set of added words is finite, the automaton is not a general one. It is simpler, acyclic, this number is thus finite.
How can we compute this number ? For a given state, we start with 0 if the state does not accept or 1 if it does and sum for each of its arcs, the number of words accepted by the target state. Recurse. Here again a simple post-order state enumeration — or let us say reconstruction — accumulating accepted word counts in state tags will compute the right numbers.
We define an index data structure as an automaton whose states are tagged with accepted word counts. We build it by minimizing a given automaton and retagging it as defined above — this could of course be done in a single pass.
type index = t
val index : t -> index
let index s =
let s = minimize s in
(* Not t.r. depth is bounded by max added word length.
Computes more than needed, since we don't maintain
the set of visited nodes. *)
let rec retag s =
let explore_arc l t (wcount, s) =
let t = retag t in
(wcount + State.tag t, State.add_arc s l t)
in
let wcount = if State.accepts s then 1 else 0 in
let empty = State.(accepting empty (accepts s)) in
let init = wcount, empty in
let wcount, s = State.fold_arcs explore_arc s init in
State.retag s wcount
in
retag s
A by-product, it is now trivial to determine how many words the automaton accepts.
val index_word_count : index -> int
let index_word_count s = State.tag s
Now gather all your items, order them lexicographically in a list. For each item consider its zero-based position in the list. This gives you a unique index number for an item. It happens that for a word this number can be derived with little cost from the tagged automaton during word acceptance.
Assume you are in a state s
and found the label l
you need to
follow. Consider all the arcs that have a label smaller than l
, had
you taken one of these arcs you would have ended up with a word that
is lexicographically smaller than yours. Summing up the number of
words that are accepted by following each of these arcs, i.e. summing
up their target's tag, and adding 1 if s
itself is an accepting
state gives you the number of elements in the lexicographical list of
items you have skipped by taking the l
-labelled arc. By computing and
accumulating this number in each traversed state during word
recognition you end up with the word's unique position in the
lexicographical list.
val find_index : index -> string -> int option
let find_index s w =
let max = String.length w - 1 in
let rec loop k s i =
if k > max
then (if State.accepts s then Some i else None) else
let l = Char.code w.[k] in
match State.find s l with
| None -> None
| Some t ->
let add_skip l' t i =
if l' < l then (State.tag t) + i else i
in
let i = State.fold_arcs add_skip s i in
loop (k + 1) t (if State.accepts s then (i + 1) else i)
in
loop 0 s 0
Definining the index number as being the position in the lexicographical list makes range queries trivial. Compute the index of the starting range. Compute the index of the end range. You have the index value interval you need to consider.
Here is a wish we did not have. Given and index i
returned by
find_index
find the corresponding word. This means we need to count
i+1
items in the lexicographical list to get to our word. As we
explained above we know how much items we skip in the lexicographical
list by taking or not taking an arc. At each state we need to take the
smallest arc that will allow us to skip enough lexicographically
smaller items to get to our word. Accumulating the labels of the arcs
we follow constructs the word corresponding to the index value.
val find_word : index -> int -> string option
let find_word s i =
let b = Buffer.create 42 in
let rec loop s count =
let count = if State.accepts s then count - 1 else count in
if count = 0 then Some (Buffer.contents b) else
let find_arc l t (found, count as acc) = match found with
| Some _ -> acc
| None ->
let wcount = State.tag t in
if wcount < count then (found, count - wcount) else
(Buffer.add_char b (Char.chr l); Some t, count)
in
match State.fold_arcs find_arc s (None, count) with
| None, _ -> None (* i is out of bounds *)
| Some t, count -> loop t count
in
loop s (i + 1)
end
Given a set of items, its tagged finite state automaton maintains a bijective map between an item and its zero-based position in the lexicographical order of items.
Other words. What do we get here ? A perfect, minimal, order-preserving and invertible hash function for any given finite set of items. Perfect, no collisions. Minimal, maps n items to n consecutive integers. Order-preserving, maps ordered items to ordered hashes. Invertible, give me the hash, I give you back the smoking word.
Throw out the buckets !
And the baby.
This is a naïve implementation. There are two main issues.
Firstly, our finite state automaton is minimal in the sense that it maximizes data sharing among accepted words. But its memory representation is not compact, it uses a huge amount of big, fat, memory pointers. This paper has a few hints on how finite state automata can be represented compactly in memory.
Secondly, building first the trie and only minimize it later can waste a huge amount of memory when you process millions of items. The solution is to construct the minimal automaton on the fly, this paper has an algorithm.
This state numbering scheme for constructing unique indices seems to be described for the first time in this paper.