A Dictionary Trie

If you were given the following words:

rebate
reborn
realize
relief
realizes
redder
red

You will immediately notice that they all share the same prefix.

So now if I asked you to create a data structure to store these words (don't worry about why yet), their shared prefix should get you thinking about trees.

So let us create a tree.



As you note, the algorithm to retrieve the words from this tree is simple:

Retrieval Algorithm
  1. Start at the root and walk down to each end node.
  2. As you walk down, accumulate the string fragments stored in the node by concatenation: "re" + "bate" yields "rebate".
  3. Once you reach a leaf node, you have completed a word.

If we were to program this data structure, it would consist of a single node structure with pointers to other nodes (as children). The following psuedo-code shows this.

Let's continue refining our tree structure. First thing you will note is that the above tree is only one level deep.

This highlights the obvious need to continue generating prefixes down the tree. For instance, the fragments "bate" and "born" share a prefix of a "b", and "alizes" and "alize" share "alize".

So let's grow branches to our tree:
Now this data structure is more efficiently storing prefixes than the previous. But does our algorithm to retrieve words still work? (Hint: no. (oops, that wasn't a hint))

Flawed Retrieval Algorithm
  1. Start at the root and walk down to each end node.
  2. As you walk down, accumulate the string fragments stored in the node by concatenation.
  3. Once you reach a leaf node, you have completed a word.
So what fails here?

Step 3 does. Our new tree has an entire word as a prefix and we would reach a word before we reached a leaf node.

The word "realize" is a prefix to the word "realizes" but our third step in the algorithm would skip it. Same with "red" and "redder".

Solution? Easy. Let's add extra information to each node. We will call this boolean information a word-indicator:
New Retrieval Algorithm
  1. Start at the root and walk down to each end node.
  2. As you walk down, accumulate the string fragments stored in the node by concatenation.
  3. Once you reach a leaf node OR if you hit a word-indicator (represented above as an asterisk), you have completed a word.
Our new node in standard UML:

So that's it. With this node definition and the above algorithm, we have accomplished a lot.

Summary
  1. We have explained the Dictionary Trie data structure.
  2. We have explained a simple algorithm for retrieving words from this trie.
So what are the advantages of this data structure?
  1. Efficient space storage
  2. Word lookup by prefix
Examples of this second advantage abound. When you start typing in your browser and the browser gives you auto-complete, something like the dictionary trie is being used. Code editors with auto-complete probably use tries too, though augmented with semantic information.

I decided to use the idea for a dictionary trie when writing a boggle solver many years ago trying to learn Common Lisp. My problem was to search a random matrix of characters (the boggle board) for words. A na├»ve recursive search starting at each character in the matrix and crawling through the possibilities blows up exponentially. A quick solution to this is to stop crawling for possible words after you have hit a prefix string that does not exist in the dictionary. (Why search for words of the form "ggrxz-" if you know "ggrxz" is not a prefix of any word? "Ggrxzanacious" is a great word though.)

We have shown what a trie is, and how to retrieve words from it. Next is to show how to create one, and to start defining its capabilities (how to add words, delete words, etc).




No comments:

Post a Comment