Probably Done Before: May 2010

Purpose
This post is about an algorithm for creating a trie from a set of words.

To do this, I am proposing instead to come up with an algorithm to add one word to an already existing trie, and to treat the creation of a trie as a series of add operations (with an empty trie as the degenerate case).

To understand what a dictionary trie is and what its advantages are, see my previous post (A Dictionary Trie). Or google it.

Onward

Let us use the trie we've already generated from the previous post, and consider adding the following new words:

real

relied

rebates

reb

"reb" is probably not a real word. Just pretend it is and move on.

These words were chosen to highlight all the conditions we could run into when we add a word to an existing trie:

"real" will split a node and extend the branch (with the growth internal).
"relied" is probably the trickiest. It will extend the branch in two directions.
"rebates" will split a node and extend the branch (with the growth at the leaf node and outwards).
"reb" will switch an indicator to TRUE.

Adding "real" (insertion string is prefix of target string)

In this scenario "real" will split a node and extend the branch (with the growth internal).

The above image shows that we are inserting a new node internal to the branch and extending the branch. BE CAREFUL -- we are not just moving the old node down, we have also modified the characters it contains within (this distinction is important when doing functional language modifications).

The above diagram shows my convention of coloring the new nodes grey, and dotting the new edges to the tree graph. A dashed node is an old node with a modified word-fragment.

Algorithm
We enter the "alize" node with our insertion string as "al" (to understand this, realize that we had started at the top node with our insertion string as "real" but as "re" was the word fragment contained in the top node, we removed that prefix from our insertion string and submitted the new insertion string "al" into the sub-node).

Anyways, to repeat, we enter the the "alize" node with our insertion string as "al". We see that our insertion string is the prefix of our target string. At this point we create a new node here. We make "al" the word-fragment of our new node, and we turn its word-indicator to ON. We shift the old node down and change its word-fragment by removing the prefix of our "al" insertion string. Hence, "ize" is the shifted node, still pointing to all its previous children.

Quick Summary:

if ( insertion string is prefix to target string) then "extend branch internally"
   where "extend branch internally" = use the target string as 
                                      a new node's word-fragment, 
                                      and point this new node to the 
                                      old node with shortened word-fragment.

Adding "relied" (insertion string and target string share a prefix)

In this situation "relied" will extend the branch in two directions. This is the trickiest of our scenarios.

Algorithm
Here we enter the node with our insertion string as "lied" and our target string as "lief". Unlike the previous algorithm, the insertion string is not a prefix of the target string. Nor is the converse true, "lief" is not a prefix of "lied". But, we do notice that both the insertion and target string share a prefix.

So now, we create two new nodes, one with the shared prefix ("lie"), the other with the suffix of the insertion string ("d"). We point the shared prefix node to the insertion suffix node and to the old node, making sure to modify the old node to change its word-fragment to the target string suffix ("f").

Quick Summary:

if ( insertion string and target string share prefix) then "create fork"
   where "create fork" = 1) use the target string shared prefix as a new node's word-fragment,

                         2) use the target strings suffix as a new node's word-fragment, now point #1 to this one

                         3) point this new node to the old node with shortened word-fragment.

Adding "rebates" (target string is prefix of insertion string)

In this situation "rebates" will extend the branch (with the growth outwards at the leaf node).

Algorithm
Here we enter the node with "bates" as our insertion string and "bate" as our target string. We notice immediately that the target string is a prefix of the insertion string.

At this point, we take the subtracted suffix of our insertion string "s" and see if there is a child for this. If not, we add it to the graph. We create a new node with a word-fragment of "s" and point the current node to this new node (make sure to turn on its word-indicator).

Adding "reb" (target string is equal to insertion string)

In this "reb" will switch an indicator to TRUE. This is probably the easiest of our scenarios, since nothing what we do modifies the structure of the tree. All we are doing is recognizing that a node that was previously an intermediate non-word node, has now become a word node.

Algorithm
Here we enter the node and see that our target string "b" and our insertion string "b" are the same. All we do is switch the word-indicator to ON if it was not already on. If it was already on, then I think that proves that this word has already been inserted. Right?

Finally (or is it?)

After all these words have been added, and we have:

where this tree shows the transformations that took place.

Remember, my arbitrary legend:

a grey node is brand new with a dotted arrow showing a new edge
a dashed node is a modification to a node (whether changing the word fragment or switching an indicator on)

Let's let this thing cool down, and we get our final tree:

But Are We Done?

At the top of this post we said that creation of a trie should be treated as repeated applications of insertion, and if we want to use these collected algorithms, we will need to do one more thing and that is define a starting trie.

So let's define the degenerate case.

Degenerate Trie (aka Empty Trie, aka the Top Node)

Our degenerate trie should be empty. It should be the prefix of all words but not actually have a word-fragment, (or if you prefer, it's word-fragment should be ""). In any case, this is analogous to how regular expressions have beginning of line characters (^) or beginning of word characters (\b) -- anchors.

The key to using this degenerate case is to make sure that any `isPrefix` operator in any programming language will treat this as a prefix to all words. If not, we can always add an is-top indicator to our data structure and modify our algorithms to check prefixes and whether the node is the top node. We'll discuss that later.

Now that we have defined a degenerate trie, we can see what our first insertion will look like:

Bully!

Our final tree upon repeated applications of insertion will be:

yay!

Next Steps

In our first post on dictionary tries, we discussed 1) what a trie was, 2) how to retrieve words from it, and 3) what its advantages were.

This post focused on how to create that trie from insertions. We got two things at once.

For our next step we could go many different ways. Perhaps we could write delete code? Maybe we need to start implementing this in some programing language? Is there anyway to create an extremely basic algebra for tries (to add and subtract tries from each other)?

If you were given the following words:

rebate

reborn

realize

relief

realizes

redder

red

You will immediately notice that they all share the same prefix.

So now if I asked you to create a data structure to store these words (don't worry about why yet), their shared prefix should get you thinking about trees.

So let us create a tree.

As you note, the algorithm to retrieve the words from this tree is simple:

Retrieval Algorithm

Start at the root and walk down to each end node.
As you walk down, accumulate the string fragments stored in the node by concatenation: "re" + "bate" yields "rebate".
Once you reach a leaf node, you have completed a word.

If we were to program this data structure, it would consist of a single node structure with pointers to other nodes (as children). The following psuedo-code shows this.

Let's continue refining our tree structure. First thing you will note is that the above tree is only one level deep.

This highlights the obvious need to continue generating prefixes down the tree. For instance, the fragments "bate" and "born" share a prefix of a "b", and "alizes" and "alize" share "alize".

So let's grow branches to our tree:

Now this data structure is more efficiently storing prefixes than the previous. But does our algorithm to retrieve words still work? (Hint: no. (oops, that wasn't a hint))

Flawed Retrieval Algorithm

Start at the root and walk down to each end node.
As you walk down, accumulate the string fragments stored in the node by concatenation.
Once you reach a leaf node, you have completed a word.

So what fails here?

Step 3 does. Our new tree has an entire word as a prefix and we would reach a word before we reached a leaf node.

The word "realize" is a prefix to the word "realizes" but our third step in the algorithm would skip it. Same with "red" and "redder".

Solution? Easy. Let's add extra information to each node. We will call this boolean information a word-indicator:

New Retrieval Algorithm

Start at the root and walk down to each end node.
As you walk down, accumulate the string fragments stored in the node by concatenation.
Once you reach a leaf node OR if you hit a word-indicator (represented above as an asterisk), you have completed a word.

Our new node in standard UML:

So that's it. With this node definition and the above algorithm, we have accomplished a lot.

Summary

We have explained the Dictionary Trie data structure.
We have explained a simple algorithm for retrieving words from this trie.

So what are the advantages of this data structure?

Efficient space storage
Word lookup by prefix

Examples of this second advantage abound. When you start typing in your browser and the browser gives you auto-complete, something like the dictionary trie is being used. Code editors with auto-complete probably use tries too, though augmented with semantic information.

I decided to use the idea for a dictionary trie when writing a boggle solver many years ago trying to learn Common Lisp. My problem was to search a random matrix of characters (the boggle board) for words. A naïve recursive search starting at each character in the matrix and crawling through the possibilities blows up exponentially. A quick solution to this is to stop crawling for possible words after you have hit a prefix string that does not exist in the dictionary. (Why search for words of the form "ggrxz-" if you know "ggrxz" is not a prefix of any word? "Ggrxzanacious" is a great word though.)

We have shown what a trie is, and how to retrieve words from it. Next is to show how to create one, and to start defining its capabilities (how to add words, delete words, etc).

Dictionary Trie Manipulation (Insertion)

A Dictionary Trie