Dictionary Trie Manipulation (Insertion)

This post is about an algorithm for creating a trie from a set of words.

To do this, I am proposing instead to come up with an algorithm to add one word to an already existing trie, and to treat the creation of a trie as a series of add operations (with an empty trie as the degenerate case).

To understand what a dictionary trie is and what its advantages are, see my previous post (A Dictionary Trie).  Or google it.

Let us use the trie we've already generated from the previous post, and consider adding the following new words:


"reb" is probably not a real word. Just pretend it is and move on.

These words were chosen to highlight all the conditions we could run into when we add a word to an existing trie:

  1. "real" will split a node and extend the branch (with the growth internal).
  2. "relied" is probably the trickiest. It will extend the branch in two directions.
  3. "rebates" will split a node and extend the branch (with the growth at the leaf node and outwards).
  4. "reb" will switch an indicator to TRUE.
Adding "real"  (insertion string is prefix of target string)

In this scenario "real" will split a node and extend the branch (with the growth internal).

The above image shows that we are inserting a new node internal to the branch and extending the branch.  BE CAREFUL -- we are not just moving the old node down, we have also modified the characters it contains within (this distinction is important when doing functional language modifications).

The above diagram shows my convention of coloring the new nodes grey, and dotting the new edges to the tree graph.  A dashed node is an old node with a modified word-fragment. 

We enter the "alize" node with our insertion string as "al" (to understand this, realize that we had started at the top node with our insertion string as "real" but as "re" was the word fragment contained in the top node, we removed that prefix from our insertion string and submitted the new insertion string "al" into the sub-node).

Anyways, to repeat, we enter the the "alize" node with our insertion string as "al".  We see that our insertion string is the prefix of our target string. At this point we create a new node here.  We make "al" the word-fragment of our new node, and we turn its word-indicator to ON.  We shift the old node down and change its word-fragment by removing the prefix of our "al" insertion string.  Hence, "ize" is the shifted node, still pointing to all its previous children.

Quick Summary:
if ( insertion string is prefix to target string) then "extend branch internally"
   where "extend branch internally" = use the target string as 
                                      a new node's word-fragment, 
                                      and point this new node to the 
                                      old node with shortened word-fragment. 
Adding "relied" (insertion string and target string share a prefix)

In this situation "relied" will extend the branch in two directions.  This is the trickiest of our scenarios.

Here we enter the node with our insertion string as "lied" and our target string as "lief".  Unlike the previous algorithm, the insertion string is not a prefix of the target string.  Nor is the converse true, "lief" is not a prefix of "lied".  But, we do notice that both the insertion and target string share a prefix.

So now, we create two new nodes, one with the shared prefix ("lie"), the other with the suffix of the insertion string ("d"). We point the shared prefix node to the insertion suffix node and to the old node,  making sure to modify the old node to change its word-fragment to the target string suffix ("f").

Quick Summary:
if ( insertion string and target string share prefix) then "create fork"
   where "create fork" = 1) use the target string shared prefix as a new node's word-fragment,
                         2) use the target strings suffix as a new node's word-fragment, now point #1 to this one
                         3) point this new node to the old node with shortened word-fragment. 

Adding "rebates" (target string is prefix of insertion string)
In this situation "rebates" will extend the branch (with the growth outwards at the leaf node).

Here we enter the node with "bates" as our insertion string and "bate" as our target string.  We notice immediately that the target string is a prefix of the insertion string.

At this point, we take the subtracted suffix of our insertion string "s" and see if there is a child for this.  If not, we add it to the graph.  We create a new node with a word-fragment of "s" and point the current node to this new node (make sure to turn on its word-indicator).

Adding "reb" (target string is equal to insertion string)
In this "reb" will switch an indicator to TRUE.  This is probably the easiest of our scenarios, since nothing what we do modifies the structure of the tree.  All we are doing is recognizing that a node that was previously an intermediate non-word node, has now become a word node.

Here we enter the node and see that our target string "b" and our insertion string "b" are the same. All we do is switch the word-indicator to ON if it was not already on.  If it was already on, then I think that proves that this word has already been inserted. Right?

Finally (or is it?)
After all these words have been added, and we have:
where this tree shows the transformations that took place. 

Remember, my arbitrary legend:
  • a grey node is brand new with a dotted arrow showing a new edge
  • a dashed node is a modification to a node (whether changing the word fragment or switching an indicator on)
Let's let this thing cool down, and we get our final tree:

But Are We Done?

At the top of this post we said that creation of a trie should be treated as repeated applications of insertion, and if we want to use these collected algorithms, we will need to do one more thing and that is define a starting trie.

So let's define the degenerate case.

Degenerate Trie (aka Empty Trie, aka the Top Node)

Our degenerate trie should be empty.  It should be the prefix of all words but not actually have a word-fragment, (or if you prefer, it's word-fragment should be "").  In any case, this is analogous to how regular expressions have beginning of line characters (^) or beginning of word characters (\b) -- anchors.

The key to using this degenerate case is to make sure that any `isPrefix` operator in any programming language will treat this as a prefix to all words.  If not, we can always add an is-top indicator to our data structure and modify our algorithms to check prefixes and whether the node is the top node.  We'll discuss that later.

Now that we have defined a degenerate trie, we can see what our first insertion will look like:


Our final tree upon repeated applications of insertion will be:


Next Steps

In our first post on dictionary tries, we discussed 1) what a trie was, 2) how to retrieve words from it, and 3) what its advantages were.

This post focused on how to create that trie from insertions.  We got two things at once.

For our next step we could go many different ways.  Perhaps we could write delete code?  Maybe we need to start implementing this in some programing language?  Is there anyway to create an extremely basic algebra for tries (to add and subtract tries from each other)?  

No comments:

Post a Comment