I think I finally grok it

Sometimes it's so much easier to make your own diagrams.  The arrow represents the "is a" relationship, with labels to specify the kind of relationship.

I left the crazy cousin "homeo" out of the family tree, because he's just too topological.

Clojure Implementation of Dictionary Trie

I finally have Clojure working.
This is the Clojure implementation of the Dictionary Trie previously implemented in Haskell.  You can copy and paste it into a REPL, or load it from a file.  A basic usage goes:
(generate-trie
      ["rebate" "reborn" "realize" "real"])
or for fun
(println (generate-pretty
          ["rebate" "reborn" "realize" "real" "relied"]))
to get the Dot output:
"_"->"r_"[label=" r"]
"r_"->"lr_"[label=" l"]
"r_"->"ar_"[label=" a"]
"r_"->"br_"[label=" b"]
"ar_"->"iar_"[label=" i"]
"br_"->"obr_"[label=" o"]
"br_"->"abr_"[label=" a"]
and copy those results into http://graph.gafol.net/ to see the image.

Some observations from the port:
  1. defmulti can be used to mimic Haskell's constructor/structure-based dispatch.
  2. Haskell has trained me to favor higher-order functions over macros.  Higher-order functions like partial, comp, etc. can do a lot of what you need.
  3. I'm looking forward to Clojure 1.2's defrecord capabilities, as I have become a fan of abstract/algebraic data types.
  4. Clojure has gotten a lot of mileage out of the 80% of data structures we all use (in their immutable form): vector, hashmap, list.  The upcoming defprotocol will let us do the hard work for that last 20%.
  5. As much as I love Haskell, you can't beat a quick (println) for debugging.
  6. It was easier to mentally map the Haskell code to the Clojure code than to re-reason the problem through.
  7. I like recur over hoping for a compiler's TCO.
  8. Types do help debugging.  I ran into a lot of quirky edge-cases at run-time while testing the Clojure version.  It's hard to put my finger on individual anecdotal evidence... more of a "feel thing".
  9. Just as in the Haskell implementation, I don't think there's any great advantage to be gained from using a zipper data structure.  I will try it though, mainly to familiarize myself with zippers.
API Methods
  • generate-trie
  • query-node
  • add-word-to-trie

Implementing the Dictionary Trie: Insertion and Search Algorithms: Haskell

In a previous post, I settled on a data structure for the Dictionary Trie.  In this post, I present the Haskell code I used to implement the search and insertion algorithms.   To understand what motivates the dictionary trie, make sure to read the following posts (including an informal discussion on the algorithms used in the insertion):
  1. What a dictionary trie is.
  2. How we would go about creating a dictionary trie from inserts.
  3. What a basic Haskell data structure for a dictionary trie looks like.
The Data Structure

  1. Algebraic
  2. Parametrically Polymorphic
As I mentioned in my previous post about refining this data structure, the TrieNode is the essential component of the trie.  I have used an enumeration type (NodeType) to indicate the node type rather than a set of booleans, in the belief that Haskell's pattern matching could make the code that much easier to understand.
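For reference, here is the data declaration in question (the same one arrived at in the refinement post, reproduced so this post stands alone):

```haskell
import qualified Data.Map as Map

-- Top: the degenerate root; Word: a node ending a word; NonWord: internal only
data NodeType = Top | Word | NonWord
                deriving (Show, Eq)

data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)
```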
The Insertion Algorithm
The insertion algorithm exposes one public API function:

  • addWordToTrie
which takes a string (actually an array of the parametrically typed value) and the dictionary trie. 
There are several utility functions that I do not export out of the module (none of the module code at the top is shown here).  These utility functions use internal utility data structures that are concerned with the type of match that two strings have towards each other -- the source string is always compared against a target string and there's always one of five different matches:  
  • exact match
  • target is prefix of source
  • source is prefix of target
  • shared prefix
  • nothing in common
In the code below, I have inserted a graphic image as a visual description of the algorithm for each of the four cases that you would actually run into during the insertion.  The utility function insertNode is a direct mapping (and formal encoding) of the informal scenarios discussed in this previous post.  The correspondence between the informally developed algorithms and the actual functions was quite pleasing, and was made possible by the pattern-matching capabilities of the language.
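To make the five match cases concrete, here is a sketch of the classification (the constructor and function names here are my guesses at the module's internals, not the actual code):

```haskell
import Data.List (isPrefixOf)

-- A sketch of the internal MatchType: five nullary constructors,
-- one per way a source string can relate to a target string.
data MatchType = ExactMatch
               | TargetIsPrefixOfSource
               | SourceIsPrefixOfTarget
               | SharedPrefix
               | NothingInCommon
               deriving (Show, Eq)

classifyMatch :: Eq a => [a] -> [a] -> MatchType
classifyMatch source target
  | source == target           = ExactMatch
  | target `isPrefixOf` source = TargetIsPrefixOfSource
  | source `isPrefixOf` target = SourceIsPrefixOfTarget
  | sharedPrefixNonEmpty       = SharedPrefix
  | otherwise                  = NothingInCommon
  where sharedPrefixNonEmpty =
          not (null (takeWhile (uncurry (==)) (zip source target)))
```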

  Exact Match

Target Is Smaller Than Source

Source Is Smaller Than Target

Source and Target Share Same Prefix

The Querying Algorithms

These are the subjects I want to touch on regarding the above code:
  1. Keeping things functional with record syntax
  2. Just Maybe stuff
  3. Potentially too redundant in spots  
  4. my first use of home-grown operators
  5. does pattern matching on record syntax get me that much?
  6. no magic
  7. functionally passing back self
Keeping Things Functional with Record Syntax
In order to make a 'change' in the functional world, what you actually do is make a new value.  For values that are aggregates of others, like our data structure, you need to make sure the old values for the other fields remain the same, so essentially you 'copy' the other values.  I was extremely pleased to see that with record syntax you can code such that a 'modification' to one field of a record produces the overall change while leaving the rest of the fields the same as before.  This required the use of Haskell's pattern matching to name the overall structure/node.
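A minimal sketch of what that looks like (markAsWord is a hypothetical helper for illustration, not a function from the actual module):

```haskell
import qualified Data.Map as Map

data NodeType = Top | Word | NonWord deriving (Show, Eq)

data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

-- 'modify' one field: the record-update syntax builds a brand new node,
-- copying every field we do not mention from the old one
markAsWord :: TrieNode a -> TrieNode a
markAsWord node = node { nodeType = Word }
```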
Just Maybe Stuff
I know that I am not properly understanding the way to deal with Maybe values.  I feel there has to be a better way of extracting the underlying value than the where-clause utility functions I use.  Does this reflect an underlying misapprehension of the philosophy of Maybe values?  Maybe.
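For what it's worth, the standard library's maybe function covers most of what those where-clause extractors do; a sketch (describeChild is a hypothetical example, not from the module):

```haskell
import qualified Data.Map as Map

-- 'maybe' collapses the two cases into one expression: a default for the
-- Nothing case, and a function to apply to the value inside a Just
describeChild :: Char -> Map.Map Char String -> String
describeChild c m = maybe "no child" ("child: " ++) (Map.lookup c m)
```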
Potentially Too Redundant
I had two internal data structures to represent the matching of the source and target strings.  One was purely a description of the matching that occurred (the type MatchType) and consisted of five nullary constructor data types.  The other redundantly captured the state of the match but also encapsulated the prefix-suffix pair that resulted.  I have a strong suspicion I could have done this with one pass.  I will revisit.
First Use of Homegrown Operators
This is an issue of terseness and psychology.  Once I had the algorithms developed and everything working according to plan, I refactored some prefix-oriented functions into operators.  I turned:

  • insertNewChildNode into *->
  • takeTogetherWhile into =><=
In the end, this reduced the camelCaseClutterAndExcessiveJavaLikeNaming, but as these functions were internal there probably was no real gain.
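Defining an operator is no different from defining a function; it just has a symbolic name and is used infix.  Here is a sketch of what I imagine the takeTogetherWhile replacement to look like (this is my guess at its behavior, not the actual code):

```haskell
-- =><= pairs up two lists while corresponding elements are equal --
-- effectively computing the shared prefix as a list of pairs
(=><=) :: Eq a => [a] -> [a] -> [(a, a)]
xs =><= ys = takeWhile (uncurry (==)) (zip xs ys)
```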
Pattern Matching Cost Benefit Analysis
The record syntax (as a data constructor) permits pattern matching on the left-hand side of function definitions or on case expressions.  There were several times I wondered whether the use of the pattern-matching was really worth it.  While it did lower the terseness on the right-hand side (no need to fetch the data out of the record with 'accessor' functions), it did raise the level of noise on the left-hand side.
No Magic
As I've been diving deep into Haskell, I've been itching to try out the magic that gives this language its allure.  I have been attracted to function composition and currying.  I've desired to chain together functions in a point-free style or employ functors or applicative-style idioms.  I found that my code showed little of this magic.  The reasons could be one or both of:

  • I am new to a different paradigm and still thinking in imperative idioms
  • I am dealing with programming-in-the-large, which is really just the massaging, manipulation, and transformation of records.
Functionally Passing Back Self
The utility function insertNode is the heart of the insertion algorithm and is tree-recursive.  As mentioned above, the functional nature of Haskell requires that in tree updates, we are actually doing tree 'copies'  (the underlying implementation may share state to increase performance, but as far as the language semantics go, the values should all be new and copied).  To accomplish this, functions which are tree recursive have to always do the same thing:  
  • they need to pass back the node they are currently operating on as the result of the function
  • and assume that the invocations on themselves (the recursive call) are 'modifications' to the child nodes
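That recurring shape can be distilled into a sketch like this (updateChild is a hypothetical illustration of the pattern, not a function from the module):

```haskell
import qualified Data.Map as Map

data NodeType = Top | Word | NonWord deriving (Show, Eq)

data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

-- the boilerplate: 'modify' one child via the supplied function (in the real
-- code, a recursive call), then pass back a rebuilt copy of the current node
updateChild :: Ord a => a -> (TrieNode a -> TrieNode a) -> TrieNode a -> TrieNode a
updateChild key f node = node { children = Map.adjust f key (children node) }
```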
This grew tedious after a while, and felt like unnecessary boilerplate.  It was stuff like this that drove the development of higher-order functions like fold and map, and I considered exploring the benefit of casting the Trie as a functor, or of implementing a zipper data structure.
I will revisit this.

Another Digression: Tries for Capturing Pronunciation

Those linguist guys came up with an alphabet of symbols to try to capture the totality of human sounds as a sequence of characters.  They called this alphabet IPA (the International Phonetic Alphabet).  Here is the IPA for the word 'antediluvian':

 \ˌan-ti-də-ˈlü-vē-ən, -(ˌ)dī-\

I just had an idea that maybe we can use words of IPA pronunciation in a trie and it might give us some useful stuff.  For instance,  'does one word sound like another?', both at the front and at the end -- 'friendship' and 'phrenology' sound alike near the front (at least, according to me), and 'cone' and 'loan' have a similar ending sound (i.e. they rhyme -- anytime).

I am going to make a lot of assumptions, but as I am not attempting to be a linguist (not today), I can get away with doing some cool stuff while waving my hand with the pronouncement that this is just an approximation.  This has probably been done before -- I am sure someone has done it -- but I am not going to google it; I want to have fun exploring this on my own.

  1. Find a dictionary of IPA words.  (Strictly speaking, I need to find two parallel dictionaries: an IPA dictionary along with the script word -- a mapping from words of an IPA language (ˌan-ti-də-ˈlü-vē-ən, -(ˌ)dī-) to words of a written language (antediluvian).)
  2. Insert these IPA words into two tries: one with the constituent IPA characters from left to right, and one from right to left.  The first trie should tell me 'friendship' and 'phrenology' share a pronunciation prefix -- even though they don't share the same written-word prefix.  The second trie should tell me that 'loan' and 'cone' rhyme (despite having different suffixes).  Here is where my hand-waving should come in handy.
This should be fun, and I am sure I am going to run into a LOT of problems.

    Dictionary Trie Digression: Auto-Generating Those Pretty Graphs

    I am Lazy (with a capital L)
    Some say laziness is a hallmark of a programmer.

    Did you think I hand drew this tree with Visio or Office Draw ?
     Well, no.

    As I started to post a couple of weeks ago, I realized that I needed to show tree structures early and often.  Originally, I was doing something silly like this:
        |    |
        |    |
        |    ---"l"
        |    |
        |    |
        |    ---"r"
    but then blogspot mangled it, and I realized it wasn't as pretty as a graphic like the one above.  I figured to myself that there MUST be a simple domain-specific language for rendering simple graphs/trees from text.  Ideally, something like this:
    a -> b -> c
         b -> d
    should generate something like this:
    Well, guess what? Exactly such a language exists, and the above snippet was used to generate this image.  The language is called Dot, and it is declarative and simple.  A package called Graphviz provides a nice set of free tools for generating images from a Dot file.

    I Continued Being Lazy
    I did not want to install these free tools.  I figured, rightly so, that somewhere, somebody had probably wrapped the Graphviz tool-set with HTTP.  O brave new world of web-services, ajax and SOA!  A little googling and I found my savior:  http://graph.gafol.net/.  And so now, I had everything I needed.  Time to stop being lazy.

    Laziness Tabled: Utility Functions for Rendering Tries to Dot
    So back in a previous post I had come up with my final (for now) data structure for a Trie (as an inductively generated set of nodes).

    Here it is:
    data NodeType = Top|Word|NonWord 
                    deriving (Show,Eq)
    data TrieNode a = TrieNode { wordFragment :: [a], 
                                 children :: (Map.Map a (TrieNode a)), 
                                 nodeType:: NodeType }  deriving (Show,Eq)

    With only this structure in mind, it is possible to write the code to generate the dot
    language. Here is the code:

    data Dot = DotPath {fid::String, tid::String} | 
               DotNode {dotId::String, label::String}
    instance Show Dot where
        show DotPath{fid=from,tid=to@(x:_)} = "\"" ++ from ++ "\"" 
                                              ++  "->"
                                              ++ "\"" ++ to ++ "\""
                                              ++ "[label=\" " ++ ( x : "\"]")
        show DotNode{dotId=nodeId,label=outputLabel} = "\"" ++ nodeId ++ "\"" 
                                                    ++  "[label= \"" ++ outputLabel ++ "\"]" 
    generateDotLabel nodeDatum = case (nodeType nodeDatum) of 
                        Top -> "∅"
                        Word ->  wf ++ "(*)" 
                        NonWord ->   wf 
         where wf = (wordFragment nodeDatum)
    generateDot :: TrieNode Char -> String -> [Dot]
    generateDot nodeDatum path = myDotNode : myPaths ++ myChildrensDots 
        where myDotNode = DotNode{dotId= path,label=(generateDotLabel nodeDatum)}
              myPaths = map generatePathNode $ (Map.keys  (children nodeDatum))
              myChildrensDots = foldr (++) []  $
                                      map generateChildrenDots $ (Map.assocs  (children nodeDatum))
              generatePathNode x = DotPath{fid=path,tid=x:path}
              generateChildrenDots (k,v) = generateDot v (k:path)
    generateDotPretty tree = L.intercalate "\n" $ map show  $ generateDot tree "_"

    When I ran it on the Trie generated by the following words:
    testWords = ["rebate", "rebates", "reb", "reborn", "realize", "realizes",
                 "real", "red", "redder", "relied", "relief"]

    I get this:
    "_"[label= "∅"]
    "_"->"r_"[label=" r"]
    "r_"[label= "re"]
    "r_"->"ar_"[label=" a"]
    "r_"->"br_"[label=" b"]
    "r_"->"dr_"[label=" d"]
    "r_"->"lr_"[label=" l"]
    "ar_"[label= "al(*)"]
    "ar_"->"iar_"[label=" i"]
    "iar_"[label= "ize(*)"]
    "iar_"->"siar_"[label=" s"]
    "siar_"[label= "s(*)"]
    "br_"[label= "b(*)"]
    "br_"->"abr_"[label=" a"]
    "br_"->"obr_"[label=" o"]
    "abr_"[label= "ate(*)"]
    "abr_"->"sabr_"[label=" s"]
    "sabr_"[label= "s(*)"]
    "obr_"[label= "orn(*)"]
    "dr_"[label= "d(*)"]
    "dr_"->"ddr_"[label=" d"]
    "ddr_"[label= "der(*)"]
    "lr_"[label= "lie"]
    "lr_"->"dlr_"[label=" d"]
    "lr_"->"flr_"[label=" f"]
    "dlr_"[label= "d(*)"]
    "flr_"[label= "f(*)"]

    You can copy this into  http://graph.gafol.net/ and see that it generates the image at the top of this post.

    Some Observations

    1. Dot Id's
    2. instances of type classes
    3. pattern matching
    4. generateDot can't be used on arbitrary types (note its type signature) -- or, 'does not play well with arbitrary contained types'
    5. probably not efficient
    6. use of a wrapper data structure
    Dot Id's
    I could not use the word fragment itself as the id; otherwise we would introduce nodes with multiple parents (a DAG rather than a tree).  The Dot language allows me to use any arbitrary sequence of characters as the id, so long as they are surrounded by quotations.  So, I generated the ids from the path down to the node, pushing the character key onto the path as we descended recursively -- initializing the path with a bogus start character of an underscore '_'.  You can verify that the ids are in the reverse order of the descending path.  The word fragments are used, of course, to render the label of the node.

    Instances of Type Classes
    This was the first time I overrode show and purposely put my data structure (the TrieNode) into a type class (namely Show).  As I have read in the literature, type classes are Haskell's unique contribution to the functional community (aside from assembling everything together in one nice package).  Type classes allow for a disciplined framework for ad-hoc polymorphism.  Another disciplined framework for ad-hoc polymorphism is subclassing in Java and overriding a method name.  I might revisit how I override show (with its chain of (++) concatenations), but, for now, this is how I pretty-printed the Dot commands.

    Pattern Matching
    For those following along (not just me), here is where the justification in my earlier post for pattern matching with a contained data type kicks in.  The function generateDotLabel uses a case expression to match against the nullary data constructors of NodeType.  In essence, this is a nice symbolic switch statement.  There are plenty of other pattern matches going on -- I particularly like the deconstruction of a list as (x:xs).  You can see this at work in my definition of show for the DotPath.  Here is where telling left from right comes in handy.  The (x:_) on the left-hand side of the equation is a pattern match, matching x to the first element of the list and using the underscore to disregard the tail.  On the right-hand side of the equation, as I am assembling the string, we see (x : path).  This is not a pattern match.  This is a cons operation.  They look alike.  They are not.  Pay attention to which side of the equation you are on.

    Does Not Play Well with Arbitrary Contained Types
    In an earlier post I stated the desire to make the TrieNode parametric.  I have made it parametric.  See the type parameter 'a' in the data declaration below.
    data NodeType = Top|Word|NonWord 
                    deriving (Show,Eq)
    data TrieNode a = TrieNode { wordFragment :: [a], 
                                 children :: (Map.Map a (TrieNode a)), 
                                 nodeType:: NodeType }  deriving (Show,Eq)
    However when writing the code for generating the Dot, it became clear to me that unless I added type constraints ( Show a => ) to my data declaration, then I could not guarantee that my generating functions could render to Dot.  I enforced this by explicitly declaring the type for the main function for rendering.

    generateDot :: TrieNode Char -> String -> [Dot]
    This is something I plan on revisiting, as it should be possible to render the 'structure' generically while auto-generating the labels.  Gotta think about that.

    Probably Not Efficient
    Yup.  I am generating a list of Dot data types from a recursive operation.  More efficiently, I could have threaded an accumulator through each sub-frame of the call-stack.  I am not sure I am being inefficient; I just suspect it.  The main function is (with the where clause left off):

    generateDot nodeDatum path = myDotNode : myPaths ++ myChildrensDots

    which just looks like it might not be tail-call optimized, what with the concatenation operations and myChildrensDots representing the results of the recursive calls.  On the other hand, I am not sure I understand enough of Haskell's laziness to say whether this is really that bad.
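    To make the accumulator idea concrete, here is the contrast sketched on a toy tree rather than the Trie itself:

```haskell
data Tree = Leaf Int | Branch Tree Tree

-- the shape generateDot currently has: sub-results concatenated with (++)
flatten :: Tree -> [Int]
flatten (Leaf n)     = [n]
flatten (Branch l r) = flatten l ++ flatten r

-- the accumulator version: each frame conses onto the list built so far,
-- so no intermediate lists need to be concatenated
flattenAcc :: Tree -> [Int] -> [Int]
flattenAcc (Leaf n)     acc = n : acc
flattenAcc (Branch l r) acc = flattenAcc l (flattenAcc r acc)
```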

    New Wrapper Data Type
    Originally I had written this code as a series of string concatenation operations as I traversed the tree.  This grew ugly.  I introduced the Dot data structure with its two data constructors, DotNode and DotPath, to capture everything I need (for my cases) in a single Dot output line.  The rendering of these data structures as strings is left to the overridden show instance and a series of cool chained list functions:

    generateDotPretty tree = L.intercalate "\n" $ map show  $ generateDot tree "_"

    Luckily the Dot language is declarative and requires very few sequencing guarantees (i.e. you can put these commands in whatever order).

    Implementing the Dictionary Trie: Refining (Redefining) the Data Structure

    When learning a new programming language, sometimes analysis paralysis occurs.  Having read through many Haskell tutorials, I have become reacquainted with the novice problem (and the psychological confusion) of how to model.  I have learned a new set of features. Now, should I use type classes, parametric polymorphism, sum algebraic data types?  (Mind you, I am only just learning the theory behind some of these terms).

    I am reminded of learning OO with Java, and of not knowing when to use interfaces or abstract classes, or when to favor inheritance over composition. Familiarity and experience eventually affords you an artistic eye (and a mental model of the constraints of that language) to see just exactly what features of the language should be used.  Obviously (being new to Haskell), I have not yet developed this sense.

    Tangent:  I am finding it even more difficult to come up with a mental model in Scala, which combines OO with functional programming.  Scala has a lot of tools: traits, abstract classes, case classes, abstract type members, generic type parameters, singleton objects.  With Clojure, on the other hand, it seems I can only use hashmaps -- (perhaps this is an advantage?)  Although defprotocol and deftype might be a response for named models.

    I Have Cheated
    So the purpose of this post was to show a refinement of the data structure.  I tried to do it through intuition alone (posting simultaneously as I figured things out), but in the end I went ahead and implemented the search and insertion algorithms in order to produce the following refinement.

    Let's review. From the previous post:
    data TrieNode  = TrieNode { wordFragment :: String, children :: Map.Map Char TrieNode,
                                isTop :: Bool, isWord :: Bool} 
    And the refined version:
    data NodeType = Top|Word|NonWord 
                    deriving (Show,Eq)
    data TrieNode a = TrieNode { wordFragment :: [a], 
                                 children :: (Map.Map a (TrieNode a)), 
                                 nodeType:: NodeType }  deriving (Show,Eq)

    Balancing Flexibility and Over-Engineering
    Classic problem.  Time, elegance, maintainability, desire to learn -- all can make something too flexible (over-engineered), or too rigid (under-engineered).

     If we are designing a dictionary trie, why not try to make it abstract and reusable?  Why not spend the extra effort to generify the data and the functions so that it can be used in multiple scenarios?

    If the above paragraph was too abstract itself, consider the following questions.  Should our dictionary trie only be used for prefixes of actual human language words?  Or should we use the word word, abstractly, and permit an arbitrary alphabet, from bits to anything else?  Could we use our dictionary trie to store IP routing information, for instance?

    Here is where I find Haskell's parametric polymorphism very compelling:  I had already written the insertion, and search algorithms with the String-typed data declaration.  I allowed Haskell's type inference to take care of most of my functions inasmuch as I did not declare function types.  Upon deciding to try to be a bit more abstract, I changed all references to String to [a] and all references of Char to a, and it almost worked immediately -- (almost, because I had some debug functions which invoked the functions with strings and therefore allowed the compiler's inference algorithm to cascade the String assumption).

    Horse before the cart.  Let's review the new code
    data NodeType = Top|Word|NonWord 
                    deriving (Show,Eq)
    data TrieNode a = TrieNode { wordFragment :: [a], 
                                 children :: (Map.Map a (TrieNode a)), 
                                 nodeType:: NodeType }  deriving (Show,Eq)

    So what's new here?

    1. Generic data types (parametric polymorphism applied to the type constructor) -- i.e. the use of a type parameter 'a' 
    2. The use of an enumeration type NodeType (here's where I get to apply my jargon : it's a sum type of nullary data constructors) 
    The first point, (the use of a type parameter 'a'), has given us the advantage of genericity.  The second point, (using a new data type to enumerate states of the node), is mainly for clarity but it also gives us a slightly better ability to pattern match in our insert and search code.

    To prove this last point, we will have to start developing the algorithms we discussed in previous posts for search and insertion.

    Implementing the Dictionary Trie: Defining the Data Structures


    In this post, we are going to start implementing the dictionary trie.  To start, we will design the data structure.  Below is a UML-ish representation of a TrieNode.

    Because I want to learn some new (to me) functional languages (Haskell, Clojure, and Scala), and this blog is more for my benefit than anything else, we will use Haskell in this post.  

    I won't go over the benefit of functional languages here.  There are many other posts/articles that do a better job explaining/selling functional languages over other paradigms.

    Quick Review of Trie Node

    We came up with this UML-ish representation of a single node of our data structure back in the very first post.
    1. The "wordFragment" stores the substring of the word at that point in the trie.  For leaf nodes, these word fragments are guaranteed to be suffixes to a word in the dictionary.
    2. The "children" is a set of pointers to other TrieNodes. It is this aspect that allows this single node to become a tree. Do not read too much into the array notation.
    3. The "isWord" indicator is needed to indicate whether a non-leaf node is a word end (although we will use it in leaf nodes as well).  All leaf nodes are guaranteed to be word ends.  
    4. The "isTop" indicator is used only for the special degenerate top-level of the trie.  Please review the second post, Dictionary Trie Insertion, for the explanation of why this node is needed.
    Haskell Implementation First Draft
    data TrieNode  = TrieNode { wordFragment :: String, children :: [TrieNode], 
                                isTop :: Bool, isWord :: Bool} 

    This is absolutely sufficient for implementing everything required.  We have here a Haskell record syntax which provides us some syntactic sugar and some autogenerated functions (loosely akin to "getters" for people coming from an OO background).

    In order to initialize a top node, we would:
    topNode  = TrieNode { wordFragment ="" , children = [] , isTop = True, isWord = False} 

    But let's consider some details.

    Walking Down the Tree
    Remember this tree (from the second post)?

    We can see the arrows leading down from the prefixes to the end nodes. Walking down on the left hand side, we get: "re" + "al" + "ize" + "s" =  "realizes". But there is something missing (in both this diagram and in our Haskell record syntax) in describing how we represent which arrow points to which node.

    We need something more like this:

    which shows that we can access our sub-nodes by assigning the first character in the child node's "wordFragment" to the edge.

    So let's change our code a bit, and instead of using a list to represent the children in our node, we will use a Map (with a capital M, to remove ambiguity with the function 'map').
    import qualified Data.Map as Map
    data TrieNode  = TrieNode { wordFragment :: String, children :: Map.Map Char TrieNode, 
                                isTop :: Bool, isWord :: Bool} 

    There is nothing wrong with a list.  It would just be inelegant to find the appropriate sub-child by searching through the list when a hash lookup based on the first character would get us better performance.
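    With the Map in place, walking down one level becomes a single lookup (findChild is a hypothetical helper for illustration, not part of the eventual module):

```haskell
import qualified Data.Map as Map

data TrieNode = TrieNode { wordFragment :: String,
                           children :: Map.Map Char TrieNode,
                           isTop :: Bool, isWord :: Bool }

-- descend one level: the first character of the remaining insertion
-- string keys directly into the children Map
findChild :: Char -> TrieNode -> Maybe TrieNode
findChild c node = Map.lookup c (children node)
```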

    Dictionary Trie Manipulation (Insertion)

    This post is about an algorithm for creating a trie from a set of words.

    To do this, rather than building the whole trie at once, I am proposing to come up with an algorithm for adding one word to an already existing trie, and to treat the creation of a trie as a series of add operations (with an empty trie as the degenerate case).

    To understand what a dictionary trie is and what its advantages are, see my previous post (A Dictionary Trie).  Or google it.

    Let us use the trie we've already generated from the previous post, and consider adding the following new words: "real", "relied", "rebates", and "reb".

    "reb" is probably not a real word. Just pretend it is and move on.

    These words were chosen to highlight all the conditions we could run into when we add a word to an existing trie:

    1. "real" will split a node and extend the branch (with the growth internal).
    2. "relied" is probably the trickiest. It will extend the branch in two directions.
    3. "rebates" will split a node and extend the branch (with the growth at the leaf node and outwards).
    4. "reb" will switch an indicator to TRUE.
    Adding "real"  (insertion string is prefix of target string)

    In this scenario "real" will split a node and extend the branch (with the growth internal).

    The above image shows that we are inserting a new node internal to the branch and extending the branch.  BE CAREFUL -- we are not just moving the old node down, we have also modified the characters it contains within (this distinction is important when doing functional language modifications).

    The above diagram shows my convention of coloring the new nodes grey, and dotting the new edges to the tree graph.  A dashed node is an old node with a modified word-fragment. 

    We enter the "alize" node with our insertion string as "al" (to understand this, realize that we had started at the top node with our insertion string as "real" but as "re" was the word fragment contained in the top node, we removed that prefix from our insertion string and submitted the new insertion string "al" into the sub-node).

    Anyways, to repeat, we enter the "alize" node with our insertion string as "al".  We see that our insertion string is the prefix of our target string.  At this point we create a new node here.  We make "al" the word-fragment of our new node, and we turn its word-indicator ON.  We shift the old node down and change its word-fragment by removing the prefix of our "al" insertion string.  Hence, "ize" is the shifted node, still pointing to all its previous children.

    Quick Summary:
    if ( insertion string is prefix to target string) then "extend branch internally"
       where "extend branch internally" = use the target string as 
                                          a new node's word-fragment, 
                                          and point this new node to the 
                                          old node with shortened word-fragment. 
    Adding "relied" (insertion string and target string share a prefix)

    In this situation "relied" will extend the branch in two directions.  This is the trickiest of our scenarios.

    Here we enter the node with our insertion string as "lied" and our target string as "lief".  Unlike the previous algorithm, the insertion string is not a prefix of the target string.  Nor is the converse true, "lief" is not a prefix of "lied".  But, we do notice that both the insertion and target string share a prefix.

    So now, we create two new nodes, one with the shared prefix ("lie"), the other with the suffix of the insertion string ("d"). We point the shared prefix node to the insertion suffix node and to the old node,  making sure to modify the old node to change its word-fragment to the target string suffix ("f").

    Quick Summary:
    if ( insertion string and target string share prefix) then "create fork"
       where "create fork" = 1) use the shared prefix as a new node's word-fragment,
                             2) use the insertion string's suffix as a second new 
                                node's word-fragment, and point node #1 to it,
                             3) also point node #1 to the old node, shortening the 
                                old node's word-fragment to the target string's suffix. 

    Adding "rebates" (target string is prefix of insertion string)
    In this situation "rebates" will extend the branch (with the growth outwards at the leaf node).

    Here we enter the node with "bates" as our insertion string and "bate" as our target string.  We notice immediately that the target string is a prefix of the insertion string.

    At this point, we take the suffix left after subtracting the target string from our insertion string ("s") and check whether a child for it already exists.  If not, we add it to the graph: we create a new node with a word-fragment of "s", turn on its word-indicator, and point the current node to this new node.
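    This "extend at the leaf" step can be sketched as (same hypothetical Node shape, my own naming):

```python
class Node:
    """Hypothetical trie node: word-fragment, word-indicator, children."""
    def __init__(self, fragment, is_word=False, children=None):
        self.fragment = fragment
        self.is_word = is_word
        self.children = children if children is not None else []

def extend_at_leaf(node, insertion, target):
    # target ("bate") is a prefix of insertion ("bates")
    suffix = insertion[len(target):]            # the leftover "s"
    for child in node.children:
        if child.fragment == suffix:            # a child already exists
            child.is_word = True
            return
    node.children.append(Node(suffix, True))    # new word node "s"

bate = Node("bate", True)
extend_at_leaf(bate, "bates", "bate")
```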

    Adding "reb" (target string is equal to insertion string)
    In this situation "reb" will simply switch an indicator to TRUE.  This is probably the easiest of our scenarios, since nothing we do modifies the structure of the tree.  All we are doing is recognizing that a node that was previously an intermediate non-word node has now become a word node.

    Here we enter the node and see that our target string "b" and our insertion string "b" are the same.  All we do is switch the word-indicator ON if it was not already on.  If it was already on, the word must have been inserted before, and the trie is left unchanged.
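    Taken together, the four scenarios dispatch inside a single insertion routine.  A Python sketch under the same hypothetical Node shape (mutable nodes here, unlike the immutable Haskell/Clojure versions in the series):

```python
import os

class Node:
    """Hypothetical trie node: word-fragment, word-indicator, children."""
    def __init__(self, fragment="", is_word=False, children=None):
        self.fragment = fragment
        self.is_word = is_word
        self.children = children if children is not None else []

def insert(node, s):
    # s is what remains of the word after the fragments above were consumed
    for i, child in enumerate(node.children):
        prefix = os.path.commonprefix([s, child.fragment])
        if not prefix:
            continue                                # no overlap; keep looking
        if s == child.fragment:                     # equal: flip the indicator
            child.is_word = True
        elif child.fragment.startswith(s):          # extend branch internally
            shifted = Node(child.fragment[len(s):], child.is_word, child.children)
            node.children[i] = Node(s, True, [shifted])
        elif s.startswith(child.fragment):          # extend at the leaf: recurse
            insert(child, s[len(child.fragment):])
        else:                                       # shared prefix: create fork
            shifted = Node(child.fragment[len(prefix):], child.is_word, child.children)
            node.children[i] = Node(prefix, False,
                                    [Node(s[len(prefix):], True), shifted])
        return
    node.children.append(Node(s, True))             # nothing shared: new child

root = Node()                                       # the degenerate (empty) trie
for w in ["rebate", "reborn", "realize", "real", "relied", "reb"]:
    insert(root, w)
```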

    Finally (or is it?)
    After all these words have been added, we have:
    where this tree shows the transformations that took place. 

    Remember, my arbitrary legend:
    • a grey node is brand new with a dotted arrow showing a new edge
    • a dashed node is a modification to a node (whether changing the word fragment or switching an indicator on)
    Let's let this thing cool down, and we get our final tree:

    But Are We Done?

    At the top of this post we said that creation of a trie should be treated as repeated applications of insertion.  If we want to use these collected algorithms, we will need to do one more thing: define a starting trie.

    So let's define the degenerate case.

    Degenerate Trie (aka Empty Trie, aka the Top Node)

    Our degenerate trie should be empty.  It should be the prefix of all words but not actually have a word-fragment (or, if you prefer, its word-fragment should be "").  In any case, this is analogous to how regular expressions have beginning-of-line characters (^) or beginning-of-word characters (\b) -- anchors.

    The key to using this degenerate case is to make sure that the `isPrefix` operator, in whatever programming language we use, treats the empty fragment as a prefix of all words.  If not, we can always add an is-top indicator to our data structure and modify our algorithms to check both prefixes and whether the node is the top node.  We'll discuss that later.
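    In Python, for example, str.startswith already treats the empty string as a prefix of everything, so an empty word-fragment needs no special is-top handling:

```python
def is_prefix(fragment, word):
    # "" is a prefix of every string, including "" itself, so the
    # degenerate (empty) trie works with no extra is-top indicator
    return word.startswith(fragment)

assert is_prefix("", "rebate")   # the top node prefixes every word
assert is_prefix("re", "rebate")
assert not is_prefix("re", "ox")
```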

    Now that we have defined a degenerate trie, we can see what our first insertion will look like:


    Our final tree upon repeated applications of insertion will be:


    Next Steps

    In our first post on dictionary tries, we discussed 1) what a trie was, 2) how to retrieve words from it, and 3) what its advantages were.

    This post focused on how to create that trie from insertions.  We got two things at once: the insertion algorithms themselves, and, by repeated application starting from the degenerate trie, the construction of the entire trie.

    For our next step we could go many different ways.  Perhaps we could write delete code?  Maybe we need to start implementing this in some programming language?  Is there any way to create an extremely basic algebra for tries (to add and subtract tries from each other)?  

    A Dictionary Trie

    If you were given the following words:


    You will immediately notice that they all share the same prefix.

    So now if I asked you to create a data structure to store these words (don't worry about why yet), their shared prefix should get you thinking about trees.

    So let us create a tree.

    As you note, the algorithm to retrieve the words from this tree is simple:

    Retrieval Algorithm
    1. Start at the root and walk down to each end node.
    2. As you walk down, accumulate the string fragments stored in the node by concatenation: "re" + "bate" yields "rebate".
    3. Once you reach a leaf node, you have completed a word.

    If we were to program this data structure, it would consist of a single node structure with pointers to other nodes (as children).  The following pseudo-code shows this.
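    In lieu of that pseudo-code, a minimal Python sketch of such a node (field names are mine; no word-indicator yet at this stage of the discussion):

```python
class Node:
    """Hypothetical trie node: a string fragment plus child pointers."""
    def __init__(self, fragment, children=None):
        self.fragment = fragment          # the string fragment stored here
        self.children = children or []    # pointers to the child nodes

# the one-level tree from above
root = Node("re", [Node("bate"), Node("born"), Node("alize")])
```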

    Let's continue refining our tree structure. First thing you will note is that the above tree is only one level deep.

    This highlights the obvious need to continue generating prefixes down the tree. For instance, the fragments "bate" and "born" share a prefix of "b", and "alizes" and "alize" share "alize".

    So let's grow branches to our tree:
    Now this data structure is more efficiently storing prefixes than the previous. But does our algorithm to retrieve words still work? (Hint: no. (oops, that wasn't a hint))

    Flawed Retrieval Algorithm
    1. Start at the root and walk down to each end node.
    2. As you walk down, accumulate the string fragments stored in the node by concatenation.
    3. Once you reach a leaf node, you have completed a word.
    So what fails here?

    Step 3 does. Our new tree has an entire word as a prefix and we would reach a word before we reached a leaf node.

    The word "realize" is a prefix to the word "realizes" but our third step in the algorithm would skip it. Same with "red" and "redder".

    Solution? Easy. Let's add extra information to each node. We will call this boolean information a word-indicator:
    New Retrieval Algorithm
    1. Start at the root and walk down to each end node.
    2. As you walk down, accumulate the string fragments stored in the node by concatenation.
    3. Once you reach a leaf node OR if you hit a word-indicator (represented above as an asterisk), you have completed a word.
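    The new algorithm can be sketched in Python (a hypothetical Node with an is_word flag standing in for the asterisk):

```python
class Node:
    """Hypothetical trie node: word-fragment, word-indicator, children."""
    def __init__(self, fragment, is_word=False, children=None):
        self.fragment = fragment
        self.is_word = is_word            # the word-indicator (the asterisk)
        self.children = children if children is not None else []

def words(node, prefix=""):
    # walk down, concatenating fragments; emit at indicators and leaves
    prefix += node.fragment
    found = [prefix] if node.is_word or not node.children else []
    for child in node.children:
        found += words(child, prefix)
    return found

trie = Node("re", False, [
    Node("alize", True, [Node("s", True)]),   # "realize" is itself a word
    Node("d", True, [Node("der", True)]),     # likewise "red"
])
```

    Note that "realize" and "red" are emitted by the indicator, not by reaching a leaf, which is exactly what the flawed algorithm missed.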
    Our new node in standard UML:

    So that's it. With this node definition and the above algorithm, we have accomplished a lot.

    1. We have explained the Dictionary Trie data structure.
    2. We have explained a simple algorithm for retrieving words from this trie.
    So what are the advantages of this data structure?
    1. Efficient space storage
    2. Word lookup by prefix
    Examples of this second advantage abound. When you start typing in your browser and the browser gives you auto-complete, something like the dictionary trie is being used. Code editors with auto-complete probably use tries too, though augmented with semantic information.

    I decided to use the idea for a dictionary trie when writing a boggle solver many years ago while trying to learn Common Lisp. My problem was to search a random matrix of characters (the boggle board) for words. A naïve recursive search starting at each character in the matrix and crawling through the possibilities blows up exponentially. A quick solution to this is to stop crawling for possible words after you have hit a prefix string that does not exist in the dictionary. (Why search for words of the form "ggrxz-" if you know "ggrxz" is not a prefix of any word? "Ggrxzanacious" is a great word though.)
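    That pruning test can be sketched as a has-prefix query on the trie (a hypothetical Python helper, not the original Common Lisp):

```python
class Node:
    """Hypothetical trie node: word-fragment, word-indicator, children."""
    def __init__(self, fragment, is_word=False, children=None):
        self.fragment = fragment
        self.is_word = is_word
        self.children = children if children is not None else []

def has_prefix(node, s):
    # True iff some word at or below this node starts with s
    if s.startswith(node.fragment):
        rest = s[len(node.fragment):]
        if not rest:
            return True
        return any(has_prefix(child, rest) for child in node.children)
    return node.fragment.startswith(s)   # s ends inside this fragment

trie = Node("", False,
            [Node("re", False, [Node("bate", True), Node("born", True)])])
```

    A board crawl would then stop as soon as has_prefix returns False for the accumulated string.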

    We have shown what a trie is, and how to retrieve words from it. Next is to show how to create one, and to start defining its capabilities (how to add words, delete words, etc).