Probably Done Before: June 2010

Implementing the Dictionary Trie: Insertion and Search Algorithms: Haskell

In a previous post, I settled on a data structure for the Dictionary Trie. In this post, I present the Haskell code I used to implement the search and insertion algorithms. To understand what motivates the dictionary trie, make sure to read the following posts (including an informal discussion on the algorithms used in the insertion):

The Data Structure

Highlights:

Algebraic
Parametrically Polymorphic

As I had mentioned on my previous post about refining this data structure, the TrieNode is the essential component to the tree. I have used an enumeration type (NodeType) to indicate the node type rather than a set of booleans, in the belief that Haskell's pattern matching could make the code that much easier to understand.

The Insertion Algorithm
The insertion algorithm exposes one public API function:

addWordToTrie

which takes a string (actually an array of the parametrically typed value) and the dictionary trie.

There are several utility functions that I do not export out of the module (none of the module code at the top is shown here). These utility functions use internal utility data structures that are concerned with the type of match that two strings have towards each other -- the source string is always compared against a target string and there's always one of five different matches:

exact match
target is prefix of source
source is prefix of target
shared prefix
nothing in common

In the below code, I have inserted a graphic image as a visual description of the algorithm for the four different cases that you would actually run into during the insertion. The utility function insertNode is a direct mapping (and formal encoding) of the informal scenarios discussed in this previus post. The correspondence to the informally developed algorithms and the actual functions was quite pleasing, and was made possible by the pattern-matching capabilities of the language

Exact Match

Target Is Smaller Than Source

Source Is Smaller Than Target

Source and Target Share Same Prefix

The Querying Algorithms

Discussion
These are the subjects I want to touch on in my creation of the above code:

Keeping things functional with record syntax
Just Maybe stuff
Potentially too redundant in spots
my first use of home-grown operators
does pattern matching on record syntax get me that much?
no magic
functionally passing back self

Keeping Thing Functional with Record Syntax

In order to make a 'change' in the functional world, what you have to do is actually make a new value. For values that are aggregates of others, like our data structure, you need to make sure that the old values for the other stuff remains the same, so essentially you 'copy' the other values. I was extremely pleased to see that with the record syntax, you could code such that a 'modification' to one field in the record affected the overall change leaving the rest of the fields the same as previous. This required the use of Haskell's pattern-matching to name the overall structure/node.

Just Maybe Stuff

I know that I am not properly understanding the way for dealing with Maybe values. I feel that there has to be a better way for extracting the underlying value than the where-clause utility functions I use. Does this reflect an underlying misapprehension of the pholosophy of Maybe values? Maybe.

Potentially Too Redundant

I had two internal data structures to represent the matching of the source and target strings. One was purely a description of the matching that occurred (the type MatchType) and consisted of five nullary constructor data types. The other redundantly captured the state of the match but also encapsulated the prefix-suffix pair that resulted. I have a strong suspicion I could have done this with one pass. I will revisit.

First Use of Homegrown Operators

This is an issue of terseness and psychology. Once I had the algorithms developed and everything working according to plan, I refactored some prefix-oriented functions into operators. I turned:

insertNewChildNode into *->
takeTogetherWhile into =><=

In the end, this reduced the camelCaseClutterAndExcessinveJavaLikeNaming™, but as these functions were internal there probably was no real gain.

Pattern Matching Cost Benefit Analysis

The record syntax (as a data constructor) permits pattern matching on the left-hand side of function definitions or on case expressions. There were several times I wondered whether the use of the pattern-matching was really worth it. While it did lower the terseness on the right-hand side (no need to fetch the data out of the record with 'accessor' functions), it did raise the level of noise on the left-hand side.

No Magic

As I've been diving deep into Haskell, I've been itching to try out the magic that provides the allure to this language. I have been attracted to function composition and currying. I've desired to chain together functions in a point-free style or employ functors or applicative style idioms. I found that my code showed little of this magic. The reasons could be (and/or):

I am new to a different paradigm and still thinking idiomatically imperative
I am dealing with programming in the large which is really just massaging, manipulation, and transformation of record syntaxes.

Functionally Passing Back Self

The utility function insertNode is the heart of the insertion algorithm and is tree-recursive. As mentioned above, the functional nature of Haskell requires that in tree updates, we are actually doing tree 'copies' (the underlying implementation may share state to increase performance, but as far as the language semantics go, the values should all be new and copied). To accomplish this, functions which are tree recursive have to always do the same thing:

they need to pass back the node they are currently operating on as the result of the function
and assume that the invocations on themselves (the recursive call) are 'modifications' to the child nodes

This grew tedious after a while, and felt like unnecessary boilerplate. It was stuff like this that drew the development for higher-order functions like fold and map, and I considered exploring the benefit of casting the Trie as a functor, or of implementing a zipper data structure.

I will revisit this.

Another Digression: Tries for Capturing Pronunciation

Those linguist guys came up with an alphabet of symbols to try to capture the totality of human sounds as a sequence of characters. They called this alphabet IPA. Here is the IPA for the word 'antediluvian'

\ˌan-ti-də-ˈlü-vē-ən, -(ˌ)dī-\

I just had an idea that maybe we can use words of IPA pronunciation in a trie and it might give us some useful stuff. For instance, 'does one word sound like another?', both at the front and at the end -- 'friendship' and 'phrenology' sound alike near the front (at least, according to me), and 'cone' and 'loan' have a similar ending sound (i.e. they rhyme -- anytime).

I am going to make a lot of assumptions, but as I am not attempting to be a linguist (not today), I can get away with doing some cool stuff while waving my hand with the pronouncement that this is just an approximation. The title of this blog is probably done before. I am sure that someone has done this before, but I am not going to google it -- I want to have fun exploring this on my own.

Plan

Find a dictionary of IPA words. (Strictly speaking I need to find two parallel dictionaries: an IPA dictionary along with the script word: a mapping from words of an IPA language (ˌan-ti-də-ˈlü-vē-ən, -(ˌ)dī-) to words of a written language (antediluvian).
Insert these IPA words into two tries: one with the constituent IPA characters from left to right, and one right to left. The first tree should tell me 'friendship' and 'phrenology' share a pronunciation prefix -- even though they don't share the same written-word prefix. The second trie should tell me that 'loan' and 'cone' rhyme (despite having different suffixes). Here is where my hand-waving should come in handy.

This should be fun, and I am sure I am going to run into a LOT of problems.

Dictionary Trie Digression: Auto-Generating Those Pretty Graphs

I am Lazy (with a capital L)
Some say laziness is a hallmark of a programmer.

Did you think I hand drew this tree with Visio or Office Draw ?

Well, no.

As I started to post a couple of weeks ago, I realized that I needed to show tree structures early and often. Originally, I was doing something silly like this:

---"r"
    |
    |
    ---"ea"
    |    |
    |    |
    |    ---"l"
    |    |
    |    |
    |    ---"r"
    |
    |
    ---"ock"

but then blogspot mangled it, and I realized it wasn't as pretty as a graphic like the one above. I figured to myself that there MUST be a simple domain specific language for rendering simple graphs/trees based on a text. Ideally, something like this:

a -> b -> c
     b -> d

should generate something like this:

Well, guess what? This is exactly a language, and the above snippet was used to generate this image. The language is called Dot and is declarative and simple. A package called Graphviz provides a nice set of free tools for generating images from a Dot file.

I Continued Being Lazy
I did not want to install these free tools. I figured, rightly so, that somewhere, somebody had probably wrapped the Graphviz tool-set with HTTP. O brave new world of web-services, ajax and SOA! A little googling and I found my savior: http://graph.gafol.net/. And so now, I had everything I needed. Time to stop being lazy.

Laziness Tabled: Utility Functions for Rendering Tries to Dot
So back in a previous post I had come up with my final (for now) data structure for a Trie (as an inductively generated set of nodes).

Here it is:


data NodeType = Top|Word|NonWord 
                deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], 
                             children :: (Map.Map a (TrieNode a)), 
                             nodeType:: NodeType }  deriving (Show,Eq)

With only this structure in mind, it is possible to write the code to generate the dot
language. Here is the code:


data Dot = DotPath {fid::String, tid::String} | 
           DotNode{dotId::String,label::String}

instance Show Dot where
    show DotPath{fid=from,tid=to@(x:_)} = "\"" ++ from ++ "\"" 
                                          ++  "->"
                                          ++ "\"" ++ to ++ "\""
                                          ++ "[label=\" " ++ ( x : "\"]")
    show DotNode{dotId=nodeId,label=outputLabel} = "\"" ++ nodeId ++ "\"" 
                                                ++  "[label= \"" ++ outputLabel ++ "\"]" 

generateDotLabel nodeDatum = case (nodeType nodeDatum) of 
                    Top -> "∅"
                    Word ->  wf ++ "(*)" 
                    NonWord ->   wf 
     where wf = (wordFragment nodeDatum)

generateDot :: TrieNode Char -> String -> [Dot]
generateDot nodeDatum path = myDotNode : myPaths ++ myChildrensDots 
    where myDotNode = DotNode{dotId= path,label=(generateDotLabel nodeDatum)}
          myPaths = map generatePathNode $ (Map.keys  (children nodeDatum))
          myChildrensDots = foldr (++) []  $

                                  map generateChildrenDots $ (Map.assocs  (children nodeDatum))
          generatePathNode x = DotPath{fid=path,tid=x:path}
          generateChildrenDots (k,v) = generateDot v (k:path)
          

generateDotPretty tree = L.intercalate "\n" $ map show  $ generateDot tree "_"

When I ran it on the Trie generated by the following words:

testWords = ["rebate",
             "reborn",
             "realize",
             "real",
             "relied",
             "rebates",
             "reb",
             "relief",
             "realizes",
             "redder",
             "red"]

I get this:

"_"[label= "∅"]
"_"->"r_"[label=" r"]
"r_"[label= "re"]
"r_"->"ar_"[label=" a"]
"r_"->"br_"[label=" b"]
"r_"->"dr_"[label=" d"]
"r_"->"lr_"[label=" l"]
"ar_"[label= "al(*)"]
"ar_"->"iar_"[label=" i"]
"iar_"[label= "ize(*)"]
"iar_"->"siar_"[label=" s"]
"siar_"[label= "s(*)"]
"br_"[label= "b(*)"]
"br_"->"abr_"[label=" a"]
"br_"->"obr_"[label=" o"]
"abr_"[label= "ate(*)"]
"abr_"->"sabr_"[label=" s"]
"sabr_"[label= "s(*)"]
"obr_"[label= "orn(*)"]
"dr_"[label= "d(*)"]
"dr_"->"ddr_"[label=" d"]
"ddr_"[label= "der(*)"]
"lr_"[label= "lie"]
"lr_"->"dlr_"[label=" d"]
"lr_"->"flr_"[label=" f"]
"dlr_"[label= "d(*)"]
"flr_"[label= "f(*)"]

You can copy this into http://graph.gafol.net/ and see that it generates the image at the top of this post.

Some Observations

Dot Id's
instances of type classes
pattern matching
generate dot can't be used on any types (note the type) -- or 'does not play well with arbitrarily contained types'
probably not efficient
use of a wrapper data structure

Dot Id's

I could not use the word fragment itself as the id, otherwise we would introduce nodes with multiple parents (an acyclic graph). The Dot language allows me to use any arbitrary sequence of characters as the id, so long as they are surrounded by quotations. So, I generated the ids from the path down to the node, pushing the character key onto the path as we descended recursively -- initializing the path with a bogus start character of an underscore '_'. You can verify that the id's are in the reverse order of the descending path. The word fragments are used, of course, to render the label of the node.

Instances of Type Classes

This was the first time I overrode show and purposely put my data structure (the TrieNode) into a type class (namely Show). As I have read the literature, type classes are Haskell's unique contribution to the functional community, (aside from assembling everything together in one nice package). Type classes allow for a disciplined framework for ad-hoc polymorphism. Another disciplined framework for ad-hoc polymorphism is subclassing in Java and overriding a method name. I might revisit overriding show ( using the (++) concatenation operator), but, for now, this is how I pretty printed the Dot commands.

Pattern Matching

For those following along (not just me), that here is the justification in my earlier post for pattern matching with a contained data type kicks in. The function generateDotLabel uses a case expression to match against the nullary data constructors of NodeType. In essence, this is a nice symbolic switch statement. There are plenty of other pattern matches going on -- I particularly like the deconstruction of a List as (x:xs). You can see this at work in my definition of show for the DotPath. Here is where understanding left from right comes in handy. The (x:_) on the left hand side of the equation is a pattern match, matching the x to the first element of the list and using the underscore to disregard the tail. On the right hand side of the equation, as I am assembling the string, we see (x: path). This is not a pattern match. This is an insertion operation. They look alike. They are not. Pay attention to what side of the equation you are on.

Does Not Play Well with Arbitrary Contained Types

In an earlier post I stated the desire to make the TrieNode parametric. I have made it parameteric. See the type parameter 'a' in the data declaration below.


data NodeType = Top|Word|NonWord 
                deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], 
                             children :: (Map.Map a (TrieNode a)), 
                             nodeType:: NodeType }  deriving (Show,Eq)

However when writing the code for generating the Dot, it became clear to me that unless I added type constraints ( Show a => ) to my data declaration, then I could not guarantee that my generating functions could render to Dot. I enforced this by explicitly declaring the type for the main function for rendering.

generateDot :: TrieNode Char -> String -> [Dot]

This is something I plan on revisiting as it is possible to render the 'structure' but auto-generate the labels. Gotta think about that.

Probably Not Efficient
Yup. I am generating an array of Dot data types from a recursive operation. More efficiently I should have had an accumulator that enters each sub-frame of the call-stack. I am not sure I am inefficient. I just suspect it. The main function is (the where clause is left off):

generateDot nodeDatum path = myDotNode : myPaths ++ myChildrensDots

which just looks like it might not be TCO, what with the concatenation operations and myChildrensDots representing the results of the recursive calls. On the other hand I am not sure I understand enough of Haskell's laziness to say whether this is really that that bad.

New Wrapper Data Type
Originally I had written this code as a series of string concatenation operations as I traversed the tree. This grew ugly. I introduced the Dot data structure with its two data constructors DotNode and DotPath to capture everything I need (for my cases) in a single Dot output line. The process of rendering these data structure as strings is left for the overridden show instance and for a series of cool chained list functions:

generateDotPretty tree = L.intercalate "\n" $ map show  $ generateDot tree "_"

Luckily the Dot language is declarative and requires very little sequencing guarantees (i.e. you can put these commands in whatever order).

Implementing the Dictionary Trie: Refining (Redefining) the Data Structure

When learning a new programming language, sometimes analysis paralysis occurs. Having read through many Haskell tutorials, I have become reacquainted with the novice problem (and the psychological confusion) of how to model. I have learned a new set of features. Now, should I use type classes, parametric polymorphism, sum algebraic data types? (Mind you, I am only just learning the theory behind some of these terms).

I am reminded of learning OO with Java, and of not knowing when to use interfaces or abstract classes, or when to favor inheritance over composition. Familiarity and experience eventually affords you an artistic eye (and a mental model of the constraints of that language) to see just exactly what features of the language should be used. Obviously (being new to Haskell), I have not yet developed this sense.

Tangent: I am finding it even more difficult to come up with a mental model in Scala, which combines OO with functional programming. Scala has a lot of tools: traits, absract classes, case classes, abstract type members, generic type parameters, singleton objects. With Clojure, on the other hand, it seems I can only use hashmaps -- (perhaps this is an advantage?) Although defprotocol and deftype might be a response for named models.

I Have Cheated
So the purpose of this post was to show a refinement of the data structure. I tried to do it through intuition alone (posting simultaneously as I figured things out), but in the end I went ahead and implemented the search and insertion algorithms in order to produce the following refinement.

Let's review. From the previous post:


data TrieNode  = TrieNode { wordFragment :: String, children :: Map.Map Char TrieNode
                            isTop :: Bool, isWord :: Bool}

Refinement:


data NodeType = Top|Word|NonWord 
                deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], 
                             children :: (Map.Map a (TrieNode a)), 
                             nodeType:: NodeType }  deriving (Show,Eq)

Balancing Flexibility and Over-Engineering
Classic problem. Time, elegance, maintainability, desire to learn -- all can make something too flexible (over-engineered), or too rigid (under-engineered).

If we are designing a dictionary trie, why not try to make it abstract and reusable? Why not spend the extra effort to generify the data and the functions so that it can be used in multiple scenarios?

If the above paragraph was too abstract itself, consider the following questions. Should our dictionary trie only be used for prefixes of actual human language words? Or should we use the word word, abstractly, and permit an arbitrary alphabet, from bits to anything else? Could we use our dictionary trie to store IP routing information, for instance?

Here is where I find Haskell's parametric polymorphism very compelling: I had already written the insertion, and search algorithms with the String-typed data declaration. I allowed Haskell's type inference to take care of most of my functions inasmuch as I did not declare function types. Upon deciding to try to be a bit more abstract, I changed all references to String to [a] and all references of Char to a, and it almost worked immediately -- (almost, because I had some debug functions which invoked the functions with strings and therefore allowed the compiler's inference algorithm to cascade the String assumption).

Horse before the cart. Let's review the new code


data NodeType = Top|Word|NonWord 
                deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], 
                             children :: (Map.Map a (TrieNode a)), 
                             nodeType:: NodeType }  deriving (Show,Eq)

So what's new here?

Generic data types (parametric polymorphism applied to the type constructor) -- i.e. the use of a type parameter 'a'
The use of an enumeration type NodeType (here's where I get to apply my jargon : it's a sum type of nullary data constructors)

The first point, (the use of a type parameter 'a'), has given us the advantage of genericity. The second point, (using a new data type to enumerate states of the node), is mainly for clarity but it also gives us a slightly better ability to pattern match in our insert and search code.

To prove this last point, we will have to start developing the algorithms we discussed in previous posts for search and insertion.

Implementing the Dictionary Trie: Defining the Data Structures

Purpose

In this post, we are going to start implementing the dictionary trie. To start this we will design the data structure. Below is a UML-ish representation of a TrieNode

Because I want to learn some new (to me) functional languages (Haskell, Clojure, and Scala), and this blog is more for my benefit than anything else, we will use Haskell in this post.

I won't go over the benefit of functional languages here. There are many other posts/articles that do a better job explaining/selling functional languages over other paradigms.

Quick Review of Trie Node

We came up with this UML'is representation of a single node of our data structure back in the very first post.

The "wordFragment" stores the substring of the word at that point in the trie. For leaf nodes, these word fragments are guaranteed to be suffixes to a word in the dictionary.
The "children" is a set of pointers to other TrieNodes. It is this aspect that allows this single node to become a tree. Do not read too much into the array notation.
The "isWord" indicator is needed to indicate if a non-leaf tree is a word end (although we will use it in leaf nodes as well). All leaf nodes are guaranteed to be word ends.
The "isTop" indicator is used only for the special degenerate top-level of the trie. Please review the second post Dictionary Trie Insertion for the explanation of why this trie is needed.

Haskell Implementation First Draft


data TrieNode  = TrieNode { wordFragment :: String, children :: [TrieNode] 
                            isTop :: Bool, isWord :: Bool}

This is absolutely sufficient for implementing everything required. We have here a Haskell record syntax which provides us some syntactic sugar and some autogenerated functions (loosely akin to "getters" for people coming from an OO background).

In order to initialize a top node, we would:


topNode  = TrieNode { wordFragment ="" , children = [] , isTop = True, isWord = False}

But let's consider some details.

Walking Down the Tree
Remember this tree (from the second post)?

We can see the arrows leading down from the prefixes to the end nodes. Walking down on the left hand side, we get: "re" + "al" + "ize" + "s" = "realizes". But there is something missing (in both this diagram and in our Haskell record syntax) in describing how we represent which arrow points to which node.

We need something more like this:

which shows that we can access our sub-nodes by assigning the first character in the child node's "wordFragment" to the edge.

So let's change our code a bit, and instead of using a a list to represent the children in our node, we will use a Map (with a capital M, to remove ambiguity with the function 'map').


import qualified Data.Map as Map

data TrieNode  = TrieNode { wordFragment :: String, children :: Map.Map Char TrieNode
                            isTop :: Bool, isWord :: Bool}

There is nothing wrong with a list. It would be just inelegant to find the appropriate sub-child by searching through the list, when a hash lookup based on the first character would get us better performance.