Implementing the Dictionary Trie: Refining (Redefining) the Data Structure

When learning a new programming language, analysis paralysis sometimes sets in. Having read through many Haskell tutorials, I have become reacquainted with the novice's problem (and the psychological confusion) of how to model a problem. I have learned a new set of features. Now, should I use type classes, parametric polymorphism, sum algebraic data types? (Mind you, I am only just learning the theory behind some of these terms.)

I am reminded of learning OO with Java, and of not knowing when to use interfaces or abstract classes, or when to favor inheritance over composition. Familiarity and experience eventually afford you an artistic eye (and a mental model of the constraints of the language) for seeing exactly which of its features should be used. Obviously, being new to Haskell, I have not yet developed this sense.

Tangent: I am finding it even more difficult to come up with a mental model in Scala, which combines OO with functional programming. Scala has a lot of tools: traits, abstract classes, case classes, abstract type members, generic type parameters, singleton objects. With Clojure, on the other hand, it seems I can only use hashmaps (perhaps this is an advantage?), although defprotocol and deftype might be an answer for named models.

I Have Cheated
So the purpose of this post was to show a refinement of the data structure. I tried to arrive at it through intuition alone (posting as I figured things out), but in the end I went ahead and implemented the search and insertion algorithms in order to produce the following refinement.

Let's review. From the previous post:

data TrieNode = TrieNode { wordFragment :: String,
                           children :: Map.Map Char TrieNode,
                           isTop :: Bool, isWord :: Bool }
Refinement:

data NodeType = Top | Word | NonWord
                deriving (Show, Eq)
data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)
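To make the refined declaration concrete, here is a small hand-built example (the names catNode and root are mine, purely for illustration):

```haskell
import qualified Data.Map as Map

data NodeType = Top | Word | NonWord
                deriving (Show, Eq)
data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

-- A two-node trie holding the single word "cat":
-- the Top node has one child, reached through the key 'c'.
catNode :: TrieNode Char
catNode = TrieNode { wordFragment = "cat",
                     children = Map.empty,
                     nodeType = Word }

root :: TrieNode Char
root = TrieNode { wordFragment = "",
                  children = Map.fromList [('c', catNode)],
                  nodeType = Top }
```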

Balancing Flexibility and Over-Engineering
A classic problem. Time, elegance, maintainability, the desire to learn -- all of these can pull a design toward being too flexible (over-engineered) or too rigid (under-engineered).

If we are designing a dictionary trie, why not try to make it abstract and reusable? Why not spend the extra effort to generalize the data and the functions so that the trie can be used in multiple scenarios?

If that paragraph was itself too abstract, consider the following questions. Should our dictionary trie only be used for prefixes of actual human-language words? Or should we treat the word "word" abstractly and permit an arbitrary alphabet, from bits to anything else? Could we use our dictionary trie to store IP routing information, for instance?
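Nothing in the refined declaration pins the alphabet to Char, so a sketch of the IP-routing idea only needs a different type parameter. Here Bool serves as the alphabet of bits; the name bitRoute is hypothetical:

```haskell
import qualified Data.Map as Map

data NodeType = Top | Word | NonWord
                deriving (Show, Eq)
data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

-- The same structure, keyed on bits instead of characters.
-- A "word" is now a binary prefix, e.g. 101, much as a
-- routing table entry might store it.
bitRoute :: TrieNode Bool
bitRoute = TrieNode { wordFragment = [True, False, True],
                      children = Map.empty,
                      nodeType = Word }
```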

Here is where I find Haskell's parametric polymorphism very compelling. I had already written the insertion and search algorithms against the String-typed data declaration, and I let Haskell's type inference take care of most of my functions inasmuch as I did not declare their types. Upon deciding to be a bit more abstract, I changed every reference to String into [a] and every reference to Char into a, and it almost worked immediately -- almost, because I had some debug functions that invoked the code with strings and therefore let the compiler's inference algorithm cascade the String assumption.
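A small illustration of that inference cascade (hasChild is a hypothetical helper, not taken from the actual insert/search code): written with no type signature at all, GHC infers a fully generic type, but a single call with a Char key elsewhere in the same file would pin a down to Char.

```haskell
import qualified Data.Map as Map

data NodeType = Top | Word | NonWord
                deriving (Show, Eq)
data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

-- No declared signature: GHC infers
--   hasChild :: Ord a => a -> TrieNode a -> Bool
-- (the Ord constraint comes from Data.Map's keys).
hasChild c node = Map.member c (children node)
```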

Cart before the horse. Let's review the new code:

data NodeType = Top | Word | NonWord
                deriving (Show, Eq)
data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

So what's new here?

  1. Generic data types (parametric polymorphism applied to the type constructor) -- i.e. the use of a type parameter 'a'
  2. The use of an enumeration type, NodeType (here's where I get to apply my jargon: it's a sum type of nullary data constructors)

The first point (the use of a type parameter 'a') gives us genericity. The second point (using a new data type to enumerate the states of a node) is mainly for clarity, but it also gives us a slightly better ability to pattern match in our insert and search code.
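A sketch of what that pattern matching buys us (describe is a hypothetical helper, not part of the real insert and search code): matching on the three constructors is more direct than inspecting the two Bool fields of the old declaration, and the compiler can warn when a case is missed.

```haskell
import qualified Data.Map as Map

data NodeType = Top | Word | NonWord
                deriving (Show, Eq)
data TrieNode a = TrieNode { wordFragment :: [a],
                             children :: Map.Map a (TrieNode a),
                             nodeType :: NodeType } deriving (Show, Eq)

-- Pattern matching on the sum type: one case per constructor,
-- instead of testing isTop/isWord Bool combinations.
describe :: TrieNode a -> String
describe node = case nodeType node of
  Top     -> "root of the trie"
  Word    -> "complete word"
  NonWord -> "internal prefix"
```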

To prove this last point, we will have to start developing the algorithms we discussed in previous posts for search and insertion.
