Implementing the Dictionary Trie: Insertion and Search Algorithms: Haskell

In a previous post, I settled on a data structure for the Dictionary Trie.  In this post, I present the Haskell code I used to implement the search and insertion algorithms.   To understand what motivates the dictionary trie, make sure to read the following posts (including an informal discussion on the algorithms used in the insertion):
  1. What is a dictionary trie?
  2. How we would go about creating a dictionary trie from inserts.
  3. What is a basic Haskell data structure for a dictionary trie.
The Data Structure
data NodeType = Top|Word|NonWord
deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], children :: (Map.Map a (TrieNode a)),
nodeType:: NodeType } deriving (Show,Eq)
view raw gistfile1.hs hosted with ❤ by GitHub

Highlights:
  1. Algebraic
  2. Parametrically Polymorphic
As I had mentioned on my previous post about refining this data structure, the TrieNode is the essential component to the tree.  I have used an enumeration type (NodeType) to indicate the node type rather than a set of booleans, in the belief that Haskell's pattern matching could make the code that much easier to understand.
The Insertion Algorithm
The insertion algorithm exposes one public API function:

  • addWordToTrie
which takes a string (actually an array of the parametrically typed value) and the dictionary trie. 
There are several utility functions that I do not export out of the module (none of the module code at the top is shown here).  These utility functions use internal utility data structures that are concerned with the type of match that two strings have towards each other -- the source string is always compared against a target string and there's always one of five different matches:  
  • exact match
  • target is prefix of source
  • source is prefix of target
  • shared prefix
  • nothing in common
In the below code, I have inserted a graphic image as a  visual description of the algorithm for the four different cases that you would actually run into during the insertion.  The utility function insertNode is a direct mapping (and formal encoding) of the informal scenarios discussed in this previus post. The correspondence to the informally developed algorithms and the actual functions was quite pleasing, and was made possible by the pattern-matching capabilities of the language

data PrefixSuffixTree a = ExactMatchDatum { shared :: a } |
SharedPrefixDatum { shared :: a, suffixT :: a, suffixS :: a } |
TargetIsSmallerDatum { pre :: a, suffixS :: a } |
SourceIsSmallerDatum { pre :: a, suffixT :: a } |
NoMatchDatum deriving (Show,Eq)
data MatchType = TargetIsPrefix | SourceIsPrefix |
SharedPrefix | ExactMatch | NotEqual deriving (Show)
-- utility functions
switchToWord node = node{nodeType = Word}
takeTogetherWhile :: (Eq a) => [a] -> [a] -> PrefixSuffixTree [a]
takeTogetherWhile [] _ = NoMatchDatum
takeTogetherWhile _ [] = NoMatchDatum
takeTogetherWhile source target = case match of
ExactMatch -> ExactMatchDatum {shared=source}
TargetIsPrefix -> TargetIsSmallerDatum {pre=target,
suffixS= suffixStringOfSource}
SourceIsPrefix -> SourceIsSmallerDatum {pre=source,
suffixT= suffixStringOfTarget}
NotEqual -> NoMatchDatum
SharedPrefix -> SharedPrefixDatum { shared = sharedPrefix ,
suffixS = utilStrip sharedPrefix source ,
suffixT = utilStrip sharedPrefix target}
where match = findMatchOnString source target
suffixStringOfSource = utilStrip target source
suffixStringOfTarget = utilStrip source target
utilStrip s t = case L.stripPrefix s t of
Just s -> s
Nothing -> []
sharedPrefix = [ fst x | x <- takeWhile ( \ (x,y) -> x==y) $ zip source target ]
source =><= target = takeTogetherWhile source target
findMatchOnString sourceS@(s:ss) targetS@(t:ts) | sourceS == targetS = ExactMatch
| sourceS `L.isPrefixOf` targetS = SourceIsPrefix
| targetS `L.isPrefixOf` sourceS = TargetIsPrefix
| s /= t = NotEqual
| otherwise = SharedPrefix
insertNewChildWordNode word@(x:xs) childNodes = (wordNode word) *-> childNodes
insertNewChildNode node@TrieNode{wordFragment=word@(x:xs)} childNodes
= Map.insert x node childNodes
x *-> y = insertNewChildNode x y
-- match for the top node
addWordToTrie word@(x:xs) node@(TrieNode{children=childNodes,nodeType=Top} )
| Map.notMember x childNodes = node{children= (insertNewChildWordNode word childNodes)}
| otherwise = node{ children= newlyChangedNode *-> childNodes}
where subNode s = childNodes ! s
newlyChangedNode = addWordToTrie word (subNode x)
-- match for the others
addWordToTrie source@(x:xs) node@(TrieNode{children=childNodes, wordFragment=target} )
= case matchData of
TargetIsSmallerDatum {pre=target,
suffixS= suffixStringOfSource@(s:ss)}
| Map.notMember s childNodes -> insertNode node matchData
| otherwise -> node{ children = newlyChangedNode *-> childNodes}
where subNode = (childNodes ! )
newlyChangedNode = addWordToTrie suffixStringOfSource (subNode s)
otherwise -> insertNode node matchData
where matchData = source =><= target
view raw gistfile1.hs hosted with ❤ by GitHub

  Exact Match
insertNode contextNode ExactMatchDatum{} = switchToWord contextNode
view raw gistfile1.hs hosted with ❤ by GitHub

Target Is Smaller Than Source
-- the following is guaranteed not to have something at the child node as we
-- checked it above
insertNode contextNode@TrieNode{children=childNodes}
TargetIsSmallerDatum {pre=target,
suffixS= suffixStringOfSource}
= contextNode{ children=(insertNewChildWordNode suffixStringOfSource childNodes) }
view raw gistfile1.hs hosted with ❤ by GitHub

Source Is Smaller Than Target
-- the following is for trie nodes where the source is smaller
insertNode contextNode@TrieNode{children=childNodes,nodeType=oldNodeType}
SourceIsSmallerDatum {pre=source,
suffixT= suffixStringOfTarget}
= contextNode{ wordFragment = source,
nodeType = Word,
children= slidNode *-> childNodes }
where slidNode = contextNode{wordFragment=suffixStringOfTarget,nodeType=oldNodeType}
view raw gistfile1.hs hosted with ❤ by GitHub

Source and Target Share Same Prefix
-- this will be the complicated case
insertNode contextNode@TrieNode{children=childNodes,nodeType=oldNodeType}
SharedPrefixDatum { shared = sharedPrefix ,
suffixS = sourceSuffix@(s:ss),
suffixT = targetSuffix@(t:ts)}
= newForkNode {children = ourTwoNewNodes }
where newForkNode = nonWordNode sharedPrefix
ourTwoNewNodes = Map.fromList [(s,newSourceSuffixNode),(t,newTargetSuffixNode) ]
newSourceSuffixNode = wordNode sourceSuffix
newTargetSuffixNode = contextNode{ wordFragment = targetSuffix}
view raw gistfile1.hs hosted with ❤ by GitHub

The Querying Algorithms
queryNode source node = bool
where (bool,_) = queryNodeWithPath source node []
queryNodeWithPath :: (Ord a) => [a] -> TrieNode a -> [[a]] -> (Bool,[[a]])
queryNodeWithPath source@(s:_) node@TrieNode{ nodeType = Top, children=ch} acc
| Map.member s ch = queryNodeWithPath source (ch ! s) acc
| otherwise = (False,acc)
queryNodeWithPath source node@TrieNode{ wordFragment = target, nodeType=nT, children=ch } acc = case matchData of
ExactMatchDatum {shared=source}
| nT == Word -> (True, (target:acc))
| otherwise -> (False,acc)
TargetIsSmallerDatum {suffixS= suffixStringOfSource@(s:_)}
| Map.member s ch -> queryNodeWithPath suffixStringOfSource (ch ! s) (target:acc )
| otherwise -> (False,acc)
otherwise -> (False,acc)
where matchData = source =><= target
view raw gistfile1.hs hosted with ❤ by GitHub

Discussion
These are the subjects I want to touch on in my creation of the above code:
  1. Keeping things functional with record syntax
  2. Just Maybe stuff
  3. Potentially too redundant in spots  
  4. my first use of home-grown operators
  5. does pattern matching on record syntax get me that much?
  6. no magic
  7. functionally passing back self
Keeping Thing Functional with Record Syntax
In order to make a 'change' in the functional world, what you have to do is actually make a new value.  For values that are aggregates of others, like our data structure, you need to make sure that the old values for the other stuff remains the same, so essentially you 'copy' the other values.  I was extremely pleased to see that with the record syntax, you could code such that a 'modification' to one field in the record affected the overall change leaving the rest of the fields the same as previous.  This required the use of Haskell's pattern-matching to name the overall structure/node.  
Just Maybe Stuff
I know that I am not properly understanding the way for dealing with Maybe values.  I feel that there has to be a better way for extracting the underlying value than the where-clause utility functions I use. Does this reflect an underlying misapprehension of the pholosophy of Maybe values? Maybe.
Potentially Too Redundant
I had two internal data structures to represent the matching of the source and target strings.  One was purely a description of the matching that occurred (the type MatchType) and consisted of five nullary constructor data types.  The other redundantly captured the state of the match but also encapsulated the prefix-suffix pair that resulted.  I have a strong suspicion I could have done this with one pass.  I will revisit.
First Use of Homegrown Operators
This is an issue of terseness and psychology.  Once I had the algorithms developed and everything working according to plan, I refactored some prefix-oriented functions into operators.  I turned:

  • insertNewChildNode into *->
  • takeTogetherWhile into =><=
In the end, this reduced the camelCaseClutterAndExcessinveJavaLikeNaming, but as these functions were internal there probably was no real gain.
Pattern Matching Cost Benefit Analysis
The record syntax (as a data constructor) permits pattern matching on the left-hand side of function definitions or on case expressions.  There were several times I wondered whether the use of the pattern-matching was really worth it.  While it did lower the terseness on the right-hand side (no need to fetch the data out of the record with 'accessor' functions), it did raise the level of noise on the left-hand side.
No Magic
As I've been diving deep into Haskell, I've been itching to try out the magic that provides the allure to this language.  I have been attracted to function composition and currying.  I've desired to chain together functions in a point-free style or employ functors or applicative style idioms.  I found that my code showed little of this magic.  The reasons could be (and/or):

  • I am new to a different paradigm and still thinking idiomatically imperative
  • I am dealing with programming in the large which is really just massaging, manipulation, and transformation of record syntaxes.
Functionally Passing Back Self
The utility function insertNode is the heart of the insertion algorithm and is tree-recursive.  As mentioned above, the functional nature of Haskell requires that in tree updates, we are actually doing tree 'copies'  (the underlying implementation may share state to increase performance, but as far as the language semantics go, the values should all be new and copied).  To accomplish this, functions which are tree recursive have to always do the same thing:  
  • they need to pass back the node they are currently operating on as the result of the function
  • and assume that the invocations on themselves (the recursive call) are 'modifications' to the child nodes
This grew tedious after a while, and felt like unnecessary boilerplate.  It was stuff like this that drew the development for higher-order functions like fold and map, and I considered exploring the benefit of casting the Trie as a functor, or of implementing a zipper data structure.  
I will revisit this.

6 comments:

  1. Thank you for the great series... Trie implementation in Haskell.
    I am missing "wordNode" and "nonWordNode" definitions in the code snippets above..

    Any idea ?

    ReplyDelete
  2. Im extremely impressed with your writing skills and also with the layout in your blog

    ReplyDelete
  3. Your personal stuffs is beautiful. All the time deal with it up!

    ReplyDelete
  4. It's a very helpful section of information. We're happy that you discussed this

    ReplyDelete
  5. I can not wait to learn far more from you. This is really a tremendous web site.

    ReplyDelete
  6. Continue on inspiring us with your writing skills! Your work inspires us

    ReplyDelete