- What is a dictionary trie?
- How we would go about creating a dictionary trie from inserts.
- What is a basic Haskell data structure for a dictionary trie.
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
data NodeType = Top|Word|NonWord | |
deriving (Show,Eq) | |
data TrieNode a = TrieNode { wordFragment :: [a], children :: (Map.Map a (TrieNode a)), | |
nodeType:: NodeType } deriving (Show,Eq) |
Highlights:
- Algebraic
- Parametrically Polymorphic
As I had mentioned on my previous post about refining this data structure, the TrieNode is the essential component to the tree. I have used an enumeration type (NodeType) to indicate the node type rather than a set of booleans, in the belief that Haskell's pattern matching could make the code that much easier to understand.
The Insertion AlgorithmThe insertion algorithm exposes one public API function:
- addWordToTrie
which takes a string (actually an array of the parametrically typed value) and the dictionary trie.
There are several utility functions that I do not export out of the module (none of the module code at the top is shown here). These utility functions use internal utility data structures that are concerned with the type of match that two strings have towards each other -- the source string is always compared against a target string and there's always one of five different matches:
- exact match
- target is prefix of source
- source is prefix of target
- shared prefix
- nothing in common
In the below code, I have inserted a graphic image as a visual description of the algorithm for the four different cases that you would actually run into during the insertion. The utility function insertNode is a direct mapping (and formal encoding) of the informal scenarios discussed in this previus post. The correspondence to the informally developed algorithms and the actual functions was quite pleasing, and was made possible by the pattern-matching capabilities of the language
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
data PrefixSuffixTree a = ExactMatchDatum { shared :: a } | | |
SharedPrefixDatum { shared :: a, suffixT :: a, suffixS :: a } | | |
TargetIsSmallerDatum { pre :: a, suffixS :: a } | | |
SourceIsSmallerDatum { pre :: a, suffixT :: a } | | |
NoMatchDatum deriving (Show,Eq) | |
data MatchType = TargetIsPrefix | SourceIsPrefix | | |
SharedPrefix | ExactMatch | NotEqual deriving (Show) | |
-- utility functions | |
switchToWord node = node{nodeType = Word} | |
takeTogetherWhile :: (Eq a) => [a] -> [a] -> PrefixSuffixTree [a] | |
takeTogetherWhile [] _ = NoMatchDatum | |
takeTogetherWhile _ [] = NoMatchDatum | |
takeTogetherWhile source target = case match of | |
ExactMatch -> ExactMatchDatum {shared=source} | |
TargetIsPrefix -> TargetIsSmallerDatum {pre=target, | |
suffixS= suffixStringOfSource} | |
SourceIsPrefix -> SourceIsSmallerDatum {pre=source, | |
suffixT= suffixStringOfTarget} | |
NotEqual -> NoMatchDatum | |
SharedPrefix -> SharedPrefixDatum { shared = sharedPrefix , | |
suffixS = utilStrip sharedPrefix source , | |
suffixT = utilStrip sharedPrefix target} | |
where match = findMatchOnString source target | |
suffixStringOfSource = utilStrip target source | |
suffixStringOfTarget = utilStrip source target | |
utilStrip s t = case L.stripPrefix s t of | |
Just s -> s | |
Nothing -> [] | |
sharedPrefix = [ fst x | x <- takeWhile ( \ (x,y) -> x==y) $ zip source target ] | |
source =><= target = takeTogetherWhile source target | |
findMatchOnString sourceS@(s:ss) targetS@(t:ts) | sourceS == targetS = ExactMatch | |
| sourceS `L.isPrefixOf` targetS = SourceIsPrefix | |
| targetS `L.isPrefixOf` sourceS = TargetIsPrefix | |
| s /= t = NotEqual | |
| otherwise = SharedPrefix | |
insertNewChildWordNode word@(x:xs) childNodes = (wordNode word) *-> childNodes | |
insertNewChildNode node@TrieNode{wordFragment=word@(x:xs)} childNodes | |
= Map.insert x node childNodes | |
x *-> y = insertNewChildNode x y | |
-- match for the top node | |
addWordToTrie word@(x:xs) node@(TrieNode{children=childNodes,nodeType=Top} ) | |
| Map.notMember x childNodes = node{children= (insertNewChildWordNode word childNodes)} | |
| otherwise = node{ children= newlyChangedNode *-> childNodes} | |
where subNode s = childNodes ! s | |
newlyChangedNode = addWordToTrie word (subNode x) | |
-- match for the others | |
addWordToTrie source@(x:xs) node@(TrieNode{children=childNodes, wordFragment=target} ) | |
= case matchData of | |
TargetIsSmallerDatum {pre=target, | |
suffixS= suffixStringOfSource@(s:ss)} | |
| Map.notMember s childNodes -> insertNode node matchData | |
| otherwise -> node{ children = newlyChangedNode *-> childNodes} | |
where subNode = (childNodes ! ) | |
newlyChangedNode = addWordToTrie suffixStringOfSource (subNode s) | |
otherwise -> insertNode node matchData | |
where matchData = source =><= target |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
insertNode contextNode ExactMatchDatum{} = switchToWord contextNode |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- the following is guaranteed not to have something at the child node as we | |
-- checked it above | |
insertNode contextNode@TrieNode{children=childNodes} | |
TargetIsSmallerDatum {pre=target, | |
suffixS= suffixStringOfSource} | |
= contextNode{ children=(insertNewChildWordNode suffixStringOfSource childNodes) } |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- the following is for trie nodes where the source is smaller | |
insertNode contextNode@TrieNode{children=childNodes,nodeType=oldNodeType} | |
SourceIsSmallerDatum {pre=source, | |
suffixT= suffixStringOfTarget} | |
= contextNode{ wordFragment = source, | |
nodeType = Word, | |
children= slidNode *-> childNodes } | |
where slidNode = contextNode{wordFragment=suffixStringOfTarget,nodeType=oldNodeType} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
-- this will be the complicated case | |
insertNode contextNode@TrieNode{children=childNodes,nodeType=oldNodeType} | |
SharedPrefixDatum { shared = sharedPrefix , | |
suffixS = sourceSuffix@(s:ss), | |
suffixT = targetSuffix@(t:ts)} | |
= newForkNode {children = ourTwoNewNodes } | |
where newForkNode = nonWordNode sharedPrefix | |
ourTwoNewNodes = Map.fromList [(s,newSourceSuffixNode),(t,newTargetSuffixNode) ] | |
newSourceSuffixNode = wordNode sourceSuffix | |
newTargetSuffixNode = contextNode{ wordFragment = targetSuffix} |
The Querying Algorithms
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
queryNode source node = bool | |
where (bool,_) = queryNodeWithPath source node [] | |
queryNodeWithPath :: (Ord a) => [a] -> TrieNode a -> [[a]] -> (Bool,[[a]]) | |
queryNodeWithPath source@(s:_) node@TrieNode{ nodeType = Top, children=ch} acc | |
| Map.member s ch = queryNodeWithPath source (ch ! s) acc | |
| otherwise = (False,acc) | |
queryNodeWithPath source node@TrieNode{ wordFragment = target, nodeType=nT, children=ch } acc = case matchData of | |
ExactMatchDatum {shared=source} | |
| nT == Word -> (True, (target:acc)) | |
| otherwise -> (False,acc) | |
TargetIsSmallerDatum {suffixS= suffixStringOfSource@(s:_)} | |
| Map.member s ch -> queryNodeWithPath suffixStringOfSource (ch ! s) (target:acc ) | |
| otherwise -> (False,acc) | |
otherwise -> (False,acc) | |
where matchData = source =><= target |
Discussion
These are the subjects I want to touch on in my creation of the above code:
- Keeping things functional with record syntax
- Just Maybe stuff
- Potentially too redundant in spots
- my first use of home-grown operators
- does pattern matching on record syntax get me that much?
- no magic
- functionally passing back self
Keeping Thing Functional with Record Syntax
In order to make a 'change' in the functional world, what you have to do is actually make a new value. For values that are aggregates of others, like our data structure, you need to make sure that the old values for the other stuff remains the same, so essentially you 'copy' the other values. I was extremely pleased to see that with the record syntax, you could code such that a 'modification' to one field in the record affected the overall change leaving the rest of the fields the same as previous. This required the use of Haskell's pattern-matching to name the overall structure/node.
Just Maybe Stuff
I know that I am not properly understanding the way for dealing with Maybe values. I feel that there has to be a better way for extracting the underlying value than the where-clause utility functions I use. Does this reflect an underlying misapprehension of the pholosophy of Maybe values? Maybe.
Potentially Too Redundant
I had two internal data structures to represent the matching of the source and target strings. One was purely a description of the matching that occurred (the type MatchType) and consisted of five nullary constructor data types. The other redundantly captured the state of the match but also encapsulated the prefix-suffix pair that resulted. I have a strong suspicion I could have done this with one pass. I will revisit.
First Use of Homegrown Operators
This is an issue of terseness and psychology. Once I had the algorithms developed and everything working according to plan, I refactored some prefix-oriented functions into operators. I turned:
- insertNewChildNode into *->
- takeTogetherWhile into =><=
In the end, this reduced the camelCaseClutterAndExcessinveJavaLikeNaming™, but as these functions were internal there probably was no real gain.
Pattern Matching Cost Benefit Analysis
The record syntax (as a data constructor) permits pattern matching on the left-hand side of function definitions or on case expressions. There were several times I wondered whether the use of the pattern-matching was really worth it. While it did lower the terseness on the right-hand side (no need to fetch the data out of the record with 'accessor' functions), it did raise the level of noise on the left-hand side.
No Magic
As I've been diving deep into Haskell, I've been itching to try out the magic that provides the allure to this language. I have been attracted to function composition and currying. I've desired to chain together functions in a point-free style or employ functors or applicative style idioms. I found that my code showed little of this magic. The reasons could be (and/or):
- I am new to a different paradigm and still thinking idiomatically imperative
- I am dealing with programming in the large which is really just massaging, manipulation, and transformation of record syntaxes.
Functionally Passing Back Self
The utility function insertNode is the heart of the insertion algorithm and is tree-recursive. As mentioned above, the functional nature of Haskell requires that in tree updates, we are actually doing tree 'copies' (the underlying implementation may share state to increase performance, but as far as the language semantics go, the values should all be new and copied). To accomplish this, functions which are tree recursive have to always do the same thing:
- they need to pass back the node they are currently operating on as the result of the function
- and assume that the invocations on themselves (the recursive call) are 'modifications' to the child nodes
This grew tedious after a while, and felt like unnecessary boilerplate. It was stuff like this that drew the development for higher-order functions like fold and map, and I considered exploring the benefit of casting the Trie as a functor, or of implementing a zipper data structure.
I will revisit this.
Thank you for the great series... Trie implementation in Haskell.
ReplyDeleteI am missing "wordNode" and "nonWordNode" definitions in the code snippets above..
Any idea ?
Im extremely impressed with your writing skills and also with the layout in your blog
ReplyDeleteYour personal stuffs is beautiful. All the time deal with it up!
ReplyDeleteIt's a very helpful section of information. We're happy that you discussed this
ReplyDeleteI can not wait to learn far more from you. This is really a tremendous web site.
ReplyDeleteContinue on inspiring us with your writing skills! Your work inspires us
ReplyDelete