Implementing the Dictionary Trie: Insertion and Search Algorithms: Haskell

In a previous post, I settled on a data structure for the Dictionary Trie.  In this post, I present the Haskell code I used to implement the search and insertion algorithms.   To understand what motivates the dictionary trie, make sure to read the following posts (including an informal discussion on the algorithms used in the insertion):
  1. What is a dictionary trie?
  2. How we would go about creating a dictionary trie from inserts.
  3. What is a basic Haskell data structure for a dictionary trie.
The Data Structure

Highlights:
  1. Algebraic
  2. Parametrically Polymorphic
As I mentioned in my previous post about refining this data structure, the TrieNode is the essential component of the tree.  I have used an enumeration type (NodeType) to indicate the node type rather than a set of booleans, in the belief that Haskell's pattern matching could make the code that much easier to understand.
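A minimal sketch of such a structure follows; the field names, the NodeType constructors, and the wordNode/nonWordNode helpers are stand-ins of my own, not necessarily the definitions in the real code.

  -- A sketch only: the field names and the NodeType constructors below are
  -- stand-ins, not the original definitions.
  data NodeType = Word | NonWord
    deriving (Show, Eq)

  data TrieNode a = TrieNode
    { nodeType :: NodeType         -- does the path ending here spell a complete word?
    , fragment :: [a]              -- the piece of the key stored at this node
    , children :: [TrieNode a]     -- child subtries hanging off this node
    } deriving (Show, Eq)

  -- Guesses at the convenience constructors (the wordNode/nonWordNode that the
  -- comment at the bottom of this post asks about):
  wordNode, nonWordNode :: [a] -> [TrieNode a] -> TrieNode a
  wordNode frag kids    = TrieNode Word frag kids
  nonWordNode frag kids = TrieNode NonWord frag kids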
The Insertion Algorithm
The insertion algorithm exposes one public API function:

  • addWordToTrie
which takes a string (actually a list of the parametrically typed values) and the dictionary trie.
There are several utility functions that I do not export from the module (none of the module header code is shown here).  These utility functions use internal utility data structures concerned with the kind of match two strings have with each other -- the source string is always compared against a target string, and there is always one of five possible matches (a sketch of these internals follows the list):
  • exact match
  • target is prefix of source
  • source is prefix of target
  • shared prefix
  • nothing in common
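A sketch of that five-way classification, using constructor names that are my own stand-ins rather than the originals:

  -- A sketch of the five-way match classification; the constructor names are
  -- stand-ins for whatever the real MatchType uses.
  data MatchType
    = ExactMatch        -- source == target
    | TargetIsPrefix    -- target is a proper prefix of source
    | SourceIsPrefix    -- source is a proper prefix of target
    | SharedPrefix      -- they agree on a non-empty prefix, then diverge
    | NoMatch           -- they differ from the very first element
    deriving (Show, Eq)

  classifyMatch :: Eq a => [a] -> [a] -> MatchType
  classifyMatch source target
    | null srcRest && null tgtRest = ExactMatch
    | null tgtRest                 = TargetIsPrefix
    | null srcRest                 = SourceIsPrefix
    | null common                  = NoMatch
    | otherwise                    = SharedPrefix
    where
      common  = map fst (takeWhile (uncurry (==)) (zip source target))
      srcRest = drop (length common) source
      tgtRest = drop (length common) target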
Below, I have inserted a graphic image as a visual description of the algorithm for each of the four cases you would actually run into during the insertion.  The utility function insertNode is a direct mapping (and formal encoding) of the informal scenarios discussed in this previous post.  The correspondence between the informally developed algorithms and the actual functions was quite pleasing, and was made possible by the pattern-matching capabilities of the language.


[Figure: Exact Match]
[Figure: Target Is Smaller Than Source]
[Figure: Source Is Smaller Than Target]
[Figure: Source and Target Share Same Prefix]
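A skeleton of how insertNode might dispatch on those four cases, written against the sketched types above (the case bodies and the insertIntoChildren helper are illustrative guesses, not the original code):

  -- Sketch: dispatch on the match between the incoming fragment and the
  -- fragment stored at the current node.  Each pictured scenario gets its own
  -- case; the bodies are simplified stand-ins for the real handling.
  insertNode :: Eq a => [a] -> TrieNode a -> TrieNode a
  insertNode source node@(TrieNode { fragment = target, children = kids }) =
    case classifyMatch source target of
      -- Exact match: the path already exists, so just mark it as a word.
      ExactMatch     -> node { nodeType = Word }
      -- Target is smaller: push the leftover suffix down into the children.
      TargetIsPrefix -> node { children = insertIntoChildren (drop (length target) source) kids }
      -- Source is smaller: the new, shorter word becomes the parent.
      SourceIsPrefix -> wordNode source [ node { fragment = drop (length source) target } ]
      -- Shared prefix: split at the common prefix and hang both remainders off it.
      SharedPrefix   -> nonWordNode common
                          [ node { fragment = drop (length common) target }
                          , wordNode (drop (length common) source) [] ]
      -- Nothing in common: resolved one level up, at the parent.
      NoMatch        -> node
    where
      common = map fst (takeWhile (uncurry (==)) (zip source target))

  -- Insert a suffix into a list of children: hand it to a child it overlaps
  -- with, or add a brand-new leaf if no child overlaps at all.
  insertIntoChildren :: Eq a => [a] -> [TrieNode a] -> [TrieNode a]
  insertIntoChildren suffix kids =
    case break (\k -> classifyMatch suffix (fragment k) /= NoMatch) kids of
      (before, k:after) -> before ++ insertNode suffix k : after
      (_, [])           -> kids ++ [wordNode suffix []]

The exported addWordToTrie would then presumably be a thin wrapper that hands the word and the root node to insertNode.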

The Querying Algorithms
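A minimal sketch of what a membership query over the structure above could look like (containsWord is a stand-in name, not necessarily the exported API):

  -- Sketch of a membership query: walk the trie, peeling off matched fragments.
  containsWord :: Eq a => [a] -> TrieNode a -> Bool
  containsWord source node =
    case classifyMatch source (fragment node) of
      -- The whole remaining word sits at this node: it is a word only if the
      -- node is flagged as one.
      ExactMatch     -> nodeType node == Word
      -- This node's fragment is consumed: keep searching with the leftover
      -- suffix in the children.
      TargetIsPrefix -> any (containsWord (drop (length (fragment node)) source))
                            (children node)
      -- Any other relationship means the word is not in the trie.
      _              -> False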

Discussion
These are the subjects I want to touch on regarding the creation of the above code:
  1. Keeping things functional with record syntax
  2. Just Maybe stuff
  3. Potentially too redundant in spots  
  4. My first use of home-grown operators
  5. Does pattern matching on record syntax get me that much?
  6. No magic
  7. Functionally passing back self
Keeping Things Functional with Record Syntax
In order to make a 'change' in the functional world, you actually have to make a new value.  For values that are aggregates of others, like our data structure, you need to make sure that the old values of the other fields remain the same, so essentially you 'copy' the other values.  I was extremely pleased to see that, with record syntax, you can code a 'modification' to one field of the record that produces the overall change while leaving the rest of the fields the same as before.  This requires Haskell's pattern matching to name the overall structure/node.
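The idiom, sketched against the assumed TrieNode above (markAsWord is a made-up helper): an as-pattern names the whole node, and record-update syntax builds a new value that differs only in the field mentioned.

  -- An as-pattern names the incoming node; the record update copies every
  -- field except the one explicitly changed.
  markAsWord :: TrieNode a -> TrieNode a
  markAsWord node@(TrieNode { nodeType = NonWord }) = node { nodeType = Word }
  markAsWord node                                   = node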
Just Maybe Stuff
I know that I am not properly understanding the idiomatic way of dealing with Maybe values.  I feel that there has to be a better way of extracting the underlying value than the where-clause utility functions I use. Does this reflect an underlying misapprehension of the philosophy of Maybe values? Maybe.
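For what it is worth, the standard helpers fromMaybe and maybe already cover the common extraction patterns; the lookupChild helper below is hypothetical, just to give them something to chew on.

  import Data.Maybe (fromMaybe)

  -- Hypothetical helper: find a child whose fragment equals the given prefix.
  lookupChild :: Eq a => [a] -> TrieNode a -> Maybe (TrieNode a)
  lookupChild prefix node =
    case filter ((== prefix) . fragment) (children node) of
      (c:_) -> Just c
      []    -> Nothing

  -- Extracting the value without a hand-rolled where-clause helper:
  childOrSelf :: Eq a => [a] -> TrieNode a -> TrieNode a
  childOrSelf prefix node = fromMaybe node (lookupChild prefix node)
  -- equivalently:  maybe node id (lookupChild prefix node)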
Potentially Too Redundant
I had two internal data structures to represent the matching of the source and target strings.  One was purely a description of the matching that occurred (the type MatchType) and consisted of five nullary constructors.  The other redundantly captured the state of the match but also encapsulated the prefix-suffix pieces that resulted.  I have a strong suspicion I could have done this in one pass.  I will revisit.
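One way the consolidation could look, assuming a record such as the hypothetical MatchResult below for the second structure: compute the classification and the pieces together in a single pass.

  -- Hypothetical consolidation: the classification plus the common prefix and
  -- both leftover suffixes, computed in one pass.
  data MatchResult a = MatchResult
    { matchKind    :: MatchType
    , commonPrefix :: [a]
    , sourceRest   :: [a]
    , targetRest   :: [a]
    }

  matchStrings :: Eq a => [a] -> [a] -> MatchResult a
  matchStrings source target = MatchResult kind common srcRest tgtRest
    where
      common  = map fst (takeWhile (uncurry (==)) (zip source target))
      srcRest = drop (length common) source
      tgtRest = drop (length common) target
      kind
        | null srcRest && null tgtRest = ExactMatch
        | null tgtRest                 = TargetIsPrefix
        | null srcRest                 = SourceIsPrefix
        | null common                  = NoMatch
        | otherwise                    = SharedPrefix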
First Use of Homegrown Operators
This is an issue of terseness and psychology.  Once I had the algorithms developed and everything working according to plan, I refactored some prefix-oriented functions into operators.  I turned:

  • insertNewChildNode into *->
  • takeTogetherWhile into =><=
In the end, this reduced the camelCaseClutterAndExcessiveJavaLikeNaming, but as these functions were internal there probably was no real gain.
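Defining such operator aliases is a one-liner apiece; the bodies below are only guesses at what the two functions actually did.

  -- Operator aliases of the kind described above; the right-hand sides are
  -- guesses, since the original bodies are not shown here.
  infixr 5 *->
  (*->) :: Eq a => [a] -> [TrieNode a] -> [TrieNode a]
  suffix *-> kids = insertIntoChildren suffix kids              -- was insertNewChildNode

  (=><=) :: Eq a => [a] -> [a] -> [a]
  xs =><= ys = map fst (takeWhile (uncurry (==)) (zip xs ys))   -- was takeTogetherWhile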
Pattern Matching Cost Benefit Analysis
The record syntax (as a data constructor) permits pattern matching on the left-hand side of function definitions or in case expressions.  There were several times I wondered whether the use of pattern matching was really worth it.  While it did reduce the verbosity on the right-hand side (no need to fetch data out of the record with 'accessor' functions), it raised the level of noise on the left-hand side.
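The trade-off on a made-up helper over the sketched TrieNode: a record pattern keeps the right-hand side terse, while accessor functions keep the left-hand side quiet.

  -- Pattern-matching style: noisy left-hand side, terse right-hand side.
  describeNode :: TrieNode a -> String
  describeNode (TrieNode { nodeType = t, children = kids }) =
    show t ++ " with " ++ show (length kids) ++ " children"

  -- Accessor style: quiet left-hand side, busier right-hand side.
  describeNode' :: TrieNode a -> String
  describeNode' node =
    show (nodeType node) ++ " with " ++ show (length (children node)) ++ " children"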
No Magic
As I've been diving deep into Haskell, I've been itching to try out the magic that gives this language its allure.  I have been attracted to function composition and currying.  I've wanted to chain together functions in a point-free style or employ functors or applicative-style idioms.  I found that my code showed little of this magic.  The reasons could be one or both of the following:

  • I am new to a different paradigm and still thinking idiomatically imperative
I am dealing with programming in the large, which is really just the massaging, manipulation, and transformation of records.
Functionally Passing Back Self
The utility function insertNode is the heart of the insertion algorithm and is tree-recursive.  As mentioned above, the functional nature of Haskell means that tree updates are actually tree 'copies' (the underlying implementation may share structure to improve performance, but as far as the language semantics go, the values are all new copies).  To accomplish this, tree-recursive functions always have to do the same thing:
  • they need to pass back the node they are currently operating on as the result of the function
  • and assume that the invocations on themselves (the recursive call) are 'modifications' to the child nodes
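In its simplest form, the pattern looks like the sketch below (mapFragments is a made-up example, not from the original code): the function hands back a rebuilt copy of the node it was given, with the recursive calls standing in for 'modified' children.

  -- Hand back a rebuilt copy of the current node; the recursion supplies the
  -- 'modified' children, and every untouched field is copied across.
  mapFragments :: ([a] -> [b]) -> TrieNode a -> TrieNode b
  mapFragments f node =
    TrieNode { nodeType = nodeType node                        -- copied unchanged
             , fragment = f (fragment node)                    -- the actual 'change'
             , children = map (mapFragments f) (children node) -- recurse, pass back
             }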
This grew tedious after a while and felt like unnecessary boilerplate.  It was exactly this kind of thing that drove the development of higher-order functions like fold and map, and I considered exploring the benefit of casting the Trie as a functor, or of implementing a zipper data structure.
I will revisit this.

1 comment:

  1. Thank you for the great series... Trie implementation in Haskell.
    I am missing the "wordNode" and "nonWordNode" definitions in the code snippets above.

    Any idea?
