Implementing the Dictionary Trie: Defining the Data Structures

Purpose

In this post, we are going to start implementing the dictionary trie.  To start this we will design the data structure. Below is a UML-ish representation of  a TrieNode

Because I want to learn some new (to me) functional languages (HaskellClojure, and Scala), and this blog is more for my benefit than anything else, we will use Haskell in this post.  

I won't go over the benefit of functional languages here.  There are many other posts/articles that do a better job explaining/selling functional languages over other paradigms.

Quick Review of Trie Node


We came up with this UML'is representation of a single node of our data structure back in the very first post.
  1. The "wordFragment" stores the substring of the word at that point in the trie.  For leaf nodes, these word fragments are guaranteed to be suffixes to a word in the dictionary.
  2. The "children" is a set of pointers to other TrieNodes. It is this aspect that allows this single node to become a tree. Do not read too much into the array notation.
  3. The "isWord" indicator is needed to indicate if a non-leaf tree is a word end (although we will use it in leaf nodes as well).  All leaf nodes are guaranteed to be word ends.  
  4. The "isTop" indicator is used only for the special degenerate top-level of the trie.  Please review the second post Dictionary Trie Insertion for the explanation of why this trie is needed.
Haskell Implementation First Draft

data TrieNode  = TrieNode { wordFragment :: String, children :: [TrieNode] 
                            isTop :: Bool, isWord :: Bool} 

This is absolutely sufficient for implementing everything required.  We have here a Haskell record syntax which provides us some syntactic sugar and some autogenerated functions (loosely akin to "getters" for people coming from an OO background).

In order to initialize a top node, we would:

topNode  = TrieNode { wordFragment ="" , children = [] , isTop = True, isWord = False} 

But let's consider some details.

Walking Down the Tree
Remember this tree (from the second post)?

We can see the arrows leading down from the prefixes to the end nodes. Walking down on the left hand side, we get: "re" + "al" + "ize" + "s" =  "realizes". But there is something missing (in both this diagram and in our Haskell record syntax) in describing how we represent which arrow points to which node.

We need something more like this:

which shows that we can access our sub-nodes by assigning the first character in the child node's "wordFragment" to the edge.

So let's change our code a bit, and instead of using a a list to represent the children in our node, we will use a Map (with a capital M, to remove ambiguity with the function 'map').

import qualified Data.Map as Map

data TrieNode  = TrieNode { wordFragment :: String, children :: Map.Map Char TrieNode
                            isTop :: Bool, isWord :: Bool} 

There is nothing wrong with a list.  It would be just inelegant to find the appropriate sub-child by searching through the list, when a hash lookup based on the first character would get us better performance.

No comments:

Post a Comment