Dictionary Trie Digression: Auto-Generating Those Pretty Graphs

I am Lazy (with a capital L)
Some say laziness is a hallmark of a programmer.

Did you think I hand drew this tree with Visio or Office Draw ?
 Well, no.

As I started to post a couple of weeks ago, I realized that I needed to show tree structures early and often.  Originally, I was doing something silly like this:
---"r"
    |
    |
    ---"ea"
    |    |
    |    |
    |    ---"l"
    |    |
    |    |
    |    ---"r"
    |
    |
    ---"ock"
but then blogspot mangled it, and I realized it wasn't as pretty as a graphic like the one above.  I figured to myself that there MUST be a simple domain specific language for rendering simple graphs/trees based on a text.  Ideally, something like this:
a -> b -> c
     b -> d
should generate something like this:
Well, guess what? This is exactly a language, and the above snippet was used to generate this image.  The language is called Dot and is declarative and simple.  A package called Graphviz provides a nice set of free tools for generating images from a Dot file.

I Continued Being Lazy
I did not want to install these free tools.  I figured, rightly so, that somewhere, somebody had probably wrapped the Graphviz tool-set with HTTP.  O brave new world of web-services, ajax and SOA!  A little googling and I found my savior:  http://graph.gafol.net/.  And so now, I had everything I needed.  Time to stop being lazy.

Laziness Tabled: Utility Functions for Rendering Tries to Dot
So back in a previous post I had come up with my final (for now) data structure for a Trie (as an inductively generated set of nodes).

Here it is:

data NodeType = Top|Word|NonWord 
                deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], 
                             children :: (Map.Map a (TrieNode a)), 
                             nodeType:: NodeType }  deriving (Show,Eq)


With only this structure in mind, it is possible to write the code to generate the dot
language. Here is the code:


data Dot = DotPath {fid::String, tid::String} | 
           DotNode{dotId::String,label::String}

instance Show Dot where
    show DotPath{fid=from,tid=to@(x:_)} = "\"" ++ from ++ "\"" 
                                          ++  "->"
                                          ++ "\"" ++ to ++ "\""
                                          ++ "[label=\" " ++ ( x : "\"]")
    show DotNode{dotId=nodeId,label=outputLabel} = "\"" ++ nodeId ++ "\"" 
                                                ++  "[label= \"" ++ outputLabel ++ "\"]" 

generateDotLabel nodeDatum = case (nodeType nodeDatum) of 
                    Top -> "∅"
                    Word ->  wf ++ "(*)" 
                    NonWord ->   wf 
     where wf = (wordFragment nodeDatum)

generateDot :: TrieNode Char -> String -> [Dot]
generateDot nodeDatum path = myDotNode : myPaths ++ myChildrensDots 
    where myDotNode = DotNode{dotId= path,label=(generateDotLabel nodeDatum)}
          myPaths = map generatePathNode $ (Map.keys  (children nodeDatum))
          myChildrensDots = foldr (++) []  $
                                  map generateChildrenDots $ (Map.assocs  (children nodeDatum))
          generatePathNode x = DotPath{fid=path,tid=x:path}
          generateChildrenDots (k,v) = generateDot v (k:path)
          

generateDotPretty tree = L.intercalate "\n" $ map show  $ generateDot tree "_"


When I ran it on the Trie generated by the following words:
testWords = ["rebate",
             "reborn",
             "realize",
             "real",
             "relied",
             "rebates",
             "reb",
             "relief",
             "realizes",
             "redder",
             "red"]

I get this:
"_"[label= "∅"]
"_"->"r_"[label=" r"]
"r_"[label= "re"]
"r_"->"ar_"[label=" a"]
"r_"->"br_"[label=" b"]
"r_"->"dr_"[label=" d"]
"r_"->"lr_"[label=" l"]
"ar_"[label= "al(*)"]
"ar_"->"iar_"[label=" i"]
"iar_"[label= "ize(*)"]
"iar_"->"siar_"[label=" s"]
"siar_"[label= "s(*)"]
"br_"[label= "b(*)"]
"br_"->"abr_"[label=" a"]
"br_"->"obr_"[label=" o"]
"abr_"[label= "ate(*)"]
"abr_"->"sabr_"[label=" s"]
"sabr_"[label= "s(*)"]
"obr_"[label= "orn(*)"]
"dr_"[label= "d(*)"]
"dr_"->"ddr_"[label=" d"]
"ddr_"[label= "der(*)"]
"lr_"[label= "lie"]
"lr_"->"dlr_"[label=" d"]
"lr_"->"flr_"[label=" f"]
"dlr_"[label= "d(*)"]
"flr_"[label= "f(*)"]

You can copy this into  http://graph.gafol.net/ and see that it generates the image at the top of this post.

Some Observations

  1. Dot Id's
  2. instances of type classes
  3. pattern matching
  4. generate dot can't be used on any types (note the type) -- or 'does not play well with arbitrarily contained types'
  5. probably not efficient
  6. use of a wrapper data structure
Dot Id's
I could not use the word fragment itself as the id, otherwise we would introduce nodes with multiple parents (an acyclic graph). The Dot language allows me to use any arbitrary sequence of characters as the id, so long as they are surrounded by quotations.  So, I generated the ids from the path down to the node, pushing the character key onto the path as we descended recursively -- initializing the path with a bogus start character of an underscore '_'.  You can verify that the id's are in the reverse order of the descending path.  The word fragments are used, of course, to render the label of the node.

Instances of Type Classes
This was the first time I overrode show and purposely put my data structure (the TrieNode) into a type class (namely Show).   As I have read the literature, type classes are Haskell's unique contribution to the functional community, (aside from assembling everything together in one nice package).  Type classes allow for a disciplined framework for ad-hoc polymorphism.  Another disciplined framework for ad-hoc polymorphism is subclassing in Java and overriding a method name.  I might revisit overriding show ( using the (++) concatenation operator), but, for now, this is how I pretty printed the Dot commands.

Pattern Matching
For those following along (not just me), that here is the justification in my earlier post for pattern matching with a contained data type kicks in.  The function generateDotLabel uses a case expression to match against the nullary data constructors of NodeType.  In essence, this is a nice symbolic switch statement.  There are plenty of other pattern matches going on -- I particularly like the deconstruction of a List as (x:xs).  You can see this at work in my definition of show for the DotPath.  Here is where understanding left from right comes in handy.  The (x:_) on the left hand side of the equation is a pattern match, matching the x to the first element of the list and using the underscore to disregard the tail.  On the right hand side of the equation, as I am assembling the string, we see (x: path).  This is not a pattern match.  This is an insertion operation.  They look alike. They are not.  Pay attention to what side of the equation you are on.

Does Not Play Well with Arbitrary Contained Types
In an earlier post I stated the desire to make the TrieNode parametric.  I have made it parameteric.  See the type parameter 'a' in the data declaration below.

data NodeType = Top|Word|NonWord 
                deriving (Show,Eq)
data TrieNode a = TrieNode { wordFragment :: [a], 
                             children :: (Map.Map a (TrieNode a)), 
                             nodeType:: NodeType }  deriving (Show,Eq)

However when writing the code for generating the Dot, it became clear to me that unless I added type constraints ( Show a => ) to my data declaration, then I could not guarantee that my generating functions could render to Dot.  I enforced this by explicitly declaring the type for the main function for rendering.


generateDot :: TrieNode Char -> String -> [Dot]

This is something I plan on revisiting as it is possible to render the 'structure' but auto-generate the labels. Gotta think about that.

Probably Not Efficient
Yup. I am generating an array of Dot data types from a recursive operation.  More efficiently I should have had an accumulator that enters each sub-frame of the call-stack.  I am not sure I am inefficient.  I just suspect it.  The main function is (the where clause is left off):


generateDot nodeDatum path = myDotNode : myPaths ++ myChildrensDots


which just looks like it might not be TCO, what with the concatenation operations and myChildrensDots representing the results of the recursive calls.  On the other hand I am not sure I understand enough of Haskell's laziness to say whether this is really that that bad.

New Wrapper Data Type
Originally I had written this code as a series of string concatenation operations as I traversed the tree.  This grew ugly.  I introduced the Dot data structure with its two data constructors DotNode and DotPath to capture everything I need (for my cases) in a single Dot output line.  The process of rendering these data structure as strings is left for the overridden show instance and for a series of cool chained list functions:

generateDotPretty tree = L.intercalate "\n" $ map show  $ generateDot tree "_"

Luckily the Dot language is declarative and requires very little sequencing guarantees (i.e. you can put these commands in whatever order).

No comments:

Post a Comment