Haskell for all: Parsing chemical substructures

Monday, October 15, 2012

This is the second in a series of coding examples from my own work. /r/programming asked for non-trivial, yet digestible, Haskell examples, so I hope this fits the bill.

In this post I will show how learning Haskell changes the way you think. I will take a common Haskell idiom (monadic parsing) and apply it in a new way to solve a structural bioinformatics problem. To make this post more accessible to non-Haskell programmers, I'm going to gloss over implementation details and instead discuss at a high level how I apply Haskell idioms to solve my problem.

The `Parser` type

When we approach parsing problems we normally focus on the text as the central element of our algorithm and design our program around combining and manipulating text. Monadic parsing turns this approach on its head and places the parser front and center, where we instead combine and manipulate parsers.

This means we need to define a Parser type:

type Parser a = String -> [(a, String)]

A Parser a takes a starting String as its input and parses it to return a single value of type a and the remaining unconsumed String. There might be multiple valid parses, so we actually return a list of possible parsings instead of a single one. If the parse fails, we return an empty list signifying no valid parsings.
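As a minimal sketch of the three possible outcomes (no parse, one parse, several parses), here are two toy parsers written against the type synonym above. The names anyChar and anyPrefix are hypothetical and do not appear elsewhere in this post:

```haskell
type Parser a = String -> [(a, String)]

-- Hypothetical example: consume one character, failing on empty input
anyChar :: Parser Char
anyChar []       = []        -- Parse fails   : 0 results
anyChar (c : cs) = [(c, cs)] -- Parse succeeds: 1 result

-- Hypothetical example: an ambiguous parser that consumes any non-empty
-- prefix of the input, returning one result per possible split
anyPrefix :: Parser String
anyPrefix str = [splitAt n str | n <- [1 .. length str]]
```

>>> anyChar "abc"
[('a', "bc")]
>>> anyPrefix "abc"
[("a", "bc"), ("ab", "c"), ("abc", "")]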

We can also use Haskell's newtype feature to encapsulate the implementation and hide it from the user:

newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

This defines the Parser constructor which we use to wrap our parsing functions:

Parser :: (String -> [(a, String)]) -> Parser a

... and the runParser function which unwraps the Parser to retrieve the underlying function:

runParser :: Parser a -> (String -> [(a, String)])

Additionally, this encapsulation enables some "magic" later on.

Parsers

Haskell has a Bool data type, defined as:

data Bool = True | False

... so let's define a Parser that parses a True value:

import Data.List

parseTrue :: Parser Bool
parseTrue = Parser (\str ->
    if (isPrefixOf "True" str)
    then [(True, drop 4 str)] -- Parse succeeds: 1 result
    else []                   -- Parse fails   : 0 results
    )

Similarly, we can define a Parser for False:

parseFalse :: Parser Bool
parseFalse = Parser (\str ->
    if (isPrefixOf "False" str)
    then [(False, drop 5 str)] -- Parse succeeds: 1 result
    else []                    -- Parse fails   : 0 results
    )

Let's test out our parsers using ghci:

>>> runParser parseTrue "True Story"
[(True, " Story")]
>>> runParser parseFalse "True Story" -- Fails: not False
[]
>>> runParser parseFalse "Falsehood"
[(False, "hood")]
>>> runParser parseFalse " Falsehood" -- Fails: leading space
[]

What if we want to also skip initial spaces? We can define a parser that always succeeds and returns the string trimmed of all leading spaces:

skipSpaces :: Parser ()
skipSpaces = Parser (\str ->
    [((), dropWhile (== ' ') str)] -- Always succeeds: 1 result
    )

Let's confirm it works:

>>> runParser skipSpaces "          Falsehood"
[((), "Falsehood")]
>>> runParser skipSpaces "Apple"
[((), "Apple")]
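Since parseTrue and parseFalse differ only in the text they match and the value they return, the pattern can be factored out. The following is just a sketch: the helper name literal is hypothetical, and it uses stripPrefix from Data.List in place of the isPrefixOf/drop pair so that the prefix length never has to be counted by hand:

```haskell
import Data.List (stripPrefix)

newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

-- Hypothetical helper (not part of the original post): match a fixed
-- prefix and return the given value on success
literal :: String -> a -> Parser a
literal prefix value = Parser (\str ->
    case stripPrefix prefix str of
        Just rest -> [(value, rest)] -- Parse succeeds: 1 result
        Nothing   -> []              -- Parse fails   : 0 results
    )

parseTrue, parseFalse :: Parser Bool
parseTrue  = literal "True"  True
parseFalse = literal "False" False
```

>>> runParser parseTrue "True Story"
[(True, " Story")]
>>> runParser parseFalse " Falsehood"
[]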

Monads

Now we want to combine our Parsers so we can parse both a True and a False with optional spaces in between. This means we need some elegant way to take the unconsumed input from each Parser and feed it directly into the next Parser in the chain.

Fortunately, Haskell solves this problem cleanly using monads. A Monad defines an interface to two functions:

class Monad m where
    return :: a -> m a
    (>>=)  :: m a -> (a -> m b) -> m b

Like interfaces in any other language, we can program generically to that interface. Haskell's do notation works with this generic Monad interface, so we can use the imperative do syntax to manipulate anything that implements Monad.

Our Parser implements the Monad interface quite nicely:

instance Monad Parser where
    return a = Parser (\str  -> [(a, str)])
    m >>= f  = Parser (\str1 ->
        -- This is a list comprehension, basically
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2]
        )

This is not a monad tutorial, so I'm glossing over why that is the correct definition or what it even means, but if you want to learn more about monads, I highly recommend "You Could Have Invented Monads! (And Maybe You Already Have.)".
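One practical note for readers on a current compiler: since GHC 7.10, Applicative is a superclass of Monad, so the Monad instance above will not compile without Functor and Applicative instances. A sketch of the full set, keeping the same list-comprehension style, might look like this:

```haskell
newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

-- Apply a function to every parsed value, leaving the leftovers alone
instance Functor Parser where
    fmap f m = Parser (\str ->
        [(f a, rest) | (a, rest) <- runParser m str]
        )

-- Run the first parser, then the second on what the first left over
instance Applicative Parser where
    pure a    = Parser (\str  -> [(a, str)])
    mf <*> ma = Parser (\str1 ->
        [(f a, str3) | (f, str2) <- runParser mf str1,
                       (a, str3) <- runParser ma str2]
        )

-- 'return' is omitted here because it defaults to 'pure'
instance Monad Parser where
    m >>= f  = Parser (\str1 ->
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2]
        )
```

The (>>=) definition is unchanged from the one above; only the two extra instances are new.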

When we make Parser a Monad, we gain the ability to assemble Parsers using do notation, so let's use do notation to combine multiple parsers in an imperative style:

trueThenFalse :: Parser (Bool, Bool)
trueThenFalse = do
    t <- parseTrue
    skipSpaces
    f <- parseFalse
    return (t, f)

That reads just like imperative code: parse a True, skip some spaces, then parse a False. Finally, return the two values you parsed. This seems straightforward enough until you realize we haven't actually parsed any text yet! All we've done is combine our smaller parsers into a larger parser poised to be run on as many inputs as we please.

Let's make sure it works as advertised:

>>> runParser trueThenFalse "True   False leftovers"
[((True, False), " leftovers")]
>>> runParser trueThenFalse "False   True"
[]
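To see that do notation is only syntax for the Monad interface, the do block can be expanded by hand into calls to (>>=). The sketch below is self-contained so it can be loaded on its own; the name trueThenFalse' is hypothetical, and the Functor and Applicative instances (defined via liftM and ap from Control.Monad) are there only because current GHC versions require them:

```haskell
import Control.Monad (ap, liftM)
import Data.List (isPrefixOf)

newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

instance Functor Parser where
    fmap = liftM

instance Applicative Parser where
    pure a = Parser (\str -> [(a, str)])
    (<*>)  = ap

instance Monad Parser where
    m >>= f = Parser (\str1 ->
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2]
        )

parseTrue :: Parser Bool
parseTrue = Parser (\str ->
    if isPrefixOf "True" str then [(True, drop 4 str)] else []
    )

parseFalse :: Parser Bool
parseFalse = Parser (\str ->
    if isPrefixOf "False" str then [(False, drop 5 str)] else []
    )

skipSpaces :: Parser ()
skipSpaces = Parser (\str -> [((), dropWhile (== ' ') str)])

-- Roughly what the compiler turns the do block into: each "x <- p"
-- becomes "p >>= \x -> ..." and the unused result of skipSpaces is dropped
trueThenFalse' :: Parser (Bool, Bool)
trueThenFalse' =
    parseTrue  >>= \t ->
    skipSpaces >>= \_ ->
    parseFalse >>= \f ->
    return (t, f)
```

>>> runParser trueThenFalse' "True   False leftovers"
[((True, False), " leftovers")]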

Alternatives

Sometimes we want to try multiple parsing alternatives. For example, what if I want to parse a True or a False? I can define a (<|>) operator that tries both parsers and then returns the union of their results:

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser (\str ->
    runParser p1 str ++ runParser p2 str
    )

Now I can parse a Bool value without specifying which one and the parser will return which one it parsed:

parseBool :: Parser Bool
parseBool = parseFalse <|> parseTrue

>>> runParser parseBool "True Story"
[(True, " Story")]
>>> runParser parseBool "Falsehood"
[(False, "hood")]
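parseBool only ever returns one result because its two alternatives can never both match the same input. When the alternatives overlap, the union keeps every parse. A small sketch, using a hypothetical literal helper and a hypothetical ambiguous parser:

```haskell
import Data.List (stripPrefix)

newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser (\str ->
    runParser p1 str ++ runParser p2 str
    )

-- Hypothetical helper: match a fixed prefix and return it
literal :: String -> Parser String
literal prefix = Parser (\str ->
    case stripPrefix prefix str of
        Just rest -> [(prefix, rest)]
        Nothing   -> []
    )

-- Both alternatives can match an input starting with "True",
-- so such an input yields two parses
ambiguous :: Parser String
ambiguous = literal "Tr" <|> literal "True"
```

>>> runParser ambiguous "True Story"
[("Tr", "ue Story"), ("True", " Story")]
>>> runParser ambiguous "Trap"
[("Tr", "ap")]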

Parsing chemistry

Parsers have one more feature that might surprise you: There is nothing String-specific about them! With one tiny modification, we can generalize them to accept any type of input:

newtype Parser s a
  = Parser { runParser :: s -> [(a, s)] }

instance Monad (Parser s) where
    <EXACT same code as before>

(<|>) :: Parser s a -> Parser s a -> Parser s a
(<|>) = <EXACT same code as before>

Since s is "polymorphic", we can set it to any conceivable type and the above code still works. The only String-specific behavior lies within the specific parser definitions, and their new compiler-inferred types reflect that:

-- These parsers only accept Strings as input
parseFalse    :: Parser String Bool
parseTrue     :: Parser String Bool
trueThenFalse :: Parser String (Bool, Bool)
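To make the generalization concrete before moving on to chemistry, here is a sketch of the polymorphic Parser consuming a list of Ints instead of a String. The satisfy and evenThenOdd parsers are hypothetical illustrations, and the Functor and Applicative instances are included only so the code compiles on current GHC versions:

```haskell
import Control.Monad (ap, liftM)

newtype Parser s a
  = Parser { runParser :: s -> [(a, s)] }

instance Functor (Parser s) where
    fmap = liftM

instance Applicative (Parser s) where
    pure a = Parser (\s -> [(a, s)])
    (<*>)  = ap

-- Same code as the String version, with "str" renamed to "s"
instance Monad (Parser s) where
    m >>= f = Parser (\s1 ->
        [(b, s3) | (a, s2) <- runParser m     s1,
                   (b, s3) <- runParser (f a) s2]
        )

-- Hypothetical non-textual parser: consume the head of a list of Ints
-- if it satisfies the predicate
satisfy :: (Int -> Bool) -> Parser [Int] Int
satisfy p = Parser (\xs ->
    case xs of
        (x : rest) | p x -> [(x, rest)]
        _                -> []
    )

evenThenOdd :: Parser [Int] (Int, Int)
evenThenOdd = do
    e <- satisfy even
    o <- satisfy odd
    return (e, o)
```

>>> runParser evenThenOdd [2, 3, 4]
[((2, 3), [4])]
>>> runParser evenThenOdd [1, 2]
[]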

But there's no reason we can't define entirely different Parsers that accept completely different non-textual input, such as chemical structures. So I'll switch gears and define parsers for chemical Structures, where a Structure is some sort of a labeled graph:

data Structure = Structure {
    graph :: Graph          , -- Adjacency list
    atoms :: Vector AtomName} -- The node labels

... with some convenience functions I've defined for manipulating the Graph:

-- Return the edges of the graph
bonds :: Graph -> [Edge]

-- Remove an edge from the graph
deleteBond :: Edge -> Graph -> Graph

Now, I can define new Parsers that operate on Structures instead of Strings.

Parsing bonds

The most primitive parser I'm interested in parses a single bond. It requires two AtomNames which specify what kind of bond to look for (e.g. a carbon-carbon bond, except it can be even more specific). Then, it outputs which two indices in the graph matched that bond-specification:

parseBond :: AtomName -> AtomName -> Parser Structure (Int, Int)

I can use the list monad (i.e. a list comprehension) to define this primitive parser (and don't worry if you can't precisely follow this code):

parseBond name1 name2
  = Parser $ \(Structure oldGraph atoms) -> do
    -- The first atom must match "name1"
    i1 <- toList (findIndices (== name1) atoms)

    -- Some neighboring atom must match "name2"
    i2 <- filter (\i -> atoms ! i == name2) (oldGraph ! i1)

    -- Remove our matched bond from the graph
    let newGraph = deleteBond (i1, i2) oldGraph

    -- .. and return the matched indices
    return ((i1, i2), Structure newGraph atoms)

Haskell strongly encourages a pure functional style, which keeps me from "cheating" and using side effects or mutation to do the parsing. By sticking to a pure implementation, I gain several bonus features for free:

If our bond occurs more than once, this correctly matches each occurrence, even if some matches share an atom
If both AtomNames are identical, this correctly returns both orientations for each matched bond
This handles backtracking with (<|>) correctly
I can parallelize the search easily since every search branch is pure

I got all of that for just 6 lines of code!
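The Structure, Graph and Vector types above are project-specific, so the following is only a toy approximation that runs with nothing but the Prelude. It assumes a Structure made of a plain list of neighbor lists and a plain list of atom names, and its deleteBond is a simple stand-in rather than the real one:

```haskell
newtype Parser s a
  = Parser { runParser :: s -> [(a, s)] }

-- Toy stand-in for the post's Structure (an assumption, not the real
-- type): the i-th entry of the first list holds the neighbors of atom i
data Structure = Structure [[Int]] [String]
    deriving (Eq, Show)

-- Remove an undirected bond by deleting both of its directed edges
deleteBond :: (Int, Int) -> [[Int]] -> [[Int]]
deleteBond (i1, i2) g = [filter (keep i) ns | (i, ns) <- zip [0 ..] g]
  where
    keep i n = (i, n) /= (i1, i2) && (i, n) /= (i2, i1)

parseBond :: String -> String -> Parser Structure (Int, Int)
parseBond name1 name2
  = Parser $ \(Structure oldGraph names) -> do
    -- The first atom must match "name1"
    i1 <- [i | (i, name) <- zip [0 ..] names, name == name1]

    -- Some neighboring atom must match "name2"
    i2 <- filter (\i -> names !! i == name2) (oldGraph !! i1)

    -- Remove our matched bond from the graph
    let newGraph = deleteBond (i1, i2) oldGraph

    -- ... and return the matched indices
    return ((i1, i2), Structure newGraph names)

-- A C-C-O chain: atom 0 bonds to atom 1, atom 1 bonds to atom 2
ethanol :: Structure
ethanol = Structure [[1], [0, 2], [1]] ["C", "C", "O"]
```

>>> map fst (runParser (parseBond "C" "C") ethanol)
[(0, 1), (1, 0)]
>>> runParser (parseBond "C" "O") ethanol
[((1, 2), Structure [[1], [0], []] ["C", "C", "O"])]

The first query shows both orientations of the carbon-carbon bond, and the second shows the matched bond removed from the leftover structure.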

Parsing substructures

Now I can build more sophisticated parsers on top of this simple bond parser. For example, I can build a generic substructure parser which takes a sub-Structure to match and returns a list of matched indices:

parseSubstructure :: Structure -> Parser Structure [Int]

Again, if you don't precisely understand the code, that's okay:

parseSubstructure (Structure graph as)
    -- Use the State monad to keep track of matches
  = (`evalStateT` (V.replicate (V.length as) Nothing)) $ do

        -- foreach (i1, i2) in (bonds graph):
        forM_ (bonds graph) $ \(i1, i2) -> do

            -- Match the bond
            (i1', i2') <- lift $ parseBond (as ! i1) (as ! i2)

            -- The match must be consistent with other matches
            matches    <- get
            let consistent i1 i1' = case (matches ! i1) of
                    Nothing   -> True
                    Just iOld -> iOld == i1'
            guard (consistent i1 i1' && consistent i2 i2')

            -- Update the match list
            put (matches // [(i1, Just i1'), (i2, Just i2')])

        -- Return the final list of matches
        matchesFinal <- get
        justZ . sequence . toList $ matchesFinal

Like before, the code detects all matching permutations and backtracks if any step fails.
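For readers who want something they can run, here is a toy version of the same idea that needs only base. It is a sketch under several assumptions: the Structure is a plain list of neighbor lists plus a list of names, the match table is an association list threaded through foldM instead of a Vector held in StateT, and failP is a hypothetical helper standing in for guard:

```haskell
import Control.Monad (ap, foldM, liftM)

newtype Parser s a
  = Parser { runParser :: s -> [(a, s)] }

instance Functor (Parser s) where
    fmap = liftM

instance Applicative (Parser s) where
    pure a = Parser (\s -> [(a, s)])
    (<*>)  = ap

instance Monad (Parser s) where
    m >>= f = Parser (\s1 ->
        [(b, s3) | (a, s2) <- runParser m     s1,
                   (b, s3) <- runParser (f a) s2]
        )

-- Hypothetical helper: a parser that always fails
failP :: Parser s a
failP = Parser (const [])

-- Toy stand-in for Structure: neighbor lists and atom names
data Structure = Structure [[Int]] [String]

-- Toy bond parser: match one bond and delete it from the leftovers
parseBond :: String -> String -> Parser Structure (Int, Int)
parseBond name1 name2 = Parser $ \(Structure graph names) ->
    [ ((i1, i2), Structure (deleteBond i1 i2 graph) names)
    | (i1, n1) <- zip [0 ..] names, n1 == name1
    , i2 <- graph !! i1, names !! i2 == name2 ]
  where
    deleteBond i1 i2 g =
        [ [n | n <- ns, (i, n) /= (i1, i2), (i, n) /= (i2, i1)]
        | (i, ns) <- zip [0 ..] g ]

parseSubstructure :: Structure -> Parser Structure [Int]
parseSubstructure (Structure patGraph patNames) = do
    -- foreach bond in the pattern, extend the partial assignment
    matches <- foldM step [] patBonds
    -- Every pattern atom must have been matched
    maybe failP return (mapM (`lookup` matches) [0 .. length patNames - 1])
  where
    -- List each undirected bond of the pattern once
    patBonds = [(i, j) | (i, js) <- zip [0 ..] patGraph, j <- js, i < j]

    step matches (i1, i2) = do
        (i1', i2') <- parseBond (patNames !! i1) (patNames !! i2)
        -- The match must be consistent with other matches
        let consistent i i' = maybe True (== i') (lookup i matches)
        if consistent i1 i1' && consistent i2 i2'
            then return ((i1, i1') : (i2, i2') : matches)
            else failP

-- A C-C-O chain used as the search target
ethanol :: Structure
ethanol = Structure [[1], [0, 2], [1]] ["C", "C", "O"]
```

>>> map fst (runParser (parseSubstructure (Structure [[1], [0]] ["C", "O"])) ethanol)
[[1, 2]]
>>> map fst (runParser (parseSubstructure ethanol) ethanol)
[[0, 1, 2]]

In the second query the reversed carbon-carbon match is tried as well, but it is discarded when the carbon-oxygen bond turns out to be inconsistent with it.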

Reusable abstractions

I find it pretty amazing that you can build a substructure parser in just 18 lines of Haskell code. You might say I'm cheating because I'm not counting the number of lines of code I took to define the Parser type, the Monad implementation, and the (<|>) operator. However, the truth is I can actually get all of those features using 1 line of Haskell:

type Parser s = StateT s []
-- and rename all 'Parser' constructors to 'StateT'

So I lied: it's actually 19 lines of code.

I don't expect the reader to know what StateT or [] are, but what you should take away from this is that both of them are part of every Haskell programmer's standard repertoire of abstractions.

Moreover, when I combine them I automatically get a correct Monad implementation (i.e. do notation) and a correct Alternative implementation (which provides the (<|>) function), both for free!
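As a sketch of what that looks like for the String parsers from earlier, the code below assumes StateT from Control.Monad.Trans.State (in the transformers package that ships with GHC) and takes (<|>) from Control.Applicative instead of defining it. The literal helper is hypothetical:

```haskell
import Control.Applicative ((<|>))
import Control.Monad.Trans.State (StateT (..))
import Data.List (stripPrefix)

type Parser s = StateT s []

-- Hypothetical helper: match a fixed prefix and return the given value
literal :: String -> a -> Parser String a
literal prefix value = StateT (\str ->
    case stripPrefix prefix str of
        Just rest -> [(value, rest)]
        Nothing   -> []
    )

skipSpaces :: Parser String ()
skipSpaces = StateT (\str -> [((), dropWhile (== ' ') str)])

-- (<|>) comes from the Alternative instance of StateT over lists
parseBool :: Parser String Bool
parseBool = literal "False" False <|> literal "True" True

-- do notation comes from the Monad instance of StateT
trueThenFalse :: Parser String (Bool, Bool)
trueThenFalse = do
    t <- literal "True" True
    skipSpaces
    f <- literal "False" False
    return (t, f)
```

>>> runStateT trueThenFalse "True   False leftovers"
[((True, False), " leftovers")]
>>> runStateT parseBool "Falsehood"
[(False, "hood")]

Here runStateT plays the role of runParser, and no Monad or (<|>) code had to be written by hand.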

Conclusions

This is just one of many abstractions I used to complete a structural search engine for proteins. Now that it's done, I'll be blogging more frequently about various aspects of the engine's design to give people ideas for how they could use Haskell in their own projects. I hope these kinds of code examples pique people's interest in learning Haskell.

Appendix

I've included the full code for the String-based Parsers. The Structure-based Parsers depend on several project-specific data types, so I will just release them later as part of my protein search engine and perhaps factor them out into their own library.

Also, as a stylistic note, I prefer to use ($) to remove dangling final parentheses like so:

Parser (\str ->       =>  Parser $ \str ->
    someCode          =>      someCode
    )                 =>

... but I didn't want to digress from the post's topic by explaining how the ($) operator behaves.

import Data.List

newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

parseTrue :: Parser Bool
parseTrue = Parser (\str ->
    if (isPrefixOf "True" str)
    then [(True, drop 4 str)]
    else []
    )

parseFalse :: Parser Bool
parseFalse = Parser (\str ->
    if (isPrefixOf "False" str)
    then [(False, drop 5 str)]
    else []
    )

skipSpaces :: Parser ()
skipSpaces = Parser (\str ->
    [((), dropWhile (== ' ') str)]
    )

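-- GHC 7.10 and later require Functor and Applicative instances
-- before a Monad instance can be declared
instance Functor Parser where
    fmap f m = Parser (\str ->
        [(f a, rest) | (a, rest) <- runParser m str]
        )

instance Applicative Parser where
    pure a    = Parser (\str  -> [(a, str)])
    mf <*> ma = Parser (\str1 ->
        [(f a, str3) | (f, str2) <- runParser mf str1,
                       (a, str3) <- runParser ma str2]
        )

instance Monad Parser where
    return   = pure
    m >>= f  = Parser (\str1 ->
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2]
        )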

trueThenFalse :: Parser (Bool, Bool)
trueThenFalse = do
    t <- parseTrue
    skipSpaces
    f <- parseFalse
    return (t, f)

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser (\str ->
    runParser p1 str ++ runParser p2 str
    )

parseBool :: Parser Bool
parseBool = parseFalse <|> parseTrue

15 comments:

Anonymous, October 22, 2012 at 1:11 PM
Do you have any performance statistics for your structural search engine? For example, how big is a typical Structure graph and how long does your implementation take to search it? Have you compared it to an optimized C/C++ implementation?

I've tinkered with Haskell before but never managed to get anything close to C/C++ performance out of it. However, I've been working on numerical PDE solvers, which is quite a different sort of problem, so maybe Haskell just isn't so suitable for that sort of thing.

Andrew Dalke, October 22, 2012 at 3:37 PM
(Reddit is down so I moved the conversation here.) The conversation there is:

Me:

I've done molecular structure parsing for a long time, and I don't understand what you explained. First, what is your graph data structure? I get that atoms only have a name, and that a bond either exists or doesn't (that is, there's no bond type information; that's reasonable for structural bioinformatics though not for cheminformatics.) Does that mean that each bond link is stored twice? That is, if the atom at index 5 has 8 as an adjacent atom then the atom at index 8 has 5 as an adjacent? That appears to be the case.

(A cheminformatics package would have the atoms pointing to a bond, with bond type and aromaticity information, and the bond type would point back to each of the two atoms it's connected to.)

You do a 'deleteBond', and I couldn't tell if deletion goes as O(n) or O(log n) time. For your code to make sense, it must be the latter. Your molecules are protein sized, but how big are your substructures?

I can't figure out what your substructure patterns look like. You're matching on atom names, and atom names in proteins are unique in a residue, yes? This means there's very little backtracking. Even if you match on element name, there's relatively little call for extensive backtracking. If that's the case then that would explain why parseSubstructure is so short. I usually expect something like Ullmann or VF2 for the matching, and from what I can see, your algorithm for the general case is much less efficient. That is, it looks like you do successive linear scans of the entire bond list, which implies n*m time, where n is the number of atoms in the molecule and m the number of atoms in the subgraph.

I have a question about the <|> operator. What would happen if you wanted to find if a single substructure containing a linear chain of 12 carbon atoms exists in a buckyball? Would the (<|>) operator end up generating all 118440 matches? Or would you just get the first that it finds? How long does it take?

You replied:

The graph is undirected but since the Graph type is directed I store bonds as two directed edges and similarly delete them in pairs.

The graph is stored as an adjacency array, so the time complexity of deletion is only proportional to the worst case valency, which is 4 for proteins so it is constant (one array lookup, delete an element from a linked list of maximum size 4, repeat for the other direction).

Edit: Or maybe the correct term is adjacency list (I'm still new to graphs)? It's a node-indexed array where you store linked lists of neighbors, if that clears it up.

Everything you said about substructure matching is correct except that I don't scan the entire bond or atom list. I just scan the atom's neighbors. Again, because valency is capped proteins are the epitome of sparse graphs so this is efficient.

Regarding your last question, Haskell is lazy, so it computes only as many results as I demand. I always take the first result so this prevents the combinatorial blowup. But, it does not guarantee that the first match won't require a lot of backtracking (especially for parsing something like a buckyball). I specialized this algorithm to protein structures and their chemistry. I should have made that clear in the post; however, the topic was on the novelty of parsing non-textual things and not about the algorithm's efficiency.

Also, since you are interested in backtracking, Haskell has a very sophisticated backtracking library named LogicT that also handles pruning and fair branching. I didn't use it because it was unnecessary for the small motifs that I index.

Also, I use several other tricks to keep the search performant. The most significant one is that I partition the structure into pages of fixed volume since I have a specified upper bound on the size of each query (In this case, a cube of 15 angstroms). This is probably the optimization you were looking for me to talk about.

Anonymous, June 13, 2013 at 2:14 PM
Good article. My 2 pence:

It is better to use Haskell for all tasks. C++ can't efficiently use multiple cores and has data race issues. Haskell is slightly slower than C++, but optimized Haskell can be slower than C++ by only a factor of 2.

Dear Mr. Dalke,
you pointed out that you can employ a programmer for a couple of extra months to do the task in C++ rather than use Haskell. I feel that Haskellers can do another new project/task in that period. C++, Java, and scripting languages are a failure now. There is no way to handle concurrency and parallelism in these languages efficiently.
The author of the article rightly pointed out that we can throw more hardware at the problem rather than spend programmer time.
=> Haskell is the future whether you like it or not. That's it.