Saturday, October 12, 2013

An all-atom protein search engine powered by Haskell

This post discusses a Haskell-based project that is the central component of my thesis: a fast, atomic-level structural search engine named "Suns". I will discuss the problem this search engine solves, why I chose Haskell for the project, and the pros and cons of that choice.


The Challenge

Proteins are amazing molecules that control biology on the molecular level. If you are new to cellular biology, you can get an immediate sense for how diverse and sophisticated they are by watching The Inner Life of the Cell, an artistic rendition of several diverse processes controlled by proteins, including:

  • cellular adhesion (how cells recognize and stick to each other)
  • forming a skeleton for the cell
  • assembling and disassembling cellular highways
  • trafficking cargo along those highways

In addition, proteins also:

  • sense the molecular environment
  • perform chemistry
  • are cellular weapons
  • and much, much more!

I am a graduate student in a research lab that designs new proteins for both medical treatments and scientific study. Unfortunately, proteins are significantly harder to design than software. A typical protein can be very chemically complex and tightly integrated:

A protein structure in atomic detail

Proteins are tangled ropes of amino acids, so they are the epitome of spaghetti code. However, it's not as bad as it sounds: within these tangled balls, several patterns emerge on small, medium, and large scales.

On the large scale you have "protein domains". I liken these to high-level scripts: self-contained blobs of functionality with little coupling or state. They are often easy to connect together to generate new functionality:

Surface representation of two connected protein domains, colored green and cyan

On the medium scale you have "secondary structure", consisting primarily of alpha helices and beta strands, which form locally repetitive structures. I liken these to C code: low-level, but still reasonably portable. Our lab has historically had success transplanting patterns in secondary structure to new contexts, generating new functionality.

Secondary structure representation of a protein composed primarily of beta strands

On the small scale you have "motifs", small configurations of a few amino acids that frequently occur in natural proteins. I liken these to assembly code: they are very tightly integrated and pack a lot of utility into a small space, but they are not very portable because they have very precise chemical and geometric requirements and little room for customization:

An example motif, with a geometrically well-defined hydrogen-bonding pattern

Our lab designs proteins at this small scale, so my labmates use my search engine to discover these motifs and incorporate them into their designed proteins. The search engine overcomes the "portability issues" of these low-level motifs by finding existing motifs that are exactly geometrically and chemically compatible with partial blueprints.

Overview

The most popular interface to the search engine is through PyMOL, a molecular graphics system that lets you visualize protein structures three-dimensionally. You can find a PyMOL search plugin on the official Suns web site, which also includes an extended tutorial for how to use the search plugin.

The search plugin lets you point and click on multiple protein fragments and then click "Search". Typically, in under a second, it streams up to 100 matches to the fragments of interest, superimposed on your original search query. You can then expand on your original query by incorporating new fragments from the search results:

An example iterative design process that begins from a guanidinium group and grows into an N-terminal helix-capping motif. Black dashes highlight fragments incorporated from search results.

The PyMOL plugin is a thin Python-based client that connects to a remote search backend written in Haskell.


Why Haskell?

I did not begin this project in Haskell; I actually began it in C. The reason was that at the time I did not have a solid background in data structures and algorithms, so I would reflexively turn to C for high-performance projects, trying to compensate by improving the constant factors of my code. However, all-atom searching is the kind of problem where the algorithm matters and bare-metal performance does not. My initial implementation in C attempted to brute-force the solution, to no avail, taking hours to perform a single search.

The second problem I encountered was that as soon as I got the first slow draft working I received additional feature requests that I was not prepared to handle. C is a very brittle language to program in because of manual memory management and the lack of a standard library of data structures. However, like many inexperienced programmers I had trouble saying "no" and bit off more than I could chew trying (and failing) to satisfy these feature requests.

At the time I had been dabbling in Haskell in my free time, and several things intrigued me about the language:

  • expressive and strong types
  • an elegant mix of imperative and functional style (using monads)
  • high-level idioms and design patterns

Plus, Haskell is garbage collected and has a rich library of data structures, which solved my two main qualms with C, so I decided to rewrite my code base in Haskell.

This rewrite was not a simple translation of the original C code. One of the main reasons that I transitioned was that I wanted to familiarize myself with improved data structures and algorithms. Haskell made it very easy to prototype new code for several reasons:

  • You have a rich set of high-performance purely functional data structures in the containers, unordered-containers, and vector libraries
  • Garbage collection, type inference, and type classes automate common menial tasks
  • Strong static types make it a breeze to refactor code without having to write tests

Consequently, Haskell gave me much greater confidence to rewrite large swaths of code because the compiler caught more of my mistakes and automated more details. This encouraged me to try out new ideas and rapidly gravitate towards a good solution instead of settling for my first working one. These changes improved performance dramatically: queries now took milliseconds instead of hours. At that point PyMOL became the bottleneck, unable to load search results fast enough, so there was no point optimizing further.
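
To give a flavor of that kind of algorithmic change, here is a purely illustrative sketch (not the actual Suns algorithm): binning atoms into a spatial grid with Data.Map, so that a neighbor query only inspects nearby cells instead of every atom:

import qualified Data.Map as Map

type Point = (Double, Double, Double)
type Cell  = (Int, Int, Int)

cellOf :: Double -> Point -> Cell
cellOf size (x, y, z) =
    (floor (x / size), floor (y / size), floor (z / size))

buildGrid :: Double -> [Point] -> Map.Map Cell [Point]
buildGrid size ps =
    Map.fromListWith (++) [ (cellOf size p, [p]) | p <- ps ]

-- Candidate neighbors within one cell width: inspect the 27
-- surrounding cells rather than scanning the whole structure
nearby :: Double -> Map.Map Cell [Point] -> Point -> [Point]
nearby size grid p = concat
    [ Map.findWithDefault [] (i + di, j + dj, k + dk) grid
    | di <- [-1, 0, 1], dj <- [-1, 0, 1], dk <- [-1, 0, 1] ]
  where
    (i, j, k) = cellOf size p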

Another major advantage of Haskell for my project is parsing. Haskell libraries excel at parsing by necessity: there is a huge pressure to convert untyped textual data to more strongly typed data structures in order to take advantage of the type system. I want to give a special shoutout to attoparsec and aeson, which are fantastic for this purpose.
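
As a small illustration of why attoparsec shines here, the following sketch parses a whitespace-separated, PDB-like atom record. The Atom type and the format are made up for this example; this is not the actual Suns parser:

{-# LANGUAGE OverloadedStrings #-}

import Data.Attoparsec.Text
import qualified Data.Text as T

data Atom = Atom
    { atomName            :: T.Text
    , atomX, atomY, atomZ :: Double
    } deriving (Show)

atomLine :: Parser Atom
atomLine = do
    _    <- string "ATOM"
    skipSpace
    name <- takeWhile1 (not . isHorizontalSpace)
    skipSpace
    x    <- double
    skipSpace
    y    <- double
    skipSpace
    z    <- double
    return (Atom name x y z)

-- parseOnly atomLine "ATOM CA 12.3 4.56 -7.8"
--     == Right (Atom "CA" 12.3 4.56 (-7.8))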

Another big benefit of programming in Haskell was that my code was remarkably stable and devoid of bugs. This meant that I spent less time maintaining the code and more time adding features.


Problems with Haskell

The first issue I encountered is that if you program in Haskell you need to be prepared to fill in gaps in the library ecosystem. I began at a time when the Haskell ecosystem was beginning to mature, but there were still several important gaps. The foremost of these was a stream programming library, which I had to write myself because I was unsatisfied with existing iteratee-based solutions; this eventually evolved into my pipes stream programming library. If you are considering starting a large project in Haskell, expect to do some library work of your own.

Another issue was numerical code. Here the ecosystem is limited almost exclusively to hmatrix (Haskell bindings to the GNU Scientific Library) and repa (a fast, native Haskell library, but without a standard library of numerical algorithms). hmatrix was problematic because it is GPLv3-licensed, and my university, UCSF, forbids releasing code under the GPLv3 because of the patent clause. This meant that I was forced to write my own LAPACK bindings for performance-critical code. Haskell's foreign function interface (FFI) is nice, but not having to use it at all is even nicer.

Next, space leaks cause problems. Space leaks are like the Haskell analog to C's segmentation faults: they require programmer discipline to prevent and they are not easy to debug. With experience I learned how to avoid them, mainly by:

  • Defining data types with strict fields
  • Ensuring all folds are strict in the accumulator

... but this is still a problem because it bites beginners to the language really early. Also, space leaks are the biggest source of program instability, which is a shame because they compromise what would otherwise be a rock-solid stability story for the language.
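
As a toy sketch of those two defenses (strict fields and strict folds; made-up names, not from the Suns codebase):

{-# LANGUAGE BangPatterns #-}

import Data.List (foldl')

-- Strict fields: the bangs force the coordinates when the
-- constructor is applied, so thunks cannot accumulate inside
data Point = Point !Double !Double !Double

-- A strict left fold: the bang pattern keeps the accumulator
-- evaluated at every step instead of building a chain of thunks
sumX :: [Point] -> Double
sumX = foldl' step 0
  where
    step !acc (Point x _ _) = acc + x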

The next problem is the lack of standardized error-handling idioms. I ended up writing my own library for this, too, called the errors library, but this did not completely solve the problem either. I'm currently collaborating with John Wiegley and Michael Snoyman (of Yesod fame) to try to solve this problem.

Another problem is the pre-existing ecosystem of functionality dependent on lazy IO. I'm not going to name any names, but one library that I depend on uses lazy IO internally and it is still a source of bugs. Lazy IO causes these sorts of problems because it does not guarantee the ordering of effects and also results in IO exceptions being thrown from pure code (which is a big no-no in the Haskell world).


Conclusions

The code for this project is open source and hosted on GitHub. It consists of three separate projects:

You can also learn more about the project on the official Suns website.

People interested in learning Haskell may find the code for the search engine useful for seeing what a medium-sized Haskell application looks like. Haskell gets a bad reputation for being overly academic, and I hope people can see past that to discover an excellent programming language for rapid development with high reliability.

Wednesday, October 9, 2013

How to reimplement the conduit parsing API in 50 lines of pipes code

Michael's recent blog posts highlighted several deficiencies of pipes-based parsing. In particular, he emphasized that it is incompatible with idioms from the core pipes library, and I agree with that assessment. Programming with pipes-parse is a different beast from programming with vanilla pipes, and pipes-parse idioms more closely resemble conduit idioms.

Several comments in response to Michael's post asked whether one could implement conduit on top of pipes internally, in order to simplify conduit's internals. This post answers half of that question by showing how to implement conduit, sans finalization, on top of pipes using the tools from pipes-parse.

This code is short, but very dense, so I will walk through the implementation step-by-step, explaining the underlying pipes-parse idioms that I'm using to reconstruct conduit functionality. If you just want to skip to the complete code then go straight to the Appendix at the end of this post.


The Conduit Type

The way you internally represent a conduit-like parser using pipes is the following data type:

import Pipes
import Control.Monad.Trans.State.Strict

newtype ConduitM i o m r = ConduitM
    { unConduitM ::
         forall x . Producer o (StateT (Producer i m x) m) r }

To recap, a ConduitM i o m r has an input of type i and an output of type o, and the output is distinct from the return value, just like pipes.

I model this as a Producer of os that reads from and writes to a Producer of is. The Producer of is is our conduit's upstream input end: awaiting a value pops an element off of this Producer, and adding back a leftover pushes an element back onto it.

This representation differs from conduit's implementation in one important way: it makes no distinction between leftovers and future input. Both are stored together within the inner Producer. This is one neat trick of reifying future input as a Producer: you now have an obvious place to store leftovers.


Primitives

The next step is to implement await, which is just a thin wrapper around draw from pipes-parse:

type Consumer i m r = forall o . ConduitM i o m r

await :: (Monad m) => Consumer i m (Maybe i)
await = ConduitM $ lift $ liftM f draw
  where
    f (Left  _) = Nothing
    f (Right r) = Just r

However, this doesn't explain what draw is actually doing, so let's refer to its implementation:

draw = do
    p <- get                 -- Retrieve our source of input
    x <- lift (next p)       -- Attempt to pop one element off
    case x of
        Right (a, p') -> do  -- Success: bind the element
            put p'           -- Save the remainder
            return (Right a)
        Left   r      -> do  -- Failure: No more input
            put (return r)
            return (Left  r)

If you are more comfortable with StateT you might prefer the following less verbose form which inlines all the state passing:

draw = StateT $ \p -> do
    x <- next p
    return $ case x of
        Right (a, p') -> (Right a, p'      )
        Left      r   -> (Left  r, return r)

If you think of a Producer a m () as isomorphic to ListT m a, then next is the equivalent of uncons for ListT.
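
For reference, next from pipes has the following type, which makes the uncons analogy precise:

next :: (Monad m) => Producer a m r -> m (Either r (a, Producer a m r))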

Similarly, we can add elements back to the producer using leftover, which is just a thin wrapper around unDraw from pipes-parse:

leftover :: (Monad m) => i -> ConduitM i o m ()
leftover i = ConduitM $ lift $ unDraw i

unDraw has a simple implementation:

unDraw a = modify (Pipes.yield a >>)

It just prepends a yield statement onto the Producer. This is the equivalent of cons for ListT.
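
As a quick sanity check (using the Monad instance for ConduitM from the Appendix), a leftover is returned by the very next await, before any fresh input is drawn:

demo :: (Monad m) => Consumer Int m (Maybe Int)
demo = do
    leftover 1
    await  -- returns (Just 1) without consulting upstream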

What about yield? Well, that's exactly the same:

yield :: (Monad m) => o -> ConduitM i o m ()
yield o = ConduitM (Pipes.yield o)

Composition

Now for the interesting part: conduit composition, which has the following type:

(=$=) :: (Monad m)
      => ConduitM a b m ()
      -> ConduitM b c m r
      -> ConduitM a c m r

If we were to replace these ConduitMs with the underlying pipe type, we would get:

(=$=) :: (Monad m)
      => forall x . Producer b (StateT (Producer a m x) m) ()
      -> forall y . Producer c (StateT (Producer b m y) m) r
      -> forall z . Producer c (StateT (Producer a m z) m) r

How do we even begin to approach that?

The key is the runStateP function from Pipes.Lift, which has the following (simplified) type:

runStateP
    :: s -> Producer a (StateT s m) r -> Producer a m (r, s)

Compare this with the type for runStateT:

runStateT :: StateT s m r -> s -> m (r, s)

runStateP differs from runStateT in two ways:

  • runStateP unwraps a StateT buried inside of a pipe

  • runStateP takes arguments in the opposite order from runStateT

runStateP takes care to thread the state as it unwraps the internal StateT, so it behaves just like runStateT. Once you familiarize yourself with how runStateP works, the solution is a matter of type-chasing. In fact, what you will discover is that if you restrict yourself to runStateP, there is only one solution that type-checks.

We begin with the two arguments to our operator:

ConduitM pL =$= ConduitM pR = ConduitM $ ...

pL has type:

pL :: forall x . Producer b (StateT (Producer a m x) m) ()

Let's look at what type we get when we unwrap pL using runStateP:

parseL
    :: (Monad m)
    => Producer a m x -> Producer b m ((), Producer a m x)
parseL as = runStateP as pL

This now looks just like a parser combinator. It takes an input stream of values of type a and generates an output stream of values of type b, returning unused input alongside the () return value. We're not interested in this () return value, so we'll use execStateP instead:

parseL
    :: (Monad m)
    => Producer a m x -> Producer b m (Producer a m x)
parseL as = execStateP as pL

Similarly, we'll convert pR to a parser:

parseR
    :: (Monad m)
    => Producer b m y -> Producer c m (r, Producer b m y)
parseR bs = runStateP bs pR

Now what's our goal? We're trying to build a ConduitM a c m r, which is equivalent to the following parser:

parse
    :: (Monad m)
    => Producer a m z -> Producer c m (r, Producer a m z)

This means that we need to introduce a stream of as:

parse as = do
    -- as :: Producer a m x

We can now feed that stream to parseL:

parse as = do
    -- as        :: Producer a m x
    -- parseL as :: Producer b m (Producer a m x)

We can then feed that to parseR. This works because parseR is universally quantified in y, which type-checks as Producer a m x:

parse as = do
    -- as  :: Producer a m x
    -- parseL as
    --     :: Producer b m (Producer a m x)
    -- parseR (parseL as)
    --     :: Producer c m (r, Producer b m (Producer a m x))

This is almost what we want. We just need to discard the unused stream of bs:

parse as = do
    (r, pp) <- parseR (parseL as)
    p'      <- lift $ drain pp
    return (r, p')
  where
    drain p = runEffect (for p discard)

If we inline all of that logic, we get the following 5-line implementation of conduit composition:

ConduitM pL =$= ConduitM pR = ConduitM $
    stateP $ \as -> do
          (r, pp) <- runStateP (execStateP as pL) pR
          p'      <- lift $ drain pp
          return (r, p')

This gives a bird's-eye view of how conduit composition works. When we compose two conduits, we:

  • Feed the input stream of as to the upstream conduit
  • Feed that to the downstream conduit
  • Discard all leftovers from their intermediate interface

Once we have this composition operator, right and left fusion are just type specializations (the same as in conduit):

type Conduit  i m o = ConduitM i  o    m ()
type Source     m o = ConduitM () o    m ()
type Sink     i m r = ConduitM i  Void m r

($=) :: (Monad m) => Source m a -> Conduit a m b -> Source m b
($=) = (=$=)

(=$) :: (Monad m) => Conduit a m b -> Sink b m c -> Sink a m c
(=$) = (=$=)

What about ($$)? That is even simpler:

empty :: (Monad m) => Producer () m r
empty = return () >~ cat  -- equivalent to "forever $ yield ()"

($$) :: (Monad m) => Source m a -> Sink a m b -> m b
ConduitM pL $$ ConduitM pR =
    evalStateT (runEffect pR) (evalStateP empty pL)

This implementation says, at a high level:

  • Feed an unused leftover stream to pL (unused because it's a Source)
  • Feed that to pR
  • There is no step 3
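
At this point we have enough machinery for a small end-to-end usage sketch. The helpers below (sourceFromList, mapC, and sumC) are hypothetical, written against the primitives defined above and the Monad instance from the Appendix:

sourceFromList :: (Monad m) => [o] -> Source m o
sourceFromList = mapM_ yield

mapC :: (Monad m) => (a -> b) -> Conduit a m b
mapC f = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> do
            yield (f a)
            mapC f

sumC :: (Monad m, Num a) => Sink a m a
sumC = go 0
  where
    go n = do
        ma <- await
        case ma of
            Nothing -> return n
            Just a  -> go $! n + a

-- sourceFromList [1..10] $= mapC (*2) $$ sumC
--     returns 110 in any monad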

Identity

If that is composition, what is the identity? Why, it's just input from pipes-parse:

idP :: (Monad m) => ConduitM a a m ()
idP = ConduitM (void input)

Neat how that works out. This is equivalent in behavior to:

idP = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> do
            yield a
            idP

Connect and Resume

Last but not least we need connect and resume. Like I said before, this will ignore finalization concerns, so I will only implement a variation on ($$+) that returns a new Source, rather than a ResumableSource (which is just a Source tagged with a finalizer).

($$+)
    :: (Monad m)
    => Source m a -> Sink a m b -> m (b, Source m a)
ConduitM pL $$+ ConduitM pR = do
    (b, as) <- runEffect $ runStateP (execStateP empty pL) pR
    let as' = ConduitM $ stateP $ \p -> ((), p) <$ as
    return (b, as')

This says:

  • Feed an unused input stream to pL (it's a Source)
  • Feed that to pR
  • Discard pR's nonexistent output (it's a Sink)
  • Create a new Source that also ignores its input stream
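
Here is a usage sketch of connect-and-resume, reusing the hypothetical sourceFromList from the earlier example:

resumeDemo :: (Monad m) => m (Maybe Int, Maybe Int)
resumeDemo = do
    (x, rest) <- sourceFromList [1, 2, 3] $$+ await
    (y, _)    <- rest $$+ await
    return (x, y)  -- (Just 1, Just 2)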

Conclusion

The purpose of this post is not to suggest that Michael should necessarily implement conduit in terms of pipes, especially since this implementation does not yet handle finalization. Rather, I wanted to show that pipes is a powerful tool that you can use to build other abstractions concisely and with less room for error.

Appendix

The minimal test implementation is 50 lines of code, which I've included here:

{-# LANGUAGE RankNTypes #-}

import Control.Applicative ((<$))
import Control.Monad (liftM, void)
import Data.Void (Void)
import Pipes hiding (await, yield, Consumer)
import qualified Pipes
import Pipes.Lift
import Pipes.Parse

newtype ConduitM i o m r = ConduitM
    { unConduitM :: forall x .
        Producer o (StateT (Producer i m x) m) r }

instance (Monad m) => Monad (ConduitM i o m) where
    return r = ConduitM (return r)
    m >>= f  = ConduitM $ unConduitM m >>= \r -> unConduitM (f r)

instance MonadTrans (ConduitM i o) where
    lift m = ConduitM (lift (lift m))

type Consumer i m r = forall o . ConduitM i  o    m r
type Source     m o =            ConduitM () o    m ()
type Sink     i m r =            ConduitM i  Void m r
type Conduit  i m o =            ConduitM i  o    m ()

await :: (Monad m) => Consumer i m (Maybe i)
await = ConduitM $ lift $ liftM f draw
  where
    f (Left  _) = Nothing
    f (Right r) = Just r

yield :: (Monad m) => o -> ConduitM i o m ()
yield o = ConduitM (Pipes.yield o)

leftover :: (Monad m) => i -> ConduitM i o m ()
leftover i = ConduitM $ lift $ unDraw i

(=$=)
    :: (Monad m)
    => Conduit a m b
    -> ConduitM b c m r
    -> ConduitM a c m r
ConduitM pL =$= ConduitM pR = ConduitM $
    stateP $ \p -> do
          (r, pp) <- runStateP (execStateP p pL) pR
          p'      <- lift $ drain pp
          return (r, p')

drain :: (Monad m) => Producer a m r -> m r
drain p = runEffect (for p discard)

($=) :: (Monad m) => Source m a -> Conduit a m b -> Source m b
($=) = (=$=)

(=$) :: (Monad m) => Conduit a m b -> Sink b m c -> Sink a m c
(=$) = (=$=)

empty :: (Monad m) => Producer () m r
empty = return () >~ cat

($$) :: (Monad m) => Source m a -> Sink a m b -> m b
ConduitM pL $$ ConduitM pR =
    evalStateT (runEffect pR) (evalStateP empty pL)

idP :: (Monad m) => ConduitM a a m ()
idP = ConduitM (void input)

($$+)
    :: (Monad m)
    => Source m a -> Sink a m b -> m (b, Source m a)
ConduitM pL $$+ ConduitM pR = do
    (b, pa) <- runEffect $ runStateP (execStateP empty pL) pR
    let p' = ConduitM $ stateP $ \p -> ((), p) <$ pa
    return (b, p')

Sunday, October 6, 2013

Manual proofs for the `pipes` laws

Out of all of Haskell's streaming libraries, pipes is the only one that does not have a test suite. This is because I prefer to use equational reasoning to prove the correctness of pipes. For those new to Haskell, equational reasoning is the use of equations to prove code correct without ever running it. Haskell makes this possible by strictly enforcing purity.

Equational reasoning works well for a library like pipes because most properties of interest can easily be specified as category theory laws. If you are unfamiliar with category theory, you can read this short primer to learn how it relates to programming.

Every primitive in the pipes library is related to a category in some way:

  • yield and (~>) are the identity and composition operators of a category
  • await and (>~) are the identity and composition operators of a category
  • cat and (>->) are the identity and composition operators of a category
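
For reference, here are the (simplified) types of these six primitives:

yield :: (Monad m) => a -> Producer a m ()
(~>)  :: (Monad m)
      => (a -> Producer b m r)
      -> (b -> Producer c m ())
      -> (a -> Producer c m r)

await :: (Monad m) => Consumer a m a
(>~)  :: (Monad m)
      => Consumer a m b
      -> Consumer b m c
      -> Consumer a m c

cat   :: (Monad m) => Pipe a a m r
(>->) :: (Monad m)
      => Pipe a b m r
      -> Pipe b c m r
      -> Pipe a c m r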

This means that we can state the expected behavior of all six primitives just by writing down their category laws:

-- Left Identity
yield ~> f = f

-- Right Identity
f ~> yield = f

-- Associativity
(f ~> g) ~> h = f ~> (g ~> h)


-- Left Identity
await >~ f = f

-- Right Identity
f >~ await = f

-- Associativity
(f >~ g) >~ h = f >~ (g >~ h)


-- Left Identity
cat >-> f = f

-- Right Identity
f >-> cat = f

-- Associativity
(f >-> g) >-> h = f >-> (g >-> h)

Category theory laws are like tests: they are properties that we expect our code to obey if we wrote it correctly. These laws have the nice property that they uncannily summarize many of the properties we care about in a remarkably small set of equations. For example, (~>) is defined in terms of for like this:

(f ~> g) x = for (f x) g

When you restate the category laws for yield and (~>) in terms of for, you get the "for loop laws":

-- Looping over a single yield simplifies to function
-- application

for (yield x) $ \e -> do
    f e

= f x


-- Re-yielding every element of a generator returns the
-- original generator

for gen $ \e -> do
    yield e

= gen


-- Nested for loops can become sequential for loops if the
-- inner loop body ignores the outer loop variable

for gen0 $ \i -> do
    for (f i) $ \j -> do
        g j

= for gen1 $ \j -> do
    g j
  where
    gen1 = for gen0 $ \i -> do
        f i

These are common sense laws that we would expect any language with generators and for loops to obey. Amazingly, the for loop laws emerge naturally from the category laws, almost as if they were handed down to us on stone tablets.
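
These laws are also easy to spot-check on concrete examples. For instance, by the first law the following two effects are equal; both print 7 when run with runEffect:

lhs :: Effect IO ()
lhs = for (yield (7 :: Int)) (lift . print)

rhs :: Effect IO ()
rhs = lift (print (7 :: Int))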

For a while I had been writing these proofs out by hand, informally, in notebooks, but now that I'm writing up my thesis I needed to organize my notes and fill in all the little steps that I would previously skip over. You can find the full set of my notes, containing all of my manual proofs of the pipes laws, as an organized markdown file committed to the pipes repository.


Caveats


There are several caveats that I need to state about these proofs:

First, there is the obvious caveat that these proofs are still not machine-checked, but that's the next step. Paolo Capriotti has been patiently teaching me how to encode my reasoning into Agda. I believe this should be possible and that the proofs are very close to sound, taking into account the caveats below.

Second, these proofs do not account for bottom (i.e. _|_). My intuition is that the pipes laws hold even in the presence of bottom, but I am still new to reasoning about bottom, so I omitted that for now. If anybody can point me to a good reference on this I would be grateful.

Third, these proofs use non-standard coinductive reasoning. To explain why, note that typical coinductive reasoning in Haskell requires the result to be productive, guarding each coinductive step behind the production of a constructor. However, not all pipelines are productive, as the following examples illustrate:

infinite = forever $ yield ()

infinite >-> forever await = _|_

for infinite discard = _|_

So instead of using productivity to order my coinductive reasoning, I use "consumption": all uses of coinduction are guarded behind the consumption of a constructor. In other words, my proofs do not guarantee that pipe composition makes progress in production, but they do guarantee that pipe composition makes progress in consumption.

Fourth, another departure from typical coinductive proofs is notation. Coinductive reasoning uses bisimilarity to prove "equality". I also use bisimilarity, but notationally I write it as a single chain of equations bisected by a reference to coinduction. The two halves of the chain, above and below the bisection, constitute the two bisimilar sides.

Fifth, proofs notationally discharge other proofs using headers as the namespace. If you see a reference like [Kleisli Category - Left Identity Law - Pointful], that means you should look under top-level header Kleisli Category for sub-header Left Identity Law and sub-sub-header Pointful.

Sixth, I've only proven the laws for the bidirectional API in the Pipes.Core module. However, the category laws for the unidirectional API are very easy to derive from the laws for the bidirectional API, so I leave those as an exercise until I have time to add them myself.


Conclusions


My goal is to eventually machine-check these proofs so that pipes will be a practical and instructive example of a library with a statically verified kernel. Moreover, I hope that these proofs will allow other people to equationally reason about even more sophisticated systems built on top of pipes and prove higher-level properties by discharging the proofs that I have assembled for them.