Haskell for all: An all-atom protein search engine powered by Haskell

Saturday, October 12, 2013

An all-atom protein search engine powered by Haskell

This post discusses a Haskell-based project that is the central component of my thesis: a fast, atomic-level structural search engine named "Suns". I will discuss what problem this search engine solves, why I chose Haskell for this project, and what were the pros and cons of my language choice.

The Challenge

Proteins are amazing molecules that control biology on the molecular level. If you are new to cellular biology, you can get an immediate sense for how diverse and sophisticated they are by watching The Inner Life of the Cell, an artistic rendition of several diverse processes controlled by proteins, including:

cellular adhesion (how cells recognize and stick to each other)
forming a skeleton for the cell
assembling and disassembling cellular highways
trafficking cargo along those highways

In addition, proteins also:

sense the molecular environment
perform chemistry
are cellular weapons
and much, much more!

I am a graduate student in a research lab that designs new proteins for both medical treatments and scientific study. Unfortunately, proteins are significantly harder to design than software. A typical protein can be very chemically complex and tightly integrated:

A protein structure in atomic detail

Proteins are tangled ropes of amino acids, so they are the epitome of spaghetti code. However, it's not as bad as it sounds. Within these tangled balls we see several patterns emerge, both on small, medium, and large scales.

On a larger scale you have "protein domains". I liken these to high-level scripts: self-contained blobs of functionality with little coupling or state. They are often easy to connect together to generate new functionality:

Surface representation of two connected protein domains, colored green and cyan

On the medium scale you have "secondary structure", consisting primarily of alpha helices and beta strands, which form locally repetitive structures. I liken these to C code: low-level, but still reasonably portable. Our lab has historically had success transplanting patterns in secondary structure to new contexts, generating new functionality.

Secondary structure representation of a protein composed primarily of beta strands

On the small scale you have "motifs", small configurations of a few amino acids that frequently occur in natural proteins. I liken these to assembly code: they are very tightly integrated and pack a lot of utility into a small space, but they are not very portable because they have very precise chemical and geometric requirements and little room for customization:

An example motif, with a geometrically well-defined hydrogen-bonding pattern

Our lab needs to design proteins on this small and compact scale, so they use my search engine to discover and incorporate these small motifs into their designed proteins. The search engine overcomes the "portability issues" of these low-level motifs by finding existing motifs that are exactly geometrically and chemically compatible with partial blueprints.

Overview

The most popular interface to the search engine is through PyMOL, a molecular graphics system that lets you visualize protein structures three-dimensionally. You can find a PyMOL search plugin on the official Suns web site, which also includes an extended tutorial for how to use the search plugin.

The search plugin lets you point and click on multiple protein fragments and then click "Search". Typically within less than a second it will stream up to 100 matches to the fragment of interest, superimposed on your original search query. You can then expand on your original query by incorporating new fragments from the search results:

An exmple iterative design process that begins from a guanidinium group and which grows to an N-terminal helix-capping motif. Black dashes highlight fragments incorporated from search results.

The PyMOL plugin is a thin Python-based client that connects to a remote search backend written in Haskell.

Why Haskell?

I initially did not begin this project in Haskell. I actually began it in C. The reason why was that at the time I did not have a solid background in data structures and algorithms, so I would reflexively turn to C for high-performance projects to try to make up for it by improving the constant factors of my code. However, all-atom searching is the kind of problem where the algorithm matters and the bare-metal performance does not. My initial implentation in C attempted to brute force the solution to no avail, taking hours to perform a single search.

The second problem that I encountered was that when I got the first slow draft working I immediately got additional feature requests and I was not prepared to handle them. C is a very brittle language to program in because of manual memory management and no standard library of data structures. However, like many inexperienced programmers I had trouble saying "no" and bit off more than I could chew trying (and failing) to satisfy these feature requests.

At that time I had been dabbling in Haskell in my free time and took interest in the language. Things that intrigued me about the language included:

expressive and strong types
an elegant mix of imperative and functional style (using monads)
high-level idioms and design patterns

Plus, Haskell is garbage collected and has a rich library of data structures, which solved my two main qualms with C, so I decided to rewrite my code base in Haskell.

This rewrite was not a simple translation of the original C code. One of the main reasons that I transitioned was that I wanted to familiarize myself with improved data structures and algorithms. Haskell made it very easy to prototype new code for several reasons:

You have a rich set of high-performance purely functional data structures in the containers, unordered-containers, and vector libraries
Garbage collection, type inference, and type classes automate common menial tasks
Strong static types make it a breeze to refactor code without having to write tests

Consequently, Haskell gave me much greater confidence to rewrite large swaths of code because the compiler caught more of my mistakes and automated more details. This encouraged me to try out new ideas and rapidly gravitate towards a good solution instead of getting complacent with my first working solution. These changes improved performance and now queries took milliseconds instead of hours. At this point, PyMOL became the bottleneck and could not load search results fast enough, so there was no point improving performance further.

Another major advantage of Haskell for my project is parsing. Haskell libraries excel at parsing by necessity: there is a huge pressure to convert untyped textual data to more strongly typed data structures in order to take advantage of the type system. I want to give a special shoutout to attoparsec and aeson, which are fantastic for this purpose.

Another big benefit of programming in Haskell was that my code was remarkably stable and devoid of bugs. This meant that I spent less time maintaining the code and more time adding features.

Problems with Haskell

The first issue I encountered is that if you program in Haskell you need to be prepared to fill in the missing gaps in the library ecosystem. I began at a time when the Haskell ecosystem was beginning to mature, but there were still several important gaps. The foremost of these was a stream programming library, which I needed to write myself because I was unsatisfied with existing iteratee-based solutions, and this eventually evolved into my pipes stream programming library. If you are considering starting a large project in Haskell you need to be prepared to do some library work of your own.

Another issue was numerical code. This is limited almost exclusively to hmatrix (Haskell bindings to the GNU scientific library) and repa (a fast and native Haskell library, but without a standard library of numerical algorithms). hmatrix was problematic because it is GPLv3 licensed, and my university, UCSF, forbids releasing code with a GPLv3 license because of the patent clause. This meant that I was forced to write my own LAPACK bindings for performance-critical code. Haskell's foreign-function interface (FFI) is nice, but not having to use it at all is even nicer.

Next, space leaks cause problems. Space leaks are like the Haskell analog to C's segmentation faults: they require programmer discipline to prevent and they are not easy to debug. With experience I learned how to avoid them, mainly by:

Defining data types with strict fields
Ensuring all folds are strict in the accumulator

... but this is still a problem because it bites beginners to the language really early. Also, space leaks are the biggest source of program instability, which is a shame because it compromises what would be an otherwise rock-solid stability story for the language.

The next problem is the lack of standardized error-handling idioms. I ended up writing my own library for this, too, called the errors library, but this did not completely solve the problem either. I'm currently collaborating with John Wiegley and Michael Snoyman (of Yesod fame) to try to solve this problem.

Another problem is the pre-existing ecosystem of functionality dependent on lazy IO. I'm not going to name any names, but one library that I depend on uses lazy IO internally and it is still a source of bugs. Lazy IO causes these sort of problems because it does not guarantee ordering of effects and also results in synchronous exceptions being thrown asynchronously in pure code (which is a big no-no in the Haskell world).

Conclusions

The code for this project is open source and hosted on Github. It consists of three separate projects:

You can also learn more about the project on the official Suns website.

People interested in learning Haskell may find the code for the search engine useful to see what a medium-sized Haskell application looks like. Haskell gets a bad reputation for being overly academic and I hope people can see past that to discover an excellent programming language for rapid development with high reliability.

12 comments:

haroldcarrOctober 12, 2013 at 5:33 PM
Thanks for open-sourcing you code. I'll be studying it to help me learn Haskell.
ReplyDelete
Replies
Jan StolarekOctober 13, 2013 at 4:25 AM
Nice! So, when have you started to implement this project in Haskell?

I'm also curious about the algorithmic part of your work. How have your Haskell algorithms changed compared to the ones you used in C?
ReplyDelete
Replies
houshuangOctober 13, 2013 at 8:16 AM
Great overview, and also list of strengths and (current) weaknesses of Haskell. Also amazing your contributions to address those weaknesses!
ReplyDelete
Replies
salfumanOctober 14, 2013 at 12:11 AM
Awesome (i'm yet waiting for your book :D)
ReplyDelete
Replies
UnknownOctober 16, 2013 at 3:38 PM
Great Job!! I'm working in a package for doing ab-initio simulations and molecular dynamics in Haskell, but I work in no man's land, physicists and chemists barely known what functional programming is, and I have found myself in great trouble to explain quantum chemistry to the Haskell community. You have managed to integrate both worlds, could you please share your experience about trying to join both fields?
ReplyDelete
Replies

Add comment