tag:blogger.com,1999:blog-1777990983847811806.post4133366526611054153..comments2024-03-16T16:29:29.582-07:00Comments on Haskell for all: Parsing chemical substructuresGabriella Gonzalezhttp://www.blogger.com/profile/01917800488530923694noreply@blogger.comBlogger15125tag:blogger.com,1999:blog-1777990983847811806.post-89931332374069085002013-06-13T14:14:34.331-07:002013-06-13T14:14:34.331-07:00Good article. My 2 pence:
It is better to use Has...Good article. My 2 pence:<br /><br />It is better to use Haskell for all tasks. C++ can't efficiently use multiple cores and has data race issues. Haskell is slower than C++, but optimized Haskell can be slower by only a factor of 2. <br /><br /><br />Dear Mr. Dalke, <br />you have pointed out that you can employ a programmer for a couple of extra months to do the task in C++ rather than use Haskell. I feel that Haskellers can do another new project/task in that period. C++, Java, and scripting languages are a failure now. There is no way to handle concurrency and parallelism in these languages efficiently. <br />The author of the article rightly pointed out that we can throw more hardware at the problem rather than use programmer time. <br />Haskell is the future whether you like it or not. That's it.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-44364787600413067772012-10-23T09:46:12.935-07:002012-10-23T09:46:12.935-07:00I live in Sweden, which is rather far from both Sa...I live in Sweden, which is rather far from both San Francisco and Texas. I don't know who Barry Bunin or CDD is. I looked at the web site now; more specifically, the video. It's going to be hard to convince pharmas to trust their data to an external cloud service. I can't judge the usefulness of the tool because I don't know the target audience and their goals. My guess is small pharma/biotech who don't have internal IT support. But they need to be big enough that using Excel/SD files isn't enough. Again, this isn't a market that I know much about.Andrew Dalkehttps://www.blogger.com/profile/17091314849699854287noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-41707741110272805272012-10-22T16:38:21.578-07:002012-10-22T16:38:21.578-07:00Oh, don't get me wrong - I wasn't saying t...Oh, don't get me wrong - I wasn't saying that it should be a generalized search and neither was I complaining about anything.
I was describing how my history got me to follow the link to your essay. Your "non-publishable lessons I learn while programming in Haskell to get people excited about learning the language" just also happens to pick up "people interested in molecular substructure search" ;)<br />Andrew Dalkehttps://www.blogger.com/profile/17091314849699854287noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-43651825194045308092012-10-22T16:36:47.572-07:002012-10-22T16:36:47.572-07:00Oh, also, since we seem to have some overlap in in...Oh, also, since we seem to have some overlap in interests, I'm curious if you live near the San Francisco Bay Area or in Texas. I currently work at UCSF, but I will eventually move back to Texas after a couple of years. Also, since you are a consultant who does a lot of substructure work, I'm wondering if you know Barry Bunin or his CDD company. That seems like something that would be right up your alley.Gabriella Gonzalezhttps://www.blogger.com/profile/01917800488530923694noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-47712706008880740492012-10-22T16:34:07.076-07:002012-10-22T16:34:07.076-07:00Well, it's not a generalized substructure sear...Well, it's not a generalized substructure search. The only reason it performs well at all is that it only keeps track of a limited set of substructures (e.g. imidazole, carboxyl, indole ring, peptide bond, carboxamide) that are relevant to the target scientific audience (structural biologists and protein designers). There is no way I could reasonably use my algorithm to index every possible chemical substructure, and if I tried searching the database online without any sort of indexing, my substructure search would fall down and cry! Like I said, my algorithm is specialized for protein-centric applications, with very low latency and integration with visualization software.
It definitely is not a general-purpose drug-design tool.Gabriella Gonzalezhttps://www.blogger.com/profile/01917800488530923694noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-65664023616075275662012-10-22T16:25:56.984-07:002012-10-22T16:25:56.984-07:00I generally agree with you. Usually, the Haskell ...I generally agree with you. Usually, the Haskell approach to getting C-like performance is to write the batch operations in a C library and then put a high-level API on top (e.g. `hmatrix`, which is just a wrapper around `libgsl`, which in turn wraps `LAPACK`). Usually I switch to Haskell when the cost overhead of calling out to the C FFI is much smaller than the operation it calls, but even then I imagine you can still get cache issues or garbage collection issues that you would not get in a tuned pure C/C++ implementation.Gabriella Gonzalezhttps://www.blogger.com/profile/01917800488530923694noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-460454045314111092012-10-22T16:11:08.933-07:002012-10-22T16:11:08.933-07:00While I do lots of substructure matches, so when I...I do lots of substructure matches, so when I saw the link on reddit I nearly jumped. "Lots" to me means something like 20 billion matches, of 650 substructures against 30 million small-molecule compounds, or more recently some work I've been doing to find maximum common substructure profiles for a set of compounds. See http://dalke.developer.se:8888/ for my work-in-progress.<br /><br />Since I'm a consultant, and don't get paid to write papers, and have to either pay for access to pay-walled literature or pay some US$12,000 to publish in the open-access literature, I tend to write my blog posts as publications. ;)Andrew Dalkehttps://www.blogger.com/profile/17091314849699854287noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-37583366689336414412012-10-22T16:05:45.245-07:002012-10-22T16:05:45.245-07:00Thank you for the detailed reply!
For the last fe...Thank you for the detailed reply!<br /><br />For the last few years, I have been working in an industry with a different balance of programmer cost versus runtime cost. Our jobs typically take weeks (sometimes months) to complete on clusters of thousands of cores, so a couple of months of programmer time to make something run twice as fast is often worth it.<br /><br />I had a go at implementing a finite-difference application in Haskell. I enjoyed working with Haskell, but my initial implementation was painfully slow. That's fair enough: it was my first serious attempt at using the language. So I learned about Data.Vector and some other libraries. However, after a while I found my beautiful, clean Haskell was becoming harder to read and understand than my original C, so I gave up.<br /><br />I came to two conclusions. First, Haskell has a steep learning curve. Simple algorithms are beautiful and easy to understand, but realistic programs are hard for a novice to both read and write. Second, high-level functional languages are not applicable to performance-critical applications. Would you agree?Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-20652675674453631892012-10-22T16:02:07.755-07:002012-10-22T16:02:07.755-07:00I've compared substructure matchers in cheminf...I've compared substructure matchers in cheminformatics and I can say that two implementations, both in C++, can vary by a factor of 8. The difference is in data structure layout, attention to algorithm implementation, etc. I don't have the PDB data set hanging around any more, so I can't do extensive testing. I loaded 2PLV, which is a virus capsid structure containing 7000+ atoms. I can find all 76 phenol-like structures (6 carbons in a ring, with a bond to an oxygen; SMARTS "[#6]~1~[#6]~[#6]~[#6]~[#6]~[#6]~1O") in about 0.56 ms.
The search for the peptide backbone (SMARTS "CCNC(=O)") finds 1709 hits in 6.6 ms.<br /><br />This is with OEChem, a high-performance, commercial toolkit for cheminformatics. There's been a lot of time spent optimizing this code base. <br /><br />Hence it's *extremely* unfair to ask about a comparison to an "optimized C/C++ implementation" because that's like asking how my wife's speed on her recumbent compares to a Tour de France racer - they are for entirely different goals.<br /><br />BTW, your salary estimate is off by about a factor of two. Your professor pays your salary out of grant funding, and about 1/2 of that funding goes into overhead for the building, library, network, etc. But a purely economic argument isn't appropriate because if it were you would get the academic version of the OEChem toolkit. It is free to academics. It also means that you wouldn't be learning how to implement fundamental algorithms, which is a part of the training you likely need for your future research.<br />Andrew Dalkehttps://www.blogger.com/profile/17091314849699854287noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-27479153604703091012012-10-22T15:59:04.449-07:002012-10-22T15:59:04.449-07:00This comment has been removed by the author.Anonymousnoreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-20962180745012468272012-10-22T15:58:18.976-07:002012-10-22T15:58:18.976-07:00You are right. I could also inline the definition...You are right. I could also inline the definition of parseMotif into parseSubstructure and hand-write the state-passing and list operations to maximize performance and combine or trim graph updates. In practice, though, this is not the rate-limiting step of the service, since it only indexes once and then just runs as a search. It goes through 4 PDB structures a second, so even for large data sets it is just a few hours to index at most, and then I serialize the results to an SSD for safekeeping.
The search service is the only part that has to be performant, and it just loads the results into memory once and returns results on the millisecond time scale from that point onward.<br /><br />Also, keep in mind I'm only using the full Protein Data Bank to prove in my paper that it scales to the worst-case scenario. The actual search service I will provide will only index a couple thousand high-quality non-redundant structures, and that only takes minutes to index.<br /><br />Also, keep in mind that I'm not even mentioning the parsing algorithm in the paper I'm publishing. The main novelty of my paper is the blazing-fast all-atom structural search that is provided as a network service that seamlessly interfaces with molecular visualization software. That's why this is a blog post and not a paper! I mainly use my blog to write about non-publishable lessons I learn while programming in Haskell to get people excited about learning the language.Gabriella Gonzalezhttps://www.blogger.com/profile/01917800488530923694noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-45816687199171225062012-10-22T15:38:55.451-07:002012-10-22T15:38:55.451-07:00Yes, bounded valences help a lot. In general there...Yes, bounded valences help a lot. In general there may be higher-order valences, but if you're just working with proteins, and skipping, say, organometallic systems, then there shouldn't be any problem. Then again, even with 10-valent systems, I can't see there being much of an effect. In fact, you can likely store them as a vector, and not have the linked-list overhead.<br /><br />I mostly work in graph space, so spatial cutoffs aren't so relevant. The Ullmann and VF2 algorithms refine the algorithm you describe, in order to improve its efficiency. Ullmann is described at http://www.engr.uconn.edu/~vkk06001/GraphIsomorphism/Papers/Ullman_Algorithm.pdf .
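The naive backtracking match under discussion (extend a partial chain one neighbor at a time, backing up at dead ends) can be sketched in a few lines of Haskell. This is a minimal illustration with hypothetical names, not the post's actual code; Ullmann and VF2 refine this kind of search with candidate pruning:

```haskell
import qualified Data.IntMap.Strict as IM

-- Node-indexed adjacency lists, as described in the thread.
type Graph = IM.IntMap [Int]

-- All embeddings of a simple chain of n atoms: extend a partial
-- match one neighbor at a time, backtracking on dead ends.
chains :: Graph -> Int -> [[Int]]
chains g n = [ p | start <- IM.keys g, p <- go [start] start (n - 1) ]
  where
    go visited _ 0 = [reverse visited]
    go visited v k =
      [ p | w <- IM.findWithDefault [] v g  -- scan only v's neighbors
          , w `notElem` visited             -- never match an atom twice
          , p <- go (w : visited) w (k - 1) ]

-- A triangle: every pair of the three atoms is bonded.
triangle :: Graph
triangle = IM.fromList [(0, [1, 2]), (1, [0, 2]), (2, [0, 1])]

main :: IO ()
main = print (take 1 (chains triangle 3))
```

Because the result list is produced lazily, `take 1` stops at the first embedding instead of enumerating all of them.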
You may be interested in some of the work described in http://www.jcheminf.com/content/4/1/13/abstract , although I have disagreements about some of the historical statements they asserted, and I found their work somewhat shoddy.<br /><br /><br />BTW, I think there's a problem with using the Parser for this task. If you want to match a single bond then there's no reason to return the mutated graph from parseBond. Instead, it should return the original graph, and parseSubstructure, if it needs to continue the search, should do the mutation itself. As it stands, you're always doing an extra graph mutation. I suspect that the performance difference is measurable, and guess it's about 5%. (Figuring motifs of length 10 atoms, and 1/2 the time spent in graph search and 1/2 in graph edits.)<br />Andrew Dalkehttps://www.blogger.com/profile/17091314849699854287noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-10036231128174143872012-10-22T15:37:49.848-07:002012-10-22T15:37:49.848-07:00(Reddit is down so I moved the conversation here.)...(Reddit is down so I moved the conversation here.) The conversation there is:<br /><br />Me:<br /><br />I've done molecular structure parsing for a long time, and I don't understand what you explained. First, what is your graph data structure? I get that atoms only have a name, and that a bond either exists or doesn't (that is, there's no bond type information; that's reasonable for structural bioinformatics though not for cheminformatics.) Does that mean that each bond link is stored twice? That is, if the atom at index 5 has 8 as an adjacent atom, then the atom at index 8 has 5 as an adjacent atom? That appears to be the case.<br /><br />(A cheminformatics package would have the atoms pointing to a bond, with bond type and aromaticity information, and the bond would point back to each of the two atoms it's connected to.)<br /><br />You do a 'deleteBond', and I couldn't tell if deletion goes as O(n) or O(log n) time.
For your code to make sense, it must be the latter. Your molecules are protein-sized, but how big are your substructures?<br /><br />I can't figure out what your substructure patterns look like. You're matching on atom names, and atom names in proteins are unique in a residue, yes? This means there's very little backtracking. Even if you match on element name, there's relatively little call for extensive backtracking. If that's the case then that would explain why parseSubstructure is so short. I usually expect something like Ullmann or VF2 for the matching, and from what I can see, your algorithm for the general case is much less efficient. That is, it looks like you do successive linear scans of the entire bond list, which implies n*m time, where n is the number of atoms in the molecule and m the number of atoms in the subgraph.<br /><br />I've a question about the <|> operator. What would happen if you wanted to find if a single substructure containing a linear chain of 12 carbon atoms exists in a buckyball? Would the (<|>) operator end up generating all 118440 matches? Or would you just get the first that it finds? How long does it take?<br /><br />You replied:<br /><br /><br />The graph is undirected, but since the Graph type is directed I store bonds as two directed edges and similarly delete them in pairs.<br /><br />The graph is stored as an adjacency array, so the time complexity of deletion is only proportional to the worst-case valency, which is 4 for proteins, so it is constant (one array lookup, delete an element from a linked list of maximum size 4, repeat for the other direction).<br /><br />Edit: Or maybe the correct term is adjacency list (I'm still new to graphs)? It's a node-indexed array where you store linked lists of neighbors, if that clears it up.<br /><br />Everything you said about substructure matching is correct except that I don't scan the entire bond or atom list. I just scan the atom's neighbors.
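The adjacency-list representation and paired edge deletion described here can be sketched as follows. This is a minimal illustration with hypothetical names; an `IntMap` of neighbor lists stands in for the node-indexed array, and with valence capped at 4 the `filter` over each short list is effectively constant time:

```haskell
import qualified Data.IntMap.Strict as IntMap

-- Node-indexed neighbor lists; an undirected bond is stored as
-- two directed edges, one in each atom's list.
type Graph = IntMap.IntMap [Int]

-- Delete an undirected bond by removing both directed edges.
-- Each neighbor list has at most 4 entries, so both removals
-- are bounded-time operations.
deleteBond :: Int -> Int -> Graph -> Graph
deleteBond a b =
      IntMap.adjust (filter (/= b)) a
    . IntMap.adjust (filter (/= a)) b

-- Example: a 4-atom chain 0-1-2-3.
chain :: Graph
chain = IntMap.fromList [(0, [1]), (1, [0, 2]), (2, [1, 3]), (3, [2])]

main :: IO ()
main = print (deleteBond 1 2 chain)
```

After `deleteBond 1 2`, atom 1's list is `[0]` and atom 2's list is `[3]`; the other entries are untouched.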
Again, because valency is capped, proteins are the epitome of sparse graphs, so this is efficient.<br /><br />Regarding your last question, Haskell is lazy, so it computes only as many results as I demand. I always take the first result, so this prevents the combinatorial blowup. But it does not guarantee that the first match won't require a lot of backtracking (especially for parsing something like a buckyball). I specialized this algorithm to protein structures and their chemistry. I should have made that clear in the post; however, the topic was the novelty of parsing non-textual things, not the algorithm's efficiency.<br /><br />Also, since you are interested in backtracking, Haskell has a very sophisticated backtracking library named LogicT that also handles pruning and fair branching. I didn't use it because it was unnecessary for the small motifs that I index.<br /><br />Also, I use several other tricks to keep the search performant. The most significant one is that I partition the structure into pages of fixed volume, since I have a specified upper bound on the size of each query (in this case, a cube of 15 angstroms). This is probably the optimization you were looking for me to talk about.<br />Andrew Dalkehttps://www.blogger.com/profile/17091314849699854287noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-24180463208218806872012-10-22T14:19:51.077-07:002012-10-22T14:19:51.077-07:00Yes, I do. First off, the index building is not r...Yes, I do. First off, the index building is not really the rate-limiting step; the search algorithm is what must be fast (and it is VERY fast).
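The "take only the first result" point in the exchange above can be demonstrated with the list instance of Alternative, where (<|>) is (++) and is lazy in its right operand. This is a stand-alone sketch, not the post's Parser:

```haskell
import Control.Applicative ((<|>))

-- For lists, (<|>) is (++), which is lazy in its right operand:
-- taking the first result never forces the remaining alternatives,
-- so the "expensive" branch below is never evaluated.
expensive :: [Int]
expensive = [error "never evaluated"]

firstResult :: [Int]
firstResult = take 1 ([42] <|> expensive)

main :: IO ()
main = print firstResult  -- prints [42]
```

This is exactly why asking for one match avoids the combinatorial blowup: the unexplored alternatives remain unevaluated thunks.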
However, the indexing algorithm still has good enough performance for our purposes.<br /><br />I've only benchmarked the indexing on a subset of the Protein Data Bank, since my computer only has 1 GB of memory free for testing and we're still waiting on a part for the final server that has enough memory to hold the index for the entire data set, so until then I can't benchmark the final data set. However, I have extrapolated based on what my computer can hold, and it indexes 4 structures a second on a single 1 GHz core. The graphs of these structures have 2000 nodes on average and are very sparse, with a maximum degree of 4 for any node (since protein atoms form 4 bonds maximum), and 3 is typical. Even when I eventually branch out to all molecules, the maximum degree is 6 and that's rare, so I store these graphs as adjacency lists to take advantage of their sparseness.<br /><br />Extrapolating to the entire Protein Data Bank (i.e. 85,000 structures), the indexing stage would take on the order of 6 hours and 40 GB of memory. However, I'm indexing the entire data set only to prove that we can (and that's what the server is for). In practice, the search service will index a high-resolution non-redundant data set of 2000 protein structures, which should take about 8 minutes and 1.1 GB of memory, and runs rapidly on any ordinary desktop computer.<br /><br />Once the index is built, the search runs very quickly on just one 1 GHz core and responds in milliseconds to queries.
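The extrapolation quoted above checks out; the figures below are taken from the comment, and this is just the back-of-the-envelope arithmetic, not a benchmark:

```haskell
-- Indexing extrapolation from the comment: 85,000 structures
-- at 4 structures per second, converted from seconds to hours.
indexingHours :: Double
indexingHours = 85000 / 4 / 3600  -- roughly 5.9, i.e. "on the order of 6 hours"

main :: IO ()
main = print indexingHours
```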
It turns out that the front-end (PyMOL, molecular visualization software that uses a C backend for loading the results) is the rate-limiting step, so short of modifying PyMOL I can't really improve performance any more, and since it is a commercial product it's already the fastest molecular visualization software available.<br /><br />I haven't written the equivalent algorithm in C (the data structures and tests alone would take forever to implement), but for other projects I have written the same code in both Haskell and C. I can usually get within a factor of 3 of C (both for performance and memory usage) just by using high-performance Haskell libraries that are either heavily optimized to generate efficient core or are just wrappers around C APIs.<br /><br />For example, just by defining Storable instances for my data types and using Data.Vector.Storable, I got an in-memory representation that was byte-for-byte identical to what I calculated the equivalent C representation would take. Similarly, the Vector library gives almost C-like array performance and generates really good assembly even for its high-level API.<br /><br />So if you want an economic assessment of the performance vs. language tradeoff: I cost my professor about $200 a work day ($100 in salary and $100 in tuition+benefits) and this project took exactly four weeks (20 work days), so the price of labor was $8,000. I expect that had I written it in C it would have taken at least 3 times as long, so the cost savings are on the order of at least $10,000, and more like $20,000. The cost of our server that can fit the entire PDB data set in memory is $2,500.
So in this case, the hardware is much cheaper than I am, and it's a better use of my professor's money to program more quickly in Haskell and just buy a cheap server.Gabriella Gonzalezhttps://www.blogger.com/profile/01917800488530923694noreply@blogger.comtag:blogger.com,1999:blog-1777990983847811806.post-28226244871273192772012-10-22T13:11:40.908-07:002012-10-22T13:11:40.908-07:00Do you have any performance statistics for your st...Do you have any performance statistics for your structural search engine? For example, how big is a typical Structure graph, and how long does your implementation take to search it? Have you compared it to an optimized C/C++ implementation?<br /><br />I've tinkered with Haskell before, but never managed to get anything close to C/C++ performance out of it. However, I've been working on numerical PDE solvers, which is quite a different sort of problem, so maybe Haskell just isn't so suitable for that sort of thing.Anonymousnoreply@blogger.com
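The Storable technique mentioned in the thread can be sketched like this. The `Atom` record below is hypothetical, not the project's actual type, but it shows how a `Storable` instance yields a packed, C-like layout that `Data.Vector.Storable` stores unboxed:

```haskell
import qualified Data.Vector.Storable as VS
import Foreign.Storable
import Data.Int (Int32)

-- Hypothetical atom record: element number plus 3D coordinates.
-- Layout matches a C struct { int32_t e; double x, y, z; }:
-- 4-byte Int32, 4 bytes of alignment padding, three 8-byte Doubles.
data Atom = Atom
  { atomElement :: !Int32
  , atomX, atomY, atomZ :: !Double
  } deriving (Eq, Show)

instance Storable Atom where
  sizeOf    _ = 32
  alignment _ = 8
  peek p = Atom <$> peekByteOff p 0
                <*> peekByteOff p 8
                <*> peekByteOff p 16
                <*> peekByteOff p 24
  poke p (Atom e x y z) = do
    pokeByteOff p 0  e
    pokeByteOff p 8  x
    pokeByteOff p 16 y
    pokeByteOff p 24 z

-- The vector pokes its elements into a flat buffer: no per-element
-- boxing, and a memory footprint you can compute by hand.
atoms :: VS.Vector Atom
atoms = VS.fromList [Atom 6 0 0 0, Atom 8 1.5 0 0]

main :: IO ()
main = print (VS.length atoms, atoms VS.! 1)
```

Indexing with `VS.!` peeks the element straight back out of the buffer, so `fromList` followed by `!` round-trips each record through the packed representation.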