Friday, August 10, 2012

Code Example #1

I've seen beginners on /r/haskell ask for practical code examples, so I thought I would share some code from my own work. Today's example will focus on how you can use Haskell to write clear and self-documenting code.

Today my PI gave me the following task:
  • Parse the alpha carbons from a PDB file
  • Scan the sequence using a window of a given size
  • For each window, collect all residues within a certain distance (called the "context")
  • Split the context into contiguous chains
Don't worry if you don't understand the terminology, because I will explain the terms as I go along.

If you want to follow along, use the full code sample in the appendix and download and extract the sample PDB file: 1YU0.pdb.


Parsing


The first thing I had to do was to take a protein structure and parse it into its alpha carbons. The input format is a Protein Data Bank (i.e. PDB) file, which has a well-documented file format. I'm interested in the ATOM record, which specifies the coordinates of a single atom in the PDB file:
COLUMNS        DATA  TYPE    FIELD        DEFINITION
----------------------------------------------------------------
 1 -  6        Record name   "ATOM  "
 7 - 11        Integer       serial       Atom serial number.
13 - 16        Atom          name         Atom name.
17             Character     altLoc       Alternate location indicator.
18 - 20        Residue name  resName      Residue name.
22             Character     chainID      Chain identifier.
23 - 26        Integer       resSeq       Residue sequence number.
27             AChar         iCode        Code for insertion of residues.
31 - 38        Real(8.3)     x            Coordinates for X in Angstroms.
39 - 46        Real(8.3)     y            Coordinates for Y in Angstroms.
47 - 54        Real(8.3)     z            Coordinates for Z in Angstroms.
55 - 60        Real(6.2)     occupancy    Occupancy.
61 - 66        Real(6.2)     tempFactor   Temperature factor.
77 - 78        LString(2)    element      Element symbol, right-justified.
79 - 80        LString(2)    charge       Charge on the atom.
There are seven fields I have to parse out from an ATOM record:
  • name: Alpha carbons have this set to " CA ". Each residue has exactly one alpha carbon, which is often used as a coarse-grained proxy for the entire residue.
  • chainID: Chains are contiguous segments of proteins, labeled by a single letter.
  • resSeq: Residues are the individual building blocks of proteins. Each residue in a chain is sequentially numbered for unique identification.
  • altLoc: Residues can have multiple coordinates, and I arbitrarily pick the first set of coordinates.
  • x, y, and z: The atom's coordinates.
Out of those fields, I only want to retain the chainID, resSeq, x, y, and z fields, so I create a data structure to hold them:
import Numeric.LinearAlgebra -- from the "hmatrix" package

type Point = Vector Double

data Atom = Atom {
    chainID :: Char  ,
    resSeq  :: Int   ,
    coord   :: Point }
    deriving (Show)
Here I use a type synonym to document what I want to use coord for. This has the added advantage that I can use a different Point type later on (such as a length-indexed type), and I need only update the type synonym to update my type signatures.

Now I need a way to parse an atom record, and I'll use attoparsec:
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import Data.Attoparsec.Char8 as P hiding (skipWhile)
import Data.Attoparsec (skipWhile)

skip n = void $ P.take n
decimal' = skipSpace >> decimal
double'  = skipSpace >> double

pAtom :: Parser Atom
pAtom = do
    string "ATOM  "        -- Must be an "ATOM  " record
    skip 6
    string " CA "          -- Must be an alpha carbon
    satisfy (inClass " A") -- First conformation (' ' or 'A')
    skip 4
    _chainID <- anyChar
    _resSeq <- decimal'
    skip 1
    x <- double'
    y <- double'
    z <- double'
    skip 26
    endOfLine
    return $ Atom {
        chainID = _chainID,
        resSeq  = _resSeq,
        coord   = fromList [x, y, z] }
That was pretty easy! Notice how I rename P.take to skip as a sort of inline comment, documenting my intention to discard the parsed output. I also use void so the result is actually discarded, matching that intention.

The next step is to create a parser for the entire file that throws out all lines that do not match:
import Control.Applicative
import Data.Maybe

pLine = skipWhile (not . isEndOfLine) *> endOfLine

pPDB :: Parser [Atom]
pPDB = catMaybes <$> (many $     (Just    <$> pAtom)
                             <|> (Nothing <$  pLine) )
Now, the awesome thing about Haskell is that you can very quickly test things out in ghci, so let's fire up the module and see if it works:
*Main> import qualified Data.ByteString.Char8 as B
*Main B> str <- B.readFile "/path/to/1YU0.pdb"
*Main B> parseOnly pPDB str
...
Atom {chainID = 'A', resSeq = 376, coord = fromList [5.274,108.854,3.352]},
Atom {chainID = 'A', resSeq = 377, coord = fromList [5.643,111.164,6.349]},
Atom {chainID = 'A', resSeq = 378, coord = fromList [4.35,109.87,9.685]},
Atom {chainID = 'A', resSeq = 379, coord = fromList [3.233,112.127,12.52]},
Atom {chainID = 'A', resSeq = 380, coord = fromList [1.9729999999999999,110.268,15.618]}]
Great! It works! (Note: this is not strictly correct, because it doesn't remove the duplicate protein models found in NMR structures, but I'm glossing over that for now.)


Distances


I can't really apply distance cutoffs without some sort of distance function:
import Data.Function (on)

dist :: Point -> Point -> Double
dist v1 v2 = norm2 (v1 - v2)

distA :: Atom -> Atom -> Double
distA = dist `on` coord
on is a very handy utility for adapting binary functions. More importantly, it makes the code read more like English: the definition of distA reads like saying "Take the distance on the coord field".
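If you haven't seen on before, it comes from Data.Function and is defined (up to naming) as (f `on` g) x y = f (g x) (g y). So distA is just a tidier way of spelling out the projection by hand; the primed name below is only for illustration:
distA' :: Atom -> Atom -> Double
distA' a1 a2 = dist (coord a1) (coord a2)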


Context


Now I have to associate each atom with its neighbors within some distance. Before I write a potentially confusing function, I write out the type to focus my mind:
type Context = [Atom]
type Cutoff  = Double

addContext :: Cutoff -> [Atom] -> [(Atom, Context)]
This function takes a list of atoms, and associates each atom with a list of its neighbors within some distance cutoff. I will be joining these lists of neighbors later on, so really a more efficient data structure would be a Set, but this is good enough for now, and I can optimize it later if necessary.

Also, note the use of type synonyms as quick-and-dirty documentation. A well-chosen type synonym can go a long way towards making your code easy to understand. And if I go back and decide to use Set, I only need to change the type synonym definition to:
type Context = Set Atom
... and all my type signatures will be fixed.

Anyway, let's compute some neighbors:
for = flip fmap

addContext cutoff as = for as $ \a1 ->
    (a1, filter (\a2 -> distA a1 a2 < cutoff) as)
This is definitely not an efficient algorithm (a HashMap-based binning algorithm would be faster), but I will let it slide for now since this is just a simple script. Also, notice that it does not remove an atom from its own neighbor list. That atom would just be reintroduced by its neighbors' contexts when I join them later, so I postpone the removal until then.
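As a quick sanity check, here are two made-up atoms 5 Å apart (these values are purely illustrative and not taken from the PDB file). With a 6 Å cutoff, each atom's context contains both atoms, including the atom itself:
contextCheck :: [(Atom, Context)]
contextCheck = addContext 6.0 [a1, a2]
  where a1 = Atom 'A' 1 (fromList [0, 0, 0])
        a2 = Atom 'A' 2 (fromList [3, 4, 0])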


Windows


My PI wants a sliding window of 7 residues and a context generated for each window. Rather than manually slide the window in an imperative style, I instead generate the set of all windows:
import Data.List

windows :: Int -> [a] -> [[a]]
windows n = filter (\xs -> length xs == n)
          . map (Prelude.take n)
          . tails
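A quick check in ghci shows what this produces on a small list:
*Main> windows 3 [1..5]
[[1,2,3],[2,3,4],[3,4,5]]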
For each given window, I have to join the contexts together and eliminate duplicates. This requires defining what a duplicate is:
duplicate a1 a2 = chainID a1 == chainID a2
               && resSeq  a1 == resSeq  a2
I could have also used on:
duplicate a1 a2 = ((==) `on` chainID) a1 a2
               && ((==) `on` resSeq ) a1 a2
Now I need to specify what it means to join contexts. I write out a type signature to be precise about what I want:
joinContexts :: [(Atom, Context)] -> ([Atom], Context)
joinContexts takes a window of atoms and their individual contexts, and joins all the contexts into one, leaving behind just a list of atoms:
joinContexts ps = (self, nonSelf)
  where self = map fst ps
        context = concat $ map snd ps
        totalContext = nubBy duplicate context
        nonSelf = deleteFirstsBy duplicate totalContext self
Using let or where clauses is another great way to annotate intermediate steps in a function, although that will be pretty obvious to most readers.
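To make that concrete, here is a small check with four made-up atoms on one chain (the names and coordinates are purely illustrative). Atoms 1 and 2 form the window, while atoms 3 and 4 appear only in their contexts; after joining, the context is deduplicated and the window atoms themselves are removed:
joinCheck :: ([Int], [Int])
joinCheck = (map resSeq self, map resSeq ctx)
  where
    (self, ctx) = joinContexts [ (a1, [a1, a2, a3])
                               , (a2, [a2, a3, a4]) ]
    [a1, a2, a3, a4] = [ Atom 'A' n (fromList [fromIntegral n, 0, 0])
                       | n <- [1 .. 4] ]
-- joinCheck == ([1, 2], [3, 4])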


Chains


Finally, I have to split each context into chains, defined as consecutive residues with the same chainID. I have a utility function for this purpose that I keep lying around, defined as:
import Data.Monoid
import Data.Ord

sequential a1 a2 = chainID a1     == chainID a2
               && (resSeq  a1 + 1 == resSeq  a2)

type Chain = [Atom]

chains' :: ([Atom] -> [Atom]) -> [Atom] -> [Chain]
chains' c []  = [c [] ]
chains' c [a] = [c [a]]
chains' c (a1:bs@(a2:as))
    | sequential a1 a2 =         chains' c' bs
    | otherwise        = (c' []):chains' id bs
  where c' = c . (a1:)

-- The actual function I use
chains :: [Atom] -> [Chain]
chains = chains' id
       . sortBy (comparing chainID `thenBy` comparing resSeq)
  where thenBy = mappend
I want to draw attention to the sortBy call in the chains function. This takes advantage of two very useful tricks. The first is the comparing function from Data.Ord. The second is the Monoid trick to compose two orderings, giving priority to the first one. Combining the two tricks gives code that reads like English: "Sort by comparing chainID first, then by comparing resSeq".
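This works because Ordering has a Monoid instance that keeps the first result that is not EQ, and a function returning a Monoid is itself a Monoid, combined pointwise. A quick ghci check on ordinary pairs shows the effect:
*Main> sortBy (comparing fst `mappend` comparing snd) [(2,'b'),(1,'b'),(2,'a'),(1,'a')]
[(1,'a'),(1,'b'),(2,'a'),(2,'b')]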

The last step is to take each window and the chains in its context and pair the window with each chain. If that didn't make sense, the type signature should make it clear:
type Window = [Atom]

distribute :: (Window, [Chain]) -> [(Window, Chain)]
This shows how Haskell type signatures can often be "self-documenting" and can say a lot of things that we might have difficulty articulating clearly in words:
-- most polymorphic type :: (a, [b]) -> [(a, b)]
distribute (x, ys) = map ((,) x) ys
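To see what it does on ordinary values, here is the same definition at its most general type (the primed name is only for illustration):
distribute' :: (a, [b]) -> [(a, b)]
distribute' (x, ys) = map ((,) x) ys
-- distribute' (1, "abc") == [(1,'a'), (1,'b'), (1,'c')]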


Combining Everything


Now I combine all the small functions I wrote into one pipeline:
import Control.Arrow (second)

type WindowSize = Int

pairs :: WindowSize -> Cutoff -> [Atom] -> [(Window, Chain)]
pairs windowSize cutoff =
    concat
  . map (distribute . second chains . joinContexts)
  . windows windowSize
  . addContext cutoff
The above function pipeline clearly illustrates the flow of information and makes it easy to reason about what my code actually does. I could even use it as a high-level specification of my program. Reading from bottom to top, it says that you:
  • Attach contexts
  • Partition the sequence into windows
  • For each window:
    • Join the contexts
    • Group the joined context into chains
    • Distribute the window over each chain
  • Concatenate all the results
If you prefer to order things from left to right (or top to bottom), then you can use the (>>>) operator from Control.Category to reverse the direction in which you compose functions:
import Control.Category
import Prelude hiding ((.), id)

pairs windowSize cutoff =
     addContext cutoff
 >>> windows windowSize
 >>> map (joinContexts >>> second chains >>> distribute)
 >>> concat
A compositional style where you chain lots of small functions makes the final overarching program much easier to follow.

All that remains is to quickly test out the code in ghci:
*Main> import qualified Data.ByteString.Char8 as B
*Main B> str <- B.readFile "/path/to/1YU0.pdb"
*Main B> let atoms = parseOnly pPDB str
*Main B> -- I use fmap since parseOnly returns an Either
*Main B> let ps = fmap (pairs 7 12.0) atoms
*Main B> -- How many pairs?
*Main B> fmap length ps
Right 3559
*Main B> -- What is the largest contiguous chain in a context?
*Main B> let size = length . snd
*Main B> let largest = fmap (maximumBy (comparing size)) ps
*Main B> fmap size largest
Right 38
*Main B> -- Which residues did it span?
*Main B> fmap (map resSeq . snd) largest
Right [337,338,339,340,341,342,343,344,345,346,347,348,349,350,351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366,367,368,369,370,371,372,373,374]
*Main B> -- What was the window it matched?
*Main B> fmap (map resSeq . fst) largest
Right [304,305,306,307,308,309,310]
I then load up the protein in PyMOL to inspect the matched ranges and, sure enough, it works correctly. In the image I highlighted the matched window in yellow and the matched context in purple.


Conclusion


So I hope this shows some of the tricks I use to improve code clarity. I left out a lot of type signatures for brevity because this was a one-off script and I only wanted to draw attention to certain functions, but if I were really serious about fully documenting the code, I would include all type signatures.


Appendix: Full code


{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import Control.Arrow (second)
import Control.Monad
import Data.Attoparsec.Char8 as P hiding (skipWhile)
import Data.Attoparsec (skipWhile)
import Data.Function (on)
import Data.List
import Data.Maybe
import Data.Monoid
import Data.Ord
import Numeric.LinearAlgebra -- from the "hmatrix" package

type Point = Vector Double

data Atom = Atom {
    chainID :: Char,
    resSeq  :: Int,
    coord   :: Point }
    deriving (Show)

skip n = void $ P.take n
decimal' = skipSpace >> decimal
double'  = skipSpace >> P.double

pAtom :: Parser Atom
pAtom = do
    string "ATOM  "        -- Must be an "ATOM  " record
    skip 6
    string " CA "          -- Must be an alpha carbon
    satisfy (inClass " A") -- First conformation (' ' or 'A')
    skip 4
    _chainID <- anyChar
    _resSeq <- decimal'
    skip 1
    x <- double'
    y <- double'
    z <- double'
    skip 26
    endOfLine
    return $ Atom {
        chainID = _chainID,
        resSeq  = _resSeq,
        coord   = fromList [x, y, z] }

pLine = skipWhile (not . isEndOfLine) *> endOfLine

pPDB :: Parser [Atom]
pPDB = catMaybes <$> (    many $ (Just <$> pAtom)
                      <|> (Nothing <$ pLine))

dist :: Point -> Point -> Double
dist v1 v2 = norm2 (v1 - v2)

distA :: Atom -> Atom -> Double
distA = dist `on` coord

type Context = [Atom]
type Cutoff  = Double

for = flip fmap

addContext :: Cutoff -> [Atom] -> [(Atom, Context)]
addContext cutoff as = for as $ \a1 ->
    (a1, filter (\a2 -> distA a1 a2 < cutoff) as)

windows :: Int -> [a] -> [[a]]
windows n = filter (\xs -> length xs == n)
          . map (Prelude.take n)
          . tails

duplicate a1 a2 = chainID a1 == chainID a2
               && resSeq  a1 == resSeq  a2

joinContexts :: [(Atom, Context)] -> ([Atom], Context)
joinContexts ps = (self, nonSelf)
  where self = map fst ps
        context = concat $ map snd ps
        totalContext = nubBy duplicate context
        nonSelf = deleteFirstsBy duplicate totalContext self

sequential a1 a2 = chainID a1     == chainID a2
               && (resSeq  a1 + 1 == resSeq  a2)

type Chain = [Atom]

chains' :: ([Atom] -> [Atom]) -> [Atom] -> [Chain]
chains' c []  = [c [] ]
chains' c [a] = [c [a]]
chains' c (a1:bs@(a2:as))
    | sequential a1 a2 =         chains' c' bs
    | otherwise        = (c' []):chains' id bs
  where c' = c . (a1:)

-- The actual function I use
chains :: [Atom] -> [Chain]
chains = chains' id
       . sortBy (comparing chainID `thenBy` comparing resSeq)
  where thenBy = mappend

type Window = [Atom]

distribute :: (Window, [Chain]) -> [(Window, Chain)]
distribute (x, ys) = map ((,) x) ys

type WindowSize = Int

pairs :: WindowSize -> Cutoff -> [Atom] -> [(Window, Chain)]
pairs windowSize cutoff =
    concat
  . map (distribute . second chains . joinContexts)
  . windows windowSize
  . addContext cutoff
