Saturday, August 18, 2012

The category design pattern

Functional programming is all the rage these days, but in this post I want to emphasize that functional programming is a subset of a more important overarching programming paradigm: compositional programming.

If you've ever used Unix pipes, you'll understand the importance and flexibility of composing small reusable programs to get powerful and emergent behaviors. Similarly, if you program functionally, you'll know how cool it is to compose a bunch of small reusable functions into a fully featured program.

Category theory codifies this compositional style into a design pattern, the category. Moreover, category theory gives us a precise prescription for how to create our own abstractions that follow this design pattern: the category laws. These laws differentiate category theory from other design patterns by providing rigorous criteria for what does and does not qualify as compositional.

One could easily dismiss this compositional ideal as just that: an ideal, something unsuitable for "real-world" scenarios. However, category theory provides the meat that shows this compositional ideal appears everywhere and can rise to the challenge of messy problems and complex business logic.

This is the first of many posts in which I will demonstrate how to practically use this compositional style in your programs, even for things that seem like they couldn't possibly lend themselves to it. This first post introduces the category as a compositional design pattern.


Categories


I'm going to give a slightly different introduction to category theory than most people do. I'll gloss over what morphisms and objects are, skip domains and codomains, and instead go straight to composition, because from a programmer's point of view a category is just a compositional design pattern.

Category theory says that for any given category there must be some sort of composition operator, which I will denote (.). The first rule is that this composition operator is associative:
(f . g) . h = f . (g . h) -- Associativity law
This is useful because it means we can completely ignore the order of grouping and write it without any parentheses at all:
f . g . h
Category theory also says that this composition operator must have a left and right identity, which I will denote id. Being an identity means that:
id . f = f  -- Left  identity law

f . id = f  -- Right identity law
The associativity law and the two identity laws are known as the category laws.

Notice that the definition of a category does not define:
  • what (.) is,
  • what id is, or
  • what f, g, and h might be.
Instead, category theory leaves it up to us to discover what they might be.
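In fact, Haskell packages this interface up directly as the Category type class in Control.Category:
class Category cat where
    id  :: cat a a
    (.) :: cat b c -> cat a b -> cat a c
The compiler can't enforce the laws, though; each instance is trusted to satisfy them.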

The brilliance behind the category design pattern is that any composition operator that observes these laws will be:
  • easy to use,
  • intuitive,
  • and free from edge cases.
This is why we try to formulate abstractions in terms of the category design pattern whenever possible.


The function category


Let's define our first category: the category of Haskell functions!
id  :: (a -> a)
id x = x

(.) :: (b -> c) -> (a -> b) -> (a -> c)
(f . g) x = f (g x)
Let's prove to ourselves that these obey the category laws:
-- Left identity: id . f = f
id . f
= \x -> id (f x)
= \x -> f x
= f

-- Right identity: f . id = f
f . id
= \x -> f (id x)
= \x -> f x
= f

-- Associativity: (f . g) . h = f . (g . h)
(f . g) . h
= \x -> (f . g) (h x)
= \x -> f (g (h x))
= \x -> f ((g . h) x)
= \x -> (f . (g . h)) x
= f . (g . h)
Function composition is very easy to use, yet so powerful, precisely because it forms a category! This lets us express complex transformations simply by composing a bunch of reusable parts:
bestLangs :: [Language] -> [Language]
bestLangs = take 3 . sortBy (comparing speed) . filter isCool
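If you want to try this out, here's one possible set of supporting definitions (the Language type and its fields are made up for illustration):
import Data.List (sortBy)
import Data.Ord (comparing)

data Language = Language
    { langName :: String
    , speed    :: Double
    , isCool   :: Bool }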
Unfortunately, we can't express all of our programs as chains of ordinary functions. I guess we just give up, right? Wrong!


The Kleisli category


The next most common category we encounter on a daily basis is the category of monadic functions, which generalize ordinary functions:
return :: (Monad m) => (a -> m a)

(<=<)  :: (Monad m) => (b -> m c) -> (a -> m b) -> (a -> m c)
Mathematicians call this the "Kleisli" category, and Control.Monad provides both of the above functions.

Notice how the type signatures of return and (<=<) resemble their functional counterparts:
id     ::              (a ->   a)
return :: (Monad m) => (a -> m a)

(.)    ::              (b ->   c) -> (a ->   b) -> (a ->   c)
(<=<)  :: (Monad m) => (b -> m c) -> (a -> m b) -> (a -> m c)
The implementation for (<=<) also resembles the implementation for function composition:
(f  .  g) x = f     (g x)
(f <=< g) x = f =<< (g x)

-- Note (=<<) is the same as (>>=), but with the arguments flipped
Not a coincidence! Monadic functions just generalize ordinary functions and the Kleisli category demonstrates that monadic functions are composable, too. They just use a different composition operator: (<=<), and a different identity: return.
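Here's a quick taste in ghci using the Maybe monad, with a throwaway safeDiv helper (this assumes (<=<) is in scope from Control.Monad):
*Main> let safeDiv x y = if y == 0 then Nothing else Just (div x y)
*Main> (safeDiv 100 <=< safeDiv 1000) 5   -- both divisions succeed
Just 0
*Main> (safeDiv 100 <=< safeDiv 1000) 0   -- the inner division fails
Nothing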

Well, let's assume that category theorists aren't bullshitting us and that (<=<) really is some sort of composition and return really is its identity. If that were true, we'd expect the following laws to hold:
return <=< f = f                   -- Left  identity

f <=< return = f                   -- Right identity

(f <=< g) <=< h = f <=< (g <=< h)  -- Associativity
Well, we already have the definition for (<=<):
(f <=< g) x = f =<< (g x)
... so let's use that definition to expand those laws:
return =<< (f x) = (f x)

f =<< (return x) = f x

(\y -> f =<< (g y)) =<< h x = f =<< (g =<< (h x))
If we simplify those a little and use (>>=) to flip the order of arguments, we get:
m >>= return = m

return x >>= f = f x

m >>= (\y -> g y >>= f) = (m >>= g) >>= f
Look familiar? Those are just the monad laws, which all Monad instances are required to satisfy. If you have ever wondered where those monad laws came from, now you know! They are just the category laws in disguise.
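As a quick sanity check, here's the first law verified by hand for the Maybe monad, using the standard instance:
-- The Maybe instance:  return = Just
--                      Just x  >>= f = f x
--                      Nothing >>= f = Nothing

-- m >>= return = m, by cases on m:
--     Just x  >>= return = return x = Just x
--     Nothing >>= return = Nothing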

Consequently, every new Monad we define gives us a category for free! Let's try out some of these brave new categories:
-- The Kleisli category for the Maybe monad
lookup :: (Eq k) => k -> [(k, v)] -> Maybe v
maximumByMay :: (a -> a -> Ordering) -> [a] -> Maybe a

bestSuitor :: [(String, [Suitor])] -> Maybe Suitor
bestSuitor = maximumByMay (comparing handsome) <=< lookup "Tall"


-- The Kleisli category for the [] monad
children :: Person -> [Person]

greatGrandChildren :: Person -> [Person]
greatGrandChildren = children <=< children <=< children


-- The Kleisli category for the IO monad
-- * Stolen from /r/haskell today
spawn      ::  IO a  -> IO (IO a)
mapM spawn :: [IO a] -> IO [IO a]
sequence   :: [IO a] -> IO    [a]

concurrentSequence :: [IO a] -> IO [a]
concurrentSequence = sequence <=< mapM spawn
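To make the Maybe example self-contained, here's a version you can load directly; the Suitor type is a made-up stand-in, and maximumByMay is hand-rolled here (the safe package provides a version, too):
import Control.Monad ((<=<))
import Data.List (maximumBy)
import Data.Ord (comparing)

data Suitor = Suitor { suitorName :: String, handsome :: Int }

-- Like maximumBy, but total: returns Nothing for an empty list
maximumByMay :: (a -> a -> Ordering) -> [a] -> Maybe a
maximumByMay _   [] = Nothing
maximumByMay cmp xs = Just (maximumBy cmp xs)

bestSuitor :: [(String, [Suitor])] -> Maybe Suitor
bestSuitor = maximumByMay (comparing handsome) <=< lookup "Tall"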
Monads that don't observe these laws are buggy and unintuitive. Don't believe me? Just ask the people who tried to use ListT, which breaks the monad laws.


The Pipe category


Not all categories are built from functions. I'll use a primitive version of my Pipe type (from the pipes package) with effects removed to simplify the example:
data Pipe a b r
  = Pure r
  | Await (a -> Pipe a b r)
  | Yield b (Pipe a b r)

Pure    r  <-< _          = Pure r
Yield b p1 <-< p2         = Yield b (p1 <-< p2)
Await   f  <-< Yield b p2 = f b <-< p2
p1         <-< Await   f  = Await $ \a -> p1 <-< f a
_          <-< Pure    r  = Pure r

cat = Await $ \a -> Yield a cat
Let's check out what types the compiler infers:
cat   :: Pipe a a r

(<-<) :: Pipe b c r -> Pipe a b r -> Pipe a c r
Those look an awful lot like an identity and composition. I leave it as an exercise for the reader to prove that they actually do form a category:
cat <-< p = p                            -- Left  identity

p <-< cat = p                            -- Right identity

(p1 <-< p2) <-< p3 = p1 <-< (p2 <-< p3)  -- Associativity
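If you want to play with these laws yourself, here's a hypothetical helper (not part of the pipes package) that feeds a pipe inputs from a list and collects its outputs:
-- Stop on Pure, or when the input list runs dry
runPipe :: [a] -> Pipe a b r -> [b]
runPipe _      (Pure _)    = []
runPipe xs     (Yield b p) = b : runPipe xs p
runPipe (x:xs) (Await f)   = runPipe xs (f x)
runPipe []     (Await _)   = []

-- A pipe that doubles every input
doubler :: Pipe Int Int r
doubler = Await $ \a -> Yield (2 * a) doubler

-- runPipe [1,2,3] (doubler <-< cat) == [2,4,6]
-- runPipe [1,2,3] (cat <-< doubler) == [2,4,6]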
Pipes show how more complicated abstractions that don't fit neatly into ordinary function composition can still be programmed in a compositional style. I won't belabor the compositionality of pipes, though, since my tutorial already does that.

So if you find something that doesn't seem like it could be compositional, don't give up! Chances are that a compositional solution exists just beneath the surface!


Conclusions


All category theory says is that composition is the best design pattern, but then leaves it up to you to define what precisely composition is. It's up to you to discover new and interesting ways to compose things besides just composing functions. As long as the composition operator you define obeys the category laws, you're golden.

Also, I'm really glad to see a resurgence in functional programming (since functions form a category), but in the long run we really need to think about more interesting composition operators than just function composition if we are serious about tackling more complicated problem domains.

Hopefully this post gets you a little bit excited about category theory. In future posts, I will expand upon this post with the following topics:
  • Why the category laws ensure that code is easy, intuitive, and free of edge cases
  • How functors let you mix and match different categories
  • How to use categories to optimize your code
  • How to use categories to simplify equational reasoning

Friday, August 10, 2012

Code Example #1

I've seen beginners on /r/haskell ask for practical code examples so I thought I would share some code from my own work. Today's example will focus on how you can use Haskell to write clear and self-documenting code.

Today my PI (principal investigator) gave me the following task:
  • Parse the alpha carbons from a PDB file
  • Scan the sequence using a window of a given size
  • For each window, collect all residues within a certain distance (called the "context")
  • Split the context into contiguous chains
Don't worry if you don't understand the terminology, because I will explain the terms as I go along.

If you want to follow along, use the full code sample in the appendix and download and extract the sample PDB file: 1YU0.pdb.


Parsing


The first thing I had to do was take a protein structure and parse out its alpha carbons. The input format is a Protein Data Bank (PDB) file, which has a well-documented file format. I'm interested in the ATOM record, which specifies the coordinates of a single atom in the PDB file:
COLUMNS        DATA  TYPE    FIELD        DEFINITION
---------------------------------------------------------------------------
 1 -  6        Record name   "ATOM  "
 7 - 11        Integer       serial       Atom serial number.
13 - 16        Atom          name         Atom name.
17             Character     altLoc       Alternate location indicator.
18 - 20        Residue name  resName      Residue name.
22             Character     chainID      Chain identifier.
23 - 26        Integer       resSeq       Residue sequence number.
27             AChar         iCode        Code for insertion of residues.
31 - 38        Real(8.3)     x            Orthogonal coordinates for X in Angstroms.
39 - 46        Real(8.3)     y            Orthogonal coordinates for Y in Angstroms.
47 - 54        Real(8.3)     z            Orthogonal coordinates for Z in Angstroms.
55 - 60        Real(6.2)     occupancy    Occupancy.
61 - 66        Real(6.2)     tempFactor   Temperature factor.
77 - 78        LString(2)    element      Element symbol, right-justified.
79 - 80        LString(2)    charge       Charge on the atom.
There are seven fields I have to parse out from an Atom record:
  • name: Alpha carbons have this set to " CA ". Each residue has exactly one alpha carbon, which is often used as a coarse-grained proxy for the entire residue.
  • chainID: Chains are contiguous segments of proteins, labeled by a single letter.
  • resSeq: Residues are the individual building blocks of proteins. Each residue in a chain is sequentially numbered for unique identification.
  • altLoc: Residues can have multiple coordinates, and I arbitrarily pick the first set of coordinates.
  • x, y, and z: The atom's coordinates.
Out of those fields, I only want to retain the chainID, resSeq, x, y, and z fields, so I create a data structure to hold them:
import Numeric.LinearAlgebra -- from the "hmatrix" package

type Point = Vector Double

data Atom = Atom {
    chainID :: Char  ,
    resSeq  :: Int   ,
    coord   :: Point }
    deriving (Show)
Here I use a type synonym to document what I want to use coord for. This has the added advantage that I can use a different Point type later on (such as a length-indexed type), and I need only update the type synonym to update my type signatures.

Now I need a way to parse an atom record, and I'll use attoparsec:
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad
import Data.Attoparsec.Char8 as P hiding (skipWhile)
import Data.Attoparsec (skipWhile)

skip n = void $ P.take n
decimal' = skipSpace >> decimal
double'  = skipSpace >> double

pAtom :: Parser Atom
pAtom = do
    string "ATOM  "        -- Must be an "ATOM  " record
    skip 6
    string " CA "          -- Must be an alpha carbon
    satisfy (inClass " A") -- First conformation (' ' or 'A')
    skip 4
    _chainID <- anyChar
    _resSeq <- decimal'
    skip 1
    x <- double'
    y <- double'
    z <- double'
    skip 26
    endOfLine
    return $ Atom {
        chainID = _chainID,
        resSeq  = _resSeq,
        coord   = fromList [x, y, z] }
That was pretty easy! Notice how I rename P.take to skip as a sort of in-line comment to document my intention to discard the output. I also use void to force it to discard the result and match the intention.

The next step is to create a parser for the entire file that throws out all lines that do not match:
import Control.Applicative
import Data.Maybe

pLine = skipWhile (not . isEndOfLine) *> endOfLine

pPDB :: Parser [Atom]
pPDB = catMaybes <$> (many $     (Just    <$> pAtom)
                             <|> (Nothing <$  pLine) )
Now, the awesome thing about Haskell is that you can very quickly test things out in ghci, so let's fire up the module and see if it works:
*Main> import qualified Data.ByteString.Char8 as B
*Main B> str <- B.readFile "/path/to/1YU0.pdb"
*Main B> parseOnly pPDB str
... resSeq = 376, coord = fromList [5.274,108.854,3.352]},
Atom {chainID = 'A', resSeq = 377, coord = fromList [5.643,111.164,6.349]},
Atom {chainID = 'A', resSeq = 378, coord = fromList [4.35,109.87,9.685]},
Atom {chainID = 'A', resSeq = 379, coord = fromList [3.233,112.127,12.52]},
Atom {chainID = 'A', resSeq = 380, coord = fromList [1.9729999999999999,110.268,15.618]}]
Great! It works! (Note: it's actually not quite correct, because it doesn't remove the multiple protein models in NMR structures, but I'm glossing over that for now.)


Distances


I can't really calculate distance cutoffs without some sort of a distance function:
import Data.Function (on)

dist :: Point -> Point -> Double
dist v1 v2 = norm2 (v1 - v2)

distA :: Atom -> Atom -> Double
distA = dist `on` coord
on is a very handy utility for lifting a binary function to operate through an accessor. More importantly, it makes the code read more like English: the definition of distA reads like saying "take the distance on the coord field".
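For reference, on has a one-line definition:
(f `on` g) x y = f (g x) (g y)

-- so:  distA a1 a2 = dist (coord a1) (coord a2)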


Context


Now I have to associate each atom with its neighbors within some distance. Before I write a potentially confusing function, I write out the type to focus my mind:
type Context = [Atom]
type Cutoff  = Double

addContext :: Cutoff -> [Atom] -> [(Atom, Context)]
This function takes a list of atoms, and associates each atom with a list of its neighbors within some distance cutoff. I will be joining these lists of neighbors later on, so really a more efficient data structure would be a Set, but this is good enough for now, and I can optimize it later if necessary.

Also, note the use of type synonyms as quick and dirty documentation. A well-chosen type synonym can go a long way towards making your code easy to understand. And if I go back and decide to use Set, I only need to change the type synonym definition to:
type Context = Set Atom
... and all my type signatures will be fixed.

Anyway, let's compute some neighbors:
for = flip fmap

addContext cutoff as = for as $ \a1 ->
    (a1, filter (\a2 -> distA a1 a2 < cutoff) as)
This is definitely not an efficient algorithm (a HashMap-based binning algorithm would be faster), but I will let it slide for now since this is just a simple script. Also, notice that it does not eliminate an atom from its own neighbor list. That's deliberate: the atom would just be reintroduced when joining contexts, so I postpone removing it until then.


Windows


My PI wants a sliding window of 7 residues and a context generated for each window. Rather than manually slide the window in an imperative style, I instead generate the set of all windows:
import Data.List

windows :: Int -> [a] -> [[a]]
windows n = filter (\xs -> length xs == n)
          . map (Prelude.take n)
          . tails
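A quick ghci check makes the behavior concrete:
*Main> windows 3 [1..5]
[[1,2,3],[2,3,4],[3,4,5]]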
For each given window, I have to join the contexts together and eliminate duplicates. This requires defining what a duplicate is:
duplicate a1 a2 = chainID a1 == chainID a2
               && resSeq  a1 == resSeq  a2
I could have also used on:
duplicate a1 a2 = ((==) `on` chainID) a1 a2
               && ((==) `on` resSeq ) a1 a2
Now I need to specify what it means to join contexts. I write out a type signature to be precise about what I want:
joinContexts :: [(Atom, Context)] -> ([Atom], Context)
joinContexts takes a window of atoms and their individual contexts, and joins all the contexts into one, leaving behind just a list of atoms:
joinContexts ps = (self, nonSelf)
  where self = map fst ps
        context = concat $ map snd ps
        totalContext = nubBy duplicate context
        nonSelf = deleteFirstsBy duplicate totalContext self
Using let or where clauses is another great way to annotate intermediate steps in a function, although that will be pretty obvious to most readers.


Chains


Finally, I have to split each context into chains, defined as consecutive residues with the same chainID. I have a utility function for this purpose that I keep lying around, defined as:
import Data.Monoid
import Data.Ord

sequential a1 a2 = chainID a1     == chainID a2
               && (resSeq  a1 + 1 == resSeq  a2)

type Chain = [Atom]

chains' :: ([Atom] -> [Atom]) -> [Atom] -> [Chain]
chains' c []  = [c [] ]
chains' c [a] = [c [a]]
chains' c (a1:bs@(a2:as))
    | sequential a1 a2 =         chains' c' bs
    | otherwise        = (c' []):chains' id bs
  where c' = c . (a1:)

-- The actual function I use
chains :: [Atom] -> [Chain]
chains = chains' id
       . sortBy (comparing chainID `thenBy` comparing resSeq)
  where thenBy = mappend
I want to draw attention to the sortBy call in the chains function. This takes advantage of two very useful tricks. The first is the comparing function from Data.Ord. The second is the Monoid trick to compose two orderings, giving priority to the first one. Combining the two tricks gives code that reads like English: "Sort by comparing chainID first, then by comparing resSeq".
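If the Monoid trick is unfamiliar: Ordering is a Monoid whose mappend keeps the first non-EQ result, and functions into a Monoid are themselves a Monoid, so mappend-ing two comparators tries the first comparison and falls back to the second on ties:
*Main> sortBy (comparing fst `mappend` comparing snd) [(2,'b'),(1,'c'),(1,'a')]
[(1,'a'),(1,'c'),(2,'b')]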

The last step is that I have to take each window and the chains in its context and pair the window with each chain. If that didn't make sense, I'm pretty sure the type signature would make it clear:
type Window = [Atom]

distribute :: (Window, [Chain]) -> [(Window, Chain)]
This shows how Haskell type signatures can often be "self-documenting" and can say a lot of things that we might have difficulty articulating clearly in words:
-- most polymorphic type :: (a, [b]) -> [(a, b)]
distribute (x, ys) = map ((,) x) ys
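At the polymorphic type from the comment, the behavior is easy to see:
-- distribute (1, "abc") == [(1,'a'),(1,'b'),(1,'c')]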


Combining Everything


Now I combine all the small functions I wrote into one pipeline:
import Control.Arrow (second)

type WindowSize = Int

pairs :: WindowSize -> Cutoff -> [Atom] -> [(Window, Chain)]
pairs windowSize cutoff =
    concat
  . map (distribute . second chains . joinContexts)
  . windows windowSize
  . addContext cutoff
The above function pipeline clearly illustrates the flow of information and makes it easy to reason about what my code actually does. I could even use it as a high-level specification of my program. Reading from bottom to top it says that you:
  • Attach contexts
  • Partition the sequence into windows
  • For each window:
    • Join the contexts
    • Group the joined context into chains
    • Distribute the window over each chain
  • Concatenate all the results
If you prefer to order things from left to right (or top to bottom), then you can use the (>>>) operator from Control.Category to reverse the direction in which you compose functions:
import Control.Category
import Prelude hiding ((.), id)

pairs windowSize cutoff =
     addContext cutoff
 >>> windows windowSize
 >>> map (joinContexts >>> second chains >>> distribute)
 >>> concat
A compositional style where you chain lots of small functions makes the final overarching program much easier to follow.

All that remains is to quickly test out the code in ghci:
*Main> import qualified Data.ByteString.Char8 as B
*Main B> str <- B.readFile "/path/to/1YU0.pdb"
*Main B> let atoms = parseOnly pPDB str
*Main B> -- I use fmap since parseOnly returns an Either
*Main B> let ps = fmap (pairs 7 12.0) atoms
*Main B> -- How many pairs?
*Main B> fmap length ps
Right 3559
*Main B> -- What is the largest contiguous chain in a context?
*Main B> let size = length . snd
*Main B> let largest = fmap (maximumBy (comparing size)) ps
*Main B> fmap size largest
Right 38
*Main B> -- Which residues did it span?
*Main B> fmap (map resSeq . snd) largest
Right [337,338,339,340,341,342,343,344,345,346,347,348,349,350,
351,352,353,354,355,356,357,358,359,360,361,362,363,364,365,366
,367,368,369,370,371,372,373,374]
*Main B> -- What was the window it matched?
*Main B> fmap (map resSeq . fst) largest
Right [304,305,306,307,308,309,310]
I then load up the protein in PyMOL to inspect the matched ranges and, sure enough, it works correctly.
[Image: the matched window highlighted in yellow and the matched context in purple.]


Conclusion


So I hope this shows some of the tricks I use to improve code clarity. I left out a lot of type signatures for brevity because this was a one-off script and I only wanted to draw attention to certain functions, but if I were really serious about fully documenting the code I would include all type signatures.


Appendix: Full code


{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import Control.Arrow (second)
import Control.Monad
import Data.Attoparsec.Char8 as P hiding (skipWhile)
import Data.Attoparsec (skipWhile)
import Data.Function (on)
import Data.List
import Data.Maybe
import Data.Monoid
import Data.Ord
import Numeric.LinearAlgebra -- from the "hmatrix" package

type Point = Vector Double

data Atom = Atom {
    chainID :: Char,
    resSeq  :: Int,
    coord   :: Point }
    deriving (Show)

skip n = void $ P.take n
decimal' = skipSpace >> decimal
double'  = skipSpace >> P.double

pAtom :: Parser Atom
pAtom = do
    string "ATOM  "        -- Must be an "ATOM  " record
    skip 6
    string " CA "          -- Must be an alpha carbon
    satisfy (inClass " A") -- First conformation (' ' or 'A')
    skip 4
    _chainID <- anyChar
    _resSeq <- decimal'
    skip 1
    x <- double'
    y <- double'
    z <- double'
    skip 26
    endOfLine
    return $ Atom {
        chainID = _chainID,
        resSeq  = _resSeq,
        coord   = fromList [x, y, z] }

pLine = skipWhile (not . isEndOfLine) *> endOfLine

pPDB :: Parser [Atom]
pPDB = catMaybes <$> (    many $ (Just <$> pAtom)
                      <|> (Nothing <$ pLine))

dist :: Point -> Point -> Double
dist v1 v2 = norm2 (v1 - v2)

distA :: Atom -> Atom -> Double
distA = dist `on` coord

type Context = [Atom]
type Cutoff  = Double

for = flip fmap

addContext :: Cutoff -> [Atom] -> [(Atom, Context)]
addContext cutoff as = for as $ \a1 ->
    (a1, filter (\a2 -> distA a1 a2 < cutoff) as)

windows :: Int -> [a] -> [[a]]
windows n = filter (\xs -> length xs == n)
          . map (Prelude.take n)
          . tails

duplicate a1 a2 = chainID a1 == chainID a2
               && resSeq  a1 == resSeq  a2

joinContexts :: [(Atom, Context)] -> ([Atom], Context)
joinContexts ps = (self, nonSelf)
  where self = map fst ps
        context = concat $ map snd ps
        totalContext = nubBy duplicate context
        nonSelf = deleteFirstsBy duplicate totalContext self

sequential a1 a2 = chainID a1     == chainID a2
               && (resSeq  a1 + 1 == resSeq  a2)

type Chain = [Atom]

chains' :: ([Atom] -> [Atom]) -> [Atom] -> [Chain]
chains' c []  = [c [] ]
chains' c [a] = [c [a]]
chains' c (a1:bs@(a2:as))
    | sequential a1 a2 =         chains' c' bs
    | otherwise        = (c' []):chains' id bs
  where c' = c . (a1:)

-- The actual function I use
chains :: [Atom] -> [Chain]
chains = chains' id
       . sortBy (comparing chainID `thenBy` comparing resSeq)
  where thenBy = mappend

type Window = [Atom]

distribute :: (Window, [Chain]) -> [(Window, Chain)]
distribute (x, ys) = map ((,) x) ys

type WindowSize = Int

pairs :: WindowSize -> Cutoff -> [Atom] -> [(Window, Chain)]
pairs windowSize cutoff =
    concat
  . map (distribute . second chains . joinContexts)
  . windows windowSize
  . addContext cutoff