Wednesday, December 25, 2013

Equational reasoning

You may have heard that Haskell is "great for equational reasoning", but perhaps you didn't know precisely what that meant. This post will walk through an intermediate example of equational reasoning to show how you can interpret Haskell code by hand by applying substitutions as if your code were a mathematical expression. This process of interpreting code by substituting equals-for-equals is known as equational reasoning.


Equality

In Haskell, when you see a single equals sign it means that the left-hand and right-hand sides are substitutable for each other both ways. For example, if you see:

x = 5

That means that wherever you see an x in your program, you can substitute it with a 5, and, vice versa, anywhere you see a 5 in your program you can substitute it with an x. Substitutions like these preserve the behavior of your program.

In fact, you can "run" your program just by repeatedly applying these substitutions until your program is just one giant main. To illustrate this, we will begin with the following program, which should print the number 1 three times:

import Control.Monad

main = replicateM_ 3 (print 1)

replicateM_ is a function that repeats an action a specified number of times. Its type when specialized to IO is:

replicateM_ :: Int -> IO () -> IO ()

The first argument is an Int specifying how many times to repeat the action and the second argument is the action we wish to repeat. In the above example, we specify that we wish to repeat the print 1 action three times.
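
In fact, we can check this claim empirically before proving it, just by running the program:

>>> main
1
1
1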

But what if you don't believe me? What if you wanted to prove to yourself that it repeated the action three times? How would you prove that?


Use the source!

You can locate the source to Haskell functions using one of three tricks:

  • Use Hoogle, which can also search for functions by type signature

  • Use Hayoo!, which is like Hoogle, but searches a larger package database and is stricter about matches

  • Use Google and search for "hackage <function>". This also works well for searching for packages.

Using any of those three tricks we would locate replicateM_ here and then we can click the Source link to the right of it to view its definition, which I reproduce here:

replicateM_ n x = sequence_ (replicate n x)

The equals sign means that any time we see something of the form replicateM_ n x, we can substitute it with sequence_ (replicate n x), for any choice of n or x. For example, if we choose the following values for n and x:

n = 3

x = print 1

... then we obtain the following more specific equation:

replicateM_ 3 (print 1) = sequence_ (replicate 3 (print 1))

We will use this equation to replace our program's replicateM_ command with an equal program built from sequence_ and replicate:

main = sequence_ (replicate 3 (print 1))

The equation tells us that this substitution is safe and preserves the original behavior of our program.

Now, in order to simplify this further we must consult the definition of both replicate and sequence_. When in doubt which one to pick, always pick the outermost function because Haskell is lazy and evaluates everything from outside to in.

In this case, our outermost function is sequence_, defined here:

-- | Combine a list of actions into a single action
sequence_ :: [IO ()] -> IO ()  -- I've simplified the type
sequence_ ms = foldr (>>) (return ()) ms

We will substitute this into our main, noting that ms is replicate 3 (print 1) for the purpose of this substitution:

main = foldr (>>) (return ()) (replicate 3 (print 1))

Now foldr is our outermost function, so we'll consult the definition of foldr:

-- | 'foldr k z' replaces all `(:)`s with `k`
--   and replaces `[]` with `z`
foldr :: (a -> b -> b) -> b -> [a] -> b
foldr k z  []    = z
foldr k z (x:xs) = k x (foldr k z xs)
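
The comment captures the essence of foldr. For example:

>>> foldr (+) 0 [1, 2, 3]  -- 1 + (2 + (3 + 0))
6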

Here we see two equations. Both equations work both ways, but we don't know which equation to apply unless we know whether the third argument to foldr is an empty list or not. We must evaluate the call to replicate to see whether it will pass us an empty list or a non-empty list, so we consult the definition of replicate:

-- | Create a list containing `x` repeated `n` times
replicate :: Int -> a -> [a]
replicate n x = take n (repeat x)

-- Apply the substitution, using these values:
-- n = 3
-- x = print 1

main = foldr (>>) (return ()) (take 3 (repeat (print 1)))

Boy, this rabbit hole keeps getting deeper and deeper! However, we must persevere. Let's now consult the definition of take:

-- | Take the first `n` elements from a list
take :: Int -> [a] -> [a]
take n _      | n <= 0 =  []
take _ []              =  []
take n (x:xs)          =  x : take (n-1) xs

Here we see three equations. The first one has a predicate, saying that the equality is only valid if n is less than or equal to 0. In our case n is 3, so we skip that equation. However, we cannot distinguish which of the latter two equations to use unless we know whether repeat (print 1) produces an empty list or not, so we must consult the definition of repeat:

-- | `repeat x` creates an infinite list of `x`s
repeat :: a -> [a]
repeat x = x:repeat x  -- Simplified from the real definition

-- Apply the substitution, using these values:
-- x = print 1

main = foldr (>>) (return ()) (take 3 (print 1:repeat (print 1)))

The buck stops here! Although repeat is infinitely recursive, we don't have to fully evaluate it. We can just evaluate it once and lazily defer the recursive call, since all we need to know for now is that the list has at least one value. This now provides us with enough information to select the third equation for take that requires a non-empty list as its argument:

take n (x:xs)          =  x : take (n-1) xs

-- Apply the substitution, using these values:
-- n  = 3
-- x  = print 1
-- xs = repeat (print 1)

main = foldr (>>) (return ()) (print 1:take 2 (repeat (print 1)))

Now we know for sure that foldr's third argument is a non-empty list, so we can select the second equation for foldr:

foldr k z (x:xs) = k x (foldr k z xs)

-- Apply the substitution, using these values:
-- k  = (>>)
-- z  = return ()
-- x  = print 1
-- xs = take 2 (repeat (print 1))

main =
    (>>) (print 1)
         (foldr (>>) (return ()) (take 2 (repeat (print 1))))

-- Note: "(>>) x y" is the same thing as "x >> y"
main = print 1 >> foldr (>>)
                        (return ())
                        (take 2 (repeat (print 1)))

-- Note: "x >> y" is the same thing as "do { x; y }
main = do
    print 1
    foldr (>>) (return ()) (take 2 (repeat (print 1)))

Now our Haskell runtime knows enough to realize that it needs to print a single 1. The runtime will execute that first print statement before further evaluating the call to foldr.

We can repeat this process two more times, cycling through evaluating repeat, take, and foldr, emitting two additional print commands:

-- Evaluate `repeat`
main = do
    print 1
    foldr (>>) (return ()) (take 2 (print 1:repeat (print 1)))

-- Evaluate `take`
main = do
    print 1
    foldr (>>) (return ()) (print 1:take 1 (repeat (print 1)))

-- Evaluate `foldr`
main = do
    print 1
    print 1
    foldr (>>) (return ()) (take 1 (repeat (print 1)))

-- Evaluate `repeat`
main = do
    print 1
    print 1
    foldr (>>) (return ()) (take 1 (print 1:repeat (print 1)))

-- Evaluate `take`
main = do
    print 1
    print 1
    foldr (>>) (return ()) (print 1:take 0 (repeat (print 1)))

-- Evaluate `foldr`
main = do
    print 1
    print 1
    print 1
    foldr (>>) (return ()) (take 0 (repeat (print 1)))

Now something different will happen. Let's revisit our definition of take:

take n _      | n <= 0 =  []
take _ []              =  []
take n (x:xs)          =  x : take (n-1) xs

This time the first equation matches, because n is now equal to 0. However, we also know the third equation would match, because repeat will emit a new element. Whenever more than one equation matches, Haskell picks the first one, so we use the first equation to substitute the call to take with an empty list:

main = do
    print 1
    print 1
    print 1
    foldr (>>) (return ()) []

This then triggers the first equation for foldr:

foldr k z [] = z

-- Therefore:
main = do
    print 1
    print 1
    print 1
    return ()

There it is! We've fully expanded out our use of replicateM_ to prove that it prints the number 1 three times. For reasons I won't go into, we can also remove the final return () and finish with:

main = do
    print 1
    print 1
    print 1

Equational reasoning - Part 1

Notice how Haskell has a very straightforward way to interpret all code: substitution. If you can substitute equals-for-equals then you can interpret a Haskell program on pen and paper.

Equally cool: we never needed to understand the language runtime to know what our code did. The Haskell language has a very limited role: ensure that substitution is valid and otherwise get out of our way.

Unlike imperative languages, there are no extra language statements such as for or break that we need to understand in order to interpret our code. Our printing "loop" that repeated three times was just a bunch of ordinary Haskell code. This is a common theme in Haskell: "language features" are usually ordinary libraries.

Also, had we tried the same pen-and-paper approach to interpreting an imperative language we would have had to keep track of temporary values somewhere in the margins while evaluating our program. In Haskell, all the "state" of our program resides within the expression we are evaluating, and in our case the "state" was the integer argument to take that we threaded through our substitutions. Haskell code requires less context to understand than imperative code because Haskell expressions are self-contained.

We didn't need to be told to keep track of state. We just kept mindlessly applying substitution and perhaps realized after the fact that what we were doing was equivalent to a state machine. Indeed, state is implemented within the language just like everything else.


Proof reduction

Proving the behavior of our code was really tedious, and we're usually more interested in proving generally reusable properties than in deducing the behavior of specific, disposable programs. Proofs take a lot of effort, so we want to pick our battles wisely and only spend time proving equations that we can reuse heavily.

So I will propose that we should only really bother to prove the following four equations in order to cover most common uses of replicateM_:

-- replicateM_ "distributes" over addition
replicateM_  0      x = return ()
replicateM_ (m + n) x = replicateM_ m x >> replicateM_ n x

-- replicateM_ "distributes" over multiplication
replicateM_  1      = id
replicateM_ (m * n) = replicateM_ m . replicateM_ n

The last two equations are written in a "point-free style" to emphasize that replicateM_ distributes in a different way over multiplication. If you expand those two equations out to a "point-ful style" you get:

replicateM_  1      x = x
replicateM_ (m * n) x = replicateM_ m (replicateM_ n x)

If we could prove these equations, then we could much more easily deduce how replicateM_ behaves for our example program, because we could transform our code more rapidly like this:

replicateM_ 3 (print 1)

-- 3 = 1 + 1 + 1
= replicateM_ (1 + 1 + 1) (print 1)

-- replicateM_ distributes over addition
= do replicateM_ 1 (print 1)
     replicateM_ 1 (print 1)
     replicateM_ 1 (print 1)

-- replicateM_ 1 x = x
= do print 1
     print 1
     print 1

These four master equations are still very tedious to prove all in one go, but we can break this complex task into smaller bite-sized tasks. As a bonus, this divide-and-conquer approach will also produce several other useful and highly reusable equations along the way.

Let's begin by revisiting the definition of replicateM_:

replicateM_ n x = sequence_ (replicate n x)

Like many Haskell functions, replicateM_ is built from two smaller composable pieces: sequence_ and replicate. So perhaps we can also build our proofs from smaller and composable proofs about the individual behaviors of sequence_ and replicate.

Indeed, replicate possesses a set of properties that are remarkably similar to replicateM_:

-- replicate distributes over addition
replicate  0      x = []
replicate (m + n) x = replicate m x ++ replicate n x

-- replicate distributes over multiplication
replicate  1      = return
replicate (m * n) = replicate m <=< replicate n

Again, the last two equations can be expanded to a "point-ful" form:

-- `return x` is `[x]` for lists
replicate 1 x = [x]

-- `(<=<)` is like a point-free `concatMap` for lists
replicate (m * n) x = concatMap (replicate m) (replicate n x)
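
Before proving these equations we can spot-check them in GHCi:

>>> replicate (2 + 3) 'x' == replicate 2 'x' ++ replicate 3 'x'
True
>>> replicate (2 * 3) 'x' == concatMap (replicate 2) (replicate 3 'x')
True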

These four replicate equations are easier to prove than the corresponding replicateM_ equations. If we can somehow prove these simpler equations, then all we have to do is then prove that sequence_ lifts all of the replicate proofs into the equivalent replicateM_ proofs:

-- sequence_ lifts list concatenation to sequencing
sequence_  []        = return ()
sequence_ (xs ++ ys) = sequence_ xs >> sequence_ ys

-- (sequence_ .) lifts list functions to ordinary functions
sequence_ .  return   = id
sequence_ . (f <=< g) = (sequence_ . f) . (sequence_ . g)

-- Point-ful version of last two equations:
sequence_ [x] = x
sequence_ (concatMap f (g x)) = sequence_ (f (sequence_ (g x)))
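
A quick GHCi example gives a concrete feel for the first two sequence_ equations:

>>> sequence_ ([print 1] ++ [print 2])  -- sequence_ [print 1] >> sequence_ [print 2]
1
2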

It might not be obvious at first, but the above four equations for sequence_ suffice to transform the replicate proofs into the analogous replicateM_ proofs. For example, this is how we would prove the first replicateM_ equation in terms of the first replicate equation and the first sequence_ equation:

replicateM_ 0 x

-- Definition of `replicateM_`
= sequence_ (replicate 0 x)

-- replicate 0 x = []
= sequence_ []

-- sequence_ [] = return ()
= return ()

That was simple. It is only slightly more tricky to prove the second equation:

replicateM_ (m + n) x

-- Definition of `replicateM_`
= sequence_ (replicate (m + n) x)

-- replicate (m + n) x = replicate m x ++ replicate n x
= sequence_ (replicate m x ++ replicate n x)

-- sequence_ (xs ++ ys) = sequence_ xs >> sequence_ ys
= sequence_ (replicate m x) >> sequence_ (replicate n x)

-- Definition of `replicateM_`, in reverse!
= replicateM_ m x >> replicateM_ n x

Notice how the last step of the proof involves using the original replicateM_ equation, but instead substituting from right-to-left!

--  +-- We can substitute this way --+
--  |                                |
--  ^                                v

replicateM_ n x = sequence_ (replicate n x)

--  ^                                v
--  |                                |
--  +--------- Or this way! ---------+

This is a nice example of the utility of bidirectional substitution. We can replace the body of a function with the equivalent function call.

The third replicateM_ equation is also simple to prove in terms of the third replicate and sequence_ equations. I will use the point-ful forms of all these equations for simplicity:

replicateM_ 1 x

-- Definition of `replicateM_`
= sequence_ (replicate 1 x)

-- replicate 1 x = [x]
= sequence_ [x]

-- sequence_ [x] = x
= x

The proof of the fourth property is surprisingly short, too:

replicateM_ (m * n) x

-- Definition of `replicateM_`
= sequence_ (replicate (m * n) x)

-- replicate (m * n) x = concatMap (replicate m) (replicate n x)
= sequence_ (concatMap (replicate m) (replicate n x))

-- sequence_ (concatMap f (g x)) = sequence_ (f (sequence_ (g x)))
= sequence_ (replicate m (sequence_ (replicate n x)))

-- Definition of `replicateM_`, in reverse
= replicateM_ m (replicateM_ n x)

Equational reasoning - Part 2

We reduced our proofs of the replicateM_ properties to smaller proofs of replicate and sequence_ properties. The overhead of this proof reduction is tiny, and we gain the benefit of being able to reuse the proofs for replicate and sequence_.

As programmers we try to reuse code when we program and the way we promote code reuse is to divide programs into smaller composable pieces that are more reusable. Likewise, we try to reuse proofs when we equationally reason about code and the way we encourage proof reuse is to divide larger proofs into smaller proofs using proof reduction. In the above example we reduced the four equations for replicateM_ into four equations for replicate and four equations for sequence_. These smaller equations are equally useful in their own right and they can be reused by other people as sub-proofs for their own proofs.

However, proof reuse also faces the same challenges as code reuse. When we break up code into smaller pieces sometimes we take things too far and create components we like to think are reusable but really aren't. Similarly, when we reduce proofs sometimes we pick sub-proofs that are worthless and only add more overhead to the entire proof process. How can we sift out the gold from the garbage?

I find that the most reusable proofs are category laws or functor laws of some sort. In fact, every single proof from the previous section was a functor law in disguise. To learn more about functor laws and how they arise everywhere you can read another post of mine about the functor design pattern.


Proof techniques

This section will walk through the complete proofs for the replicate equations to provide several worked examples and to also illustrate several useful proof tricks.

I deliberately write these proofs to be reasonably detailed and to skip as few steps as possible. In practice, though, proofs become much easier the more you equationally reason about code because you get better at taking larger steps.

Let's revisit the equations we wish to prove for replicate, in point-ful form:

replicate  0      x = []
replicate (m + n) x = replicate m x ++ replicate n x

replicate  1      x = [x]
replicate (m * n) x = concatMap (replicate m) (replicate n x)

... where concatMap is defined as:

concatMap :: (a -> [b]) -> [a] -> [b]
concatMap f = foldr ((++) . f) []
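
For example:

>>> concatMap (replicate 2) [1, 2, 3]
[1,1,2,2,3,3]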

Now we must use the same equational reasoning skills we developed in the first section to prove all four of these equations.

The first equation is simple:

replicate 0 x

-- Definition of `replicate`
= take 0 (repeat x)

-- Definition of `take`
= []

-- Proof complete

The second equation is trickier:

replicate (m + n) x

-- Definition of `replicate`
= take (m + n) (repeat x)

We can't proceed further unless we know whether or not m + n is greater than 0. For simplicity we'll assume that m and n are non-negative.

We then do something analogous to "case analysis" on m, pretending it is like a Peano number. That means that m is either 0 or positive (i.e. greater than 0). We'll first prove our equation for the case where m equals 0:

-- Assume: m = 0
= take (0 + n) (repeat x)

-- 0 + n = n
= take n (repeat x)

-- Definition of `(++)`, in reverse
= [] ++ take n (repeat x)

-- Definition of `take`, in reverse
= take 0 (repeat x) ++ take n (repeat x)

-- m = 0
= take m (repeat x) ++ take n (repeat x)

-- Definition of `replicate`, in reverse
= replicate m x ++ replicate n x

-- Proof complete for m = 0

Then there is the second case, where m is positive, meaning that we can represent m as 1 plus some other non-negative number m':

-- Assume: m = 1 + m'
= take (1 + m' + n) (repeat x)

-- Definition of `repeat`
= take (1 + m' + n) (x:repeat x)

-- Definition of `take`
= x:take (m' + n) (repeat x)

-- Definition of `replicate` in reverse
= x:replicate (m' + n) x

Now we can use induction to reuse the original premise since m' is strictly smaller than m. Since we are assuming that m is non-negative this logical recursion is well-founded and guaranteed to eventually bottom out at the base case where m equals 0:

-- Induction: reuse the premise
= x:(replicate m' x ++ replicate n x)

-- Definition of `(++)`, in reverse
= (x:replicate m' x) ++ replicate n x

-- Definition of `replicate`
= (x:take m' (repeat x)) ++ replicate n x

-- Definition of `take`, in reverse
= take (1 + m') (repeat x) ++ replicate n x

-- Definition of `replicate`, in reverse
= replicate (1 + m') x ++ replicate n x

-- m = 1 + m', in reverse
= replicate m x ++ replicate n x

-- Proof complete for m = 1 + m'

This completes the proof for both cases so the proof is "total", meaning that we covered all possibilities. Actually, that's a lie because really rigorous Haskell proofs must account for the possibility of non-termination (a.k.a. "bottom"). However, I usually consider proofs that don't account for non-termination to be good enough for most practical purposes.

The third replicate law is very straightforward to prove:

replicate 1 x

-- Definition of `replicate`
= take 1 (repeat x)

-- Definition of `repeat`
= take 1 (x:repeat x)

-- Definition of `take`
= x:take 0 (repeat x)

-- Definition of `take`
= x:[]

-- [x] is syntactic sugar for `x:[]`
= [x]

The fourth equation for replicate also requires us to split our proof into two branches. Either n is zero or greater than zero. First we consider the case where n is zero:

replicate (m * n) x

-- Assume: n = 0
= replicate 0 x

-- replicate 0 x = []
= []

-- Definition of `foldr`, in reverse
= foldr ((++) . replicate m) [] []

-- Definition of `concatMap`, in reverse
= concatMap (replicate m) []

-- replicate 0 x = [], in reverse
= concatMap (replicate m) (replicate 0 x)

-- n = 0, in reverse
= concatMap (replicate m) (replicate n x)

-- Proof complete for n = 0

Then we consider the case where n is greater than zero:

replicate (m * n) x

-- Assume: n = 1 + n'
= replicate (m * (1 + n')) x

-- m * (1 + n') = m + m * n'
= replicate (m + m * n') x

-- replicate distributes over addition
= replicate m x ++ replicate (m * n') x

-- Induction: reuse the premise
= replicate m x ++ concatMap (replicate m) (replicate n' x)

-- Definition of `concatMap`
= replicate m x ++ foldr ((++) . replicate m) [] (replicate n' x)

-- Definition of `foldr`, in reverse
= foldr ((++) . replicate m) [] (x:replicate n' x)

-- Definition of `concatMap`, in reverse
= concatMap (replicate m) (x:replicate n' x)

-- Definition of `replicate`
= concatMap (replicate m) (x:take n' (repeat x))

-- Definition of `take`, in reverse
= concatMap (replicate m) (take (1 + n') (x:repeat x))

-- n = 1 + n', in reverse
= concatMap (replicate m) (take n (x:repeat x))

-- Definition of `repeat`, in reverse
= concatMap (replicate m) (take n (repeat x))

-- Definition of `replicate`, in reverse
= concatMap (replicate m) (replicate n x)

-- Proof complete for n = 1 + n'

Hopefully these proofs give an idea for the amount of effort involved to prove properties of moderate complexity. I omitted the final part of proving the sequence_ equations in the interest of space, but they make for a great exercise.


Equational reasoning - Part 3

Reasoning about Haskell differs from reasoning about code in other languages. Traditionally, reasoning about code would require:

  • building a formal model of a program's algorithm,
  • reasoning about the behavior of the formal model using its axioms, and
  • proving that the program matches the model.

In a purely functional language like Haskell you formally reason about your code within the language itself. There is no need for a separate formal model because the code is already written as a bunch of formal equations. This is what people mean when they say that Haskell has strong ties to mathematics, because you can reason about Haskell code the same way you reason about mathematical equations.

This is why Haskell syntax for function definitions deviates from mainstream languages. All function definitions are just equalities, which is why Haskell is great for equational reasoning.

This post illustrated how equational reasoning in Haskell can scale to larger complexity through the use of proof reduction. A future post of mine will walk through a second tool for reducing proof complexity: type classes inspired by mathematics that come with associated type class laws.

Thursday, December 19, 2013

Lift error handling with lens-like syntax

One of the main deficiencies of transformers is that the MonadTrans class does not let you lift functions like catchError. The mtl library provides one solution to this problem, which is to type-class catchError and throwError using the MonadError type class.

That's not to say that transformers has no solution for lifting catchError and throwError; it's just really verbose. Each module provides a liftCatch function that you use to lift a catchError function from the base monad to the transformed monad.

For example, Control.Monad.Trans.State provides the following liftCatch function:

Control.Monad.Trans.State.liftCatch
    :: (m (a, s) -> (e -> m (a, s)) -> m (a, s))
    -> StateT s m a -> (e -> StateT s m a) -> StateT s m a

If your monad transformer stack was:

myStack :: (Monad m, Error e) => StateT s (ErrorT e m) a

... then you would apply liftCatch directly to catchError to get a catch function that worked at the StateT layer:

Control.Monad.Trans.State.liftCatch catchError
    :: (Monad m, Error e)
    => StateT s (ErrorT e m) a
    -> (e -> StateT s (ErrorT e m) a)
    -> StateT s (ErrorT e m) a

But what if you had a WriterT layer in between like this:

myStack :: (Monad m, Error e) => StateT s (WriterT w (ErrorT e m)) a

You'd have to use liftCatch from the Control.Monad.Trans.Writer module to further lift the catchError to work with this stack:

import Control.Monad.Trans.State as S
import Control.Monad.Trans.Writer as W

S.liftCatch (W.liftCatch catchError)
    :: (Monad m, Error e)
    => StateT s (WriterT w (ErrorT e m)) a
    -> (e -> StateT s (WriterT w (ErrorT e m)) a)
    -> StateT s (WriterT w (ErrorT e m)) a

The advantage of this solution is that type inference works extraordinarily well and it is Haskell98. The disadvantage of this solution is that it is very verbose.

So I will propose a less verbose solution that has a syntax resembling the lens library, although there will be no true lenses involved (I think). Define the following catching function:

catching
 :: (Monad m, Error e)
 => ((ErrorT e m a -> (e -> ErrorT e m a) -> ErrorT e m a) -> r)
 -> r
catching k = k catchError
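
To get a feel for this, note that when there are no transformer layers to lift through, k is just id, and catching collapses to the ordinary catchError:

-- catching id = id catchError = catchError
catching id
    :: (Monad m, Error e)
    => ErrorT e m a -> (e -> ErrorT e m a) -> ErrorT e m a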

The k in the above function will be a series of composed liftCatch functions that we will apply to catchError. However, in the spirit of the lens library we will rename these liftCatch functions to be less verbose and more hip and sexy:

import qualified Control.Monad.Trans.Maybe  as M
import qualified Control.Monad.Trans.Reader as R
import qualified Control.Monad.Trans.State  as S
import qualified Control.Monad.Trans.Writer as W
import qualified Pipes.Lift                 as P

state  = S.liftCatch

writer = W.liftCatch

reader = R.liftCatch

maybe  = M.liftCatch

pipe   = P.liftCatchError

Now let's say that we have a monad transformer stack of type:

myStack
    :: (Monad m, Error e)
    => StateT s (MaybeT (ErrorT e m)) r

If I want to catch something at the outer StateT level, I would just write:

myStack = catching (state.maybe) action $ \e -> do
    -- The handler if `action` fails
    ...
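
Chasing the liftCatch types confirms that the composed reference hands catching exactly the right type for this stack, reading outermost layer first:

catching (state.maybe)
    :: (Monad m, Error e)
    => StateT s (MaybeT (ErrorT e m)) a
    -> (e -> StateT s (MaybeT (ErrorT e m)) a)
    -> StateT s (MaybeT (ErrorT e m)) a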

If I add more layers to my monad transformer stack, all I have to do is point deeper into the stack by composing more references:

myStack
    :: (Monad m, Error e, Monoid w)
    => Pipe a b (StateT s (WriterT w (MaybeT m))) r
myStack = catching (pipe.state.writer.maybe) action $ \e -> ...

Also, these references are first-class Haskell values, so you can bundle them and manipulate them with your favorite functional tools:

myStack = do
    let total = pipe.state.writer.maybe
    catching total action1 $ \e -> ...
    catching total action2 $ \e -> ...

This approach has a few advantages over the traditional MonadError approach:

  • You get improved type inference
  • You get type errors earlier in development. With MonadError the compiler will not detect an error until you try to run your monad transformer stack.
  • You get better type errors. MonadError type errors arise at a distance, at the call to runErrorT, even though the logical error is probably at the site of the catchError function.
  • Functional references are first class and type classes are not

However, we don't want to lift just catchError. There are many other functions that transformers can lift such as local, listen, and callCC. An interesting question would be whether there is some elegant abstraction that packages all these lifting operations into a simple type in the same way that lenses package getters, setters, traversals, maps, and zooms into a single abstraction. If there were, then we could reuse the same references for catching, listening, and other operations that are otherwise difficult to lift:

listening (state.error) m

passing (maybe.writer) m

catching (writer.maybe) m $ \e -> ...

I have no idea if such an elegant abstraction exists, though, which is why I'm writing this post to solicit ideas.

Friday, November 1, 2013

Test stream programming using Haskell's `QuickCheck`

pipes is a stream programming library built on top of a foundation of basic category theory. The core of the library consists of a set of five categories that all intersect in a single streaming data type and the library's contract is the set of associated category laws.

For example, one such category is the "respond category", which pipes uses to implement for loops and ListT. The two key operations are (/>/), which is the category composition operator, and respond, which is the identity. These must satisfy the following category laws:

respond />/ f = f                  -- Left Identity

f />/ respond = f                  -- Right Identity

(f />/ g) />/ h = f />/ (g />/ h)  -- Associativity

Previously, I described how I manually proved the category laws using equational reasoning, which elicited a strong response from readers that I should also verify the laws empirically using Haskell's QuickCheck library. After all, Haskell's QuickCheck library shines at automated property-based testing. Why not just fire up QuickCheck and test something like:

> quickCheck $ \f -> f />/ respond == f

However, this leads to several problems:

  • You can't compare pipes for equality
  • QuickCheck can't Show pipes when it discovers a counter-example
  • You can't generate Arbitrary pipes

The last is the most challenging problem of the three to solve: how do we generate random pipes to test? This has to be done in such a way that it exercises the system and efficiently discovers corner cases.

Randomizing pipes

I decided to try encoding a random pipe as a random sequence of operations (specifically Kleisli arrows). This actually works out quite well, because we already have two natural operations built into the core bidirectional API: request and respond. Both of them pass information through twice (once in each direction), as illustrated by the following ASCII diagrams:

request :: Monad m => a' -> Proxy a' a y' y m a

          a'
          |
     +----|----+
     |    |    |
 a' <=====/    |
     |         |
 a  ======\    |
     |    |    |
     +----|----+
          v
          a


respond :: Monad m => a  -> Proxy x' x a' a m a'

          a
          |
     +----|----+
     |    |    |
     |    \ /==== a'
     |     X   |
     |    / \===> a
     |    |    |
     +----|----+
          v 
          a'

If we generate a random sequence of these operations, we just connect them head to toe. For simplicity, I will just assume that all inputs and outputs are of type Int:

pRandom1 = respond >=> request >=> respond

         Int
          |
     +----|----+
     |    |    |
     |    \ /==== Int
     |     X   |
     |    / \===> Int
     |    |    |
     +----|----+
          v 
         Int
          |
     +----|----+
     |    |    |
Int <=====/    |
     |         |
Int ======\    |
     |    |    |
     +----|----+
          v
         Int
          |
     +----|----+
     |    |    |
     |    \ /==== Int
     |     X   |
     |    / \===> Int
     |    |    |
     +----|----+
          v 
         Int

When we compose proxies using something like (>+>) (the bidirectional generalization of (>->)), we conceptually place these random chains side by side and match inputs with outputs. For example, if we generate the following two random sequences of request and respond:

pRandom1 = respond >=> request >=> respond
pRandom2 = request >=> request

... then this generates the following flow of information, beginning from the top-right Int.

pRandom1 >+> pRandom2:

                                  Int
                                   |
                              +----|----+
                              |    |    |
         Int<======================/    |
          |                   |         |
     +----|----+            /======\    |
     |    |    |           /  |    |    |
     |    \ /==== Int <=\ /   +----|----+
     |     X   |         X         |
     |    / \===> Int ==/ \        v
     |    |    |          |       Int
     +----|----+          |        |
          v               \   +----|----+
         Int               \  |    |    |
          |                 \======/    |
     +----|----+              |         |
     |    |    |            /======\    |
... <=====/    |           /  |    |    |
     |         |          /   +----|----+
... ======\    |          |        v
     |    |    |          |       Int
     +----|----+          |
          v               |
         Int              |
          |               |
     +----|----+          |
     |    |    |          |
     |    \ /==== Int     /
     |     X   |         /
     |    / \===> Int ==/
     |    |    |
     +----|----+
          v 
         Int

This shows how you can easily generate complex flows of information to exercise the API with just a few simple permutations of request and respond.

However, notice how the Int flows through the pipeline without interruption until the pipeline terminates. This means that unless we do something else we will always return the same Int we began with.

So we add two additional actions to our random repertoire which will provide diagnostic information about which paths we took. The first one is inc, which increments the Int flowing through it:

inc :: (Monad m) => Int -> m Int
inc n = return (n + 1)

The second is log, which stores the current value of the Int using a Writer [] base monad and reforwards the Int further along:

log :: Int -> Proxy a' a b' b (Writer [Int]) Int
log n = do
    lift (tell [n])
    return n

Now we can easily generate random Proxys just by assembling random chains of these four actions:

pRandom = log >=> inc >=> request >=> inc >=> inc >=> ...

We can use this same trick to generate arbitrary Servers and Clients (i.e. one-sided Proxys) by just selectively removing either request or respond from the list of permissible actions.

Once we have random Proxys, Servers, and Clients, we can test for equality by testing the values that each side returns and logs when given an initial value of 0:

p1 === p2 = run p1 == run p2
  where
    run p = runWriter (runEffect (p 0))

infix 4 ===
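
As a quick sanity check, a chain that never requests or responds is already a self-contained Effect, so we can run it by hand (here log is our definition above, shadowing the Prelude's log):

>>> let p = log >=> inc >=> log
>>> runWriter (runEffect (p 0))
(1,[0,1])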

This comparison is pure because the Writer [] base monad is pure, so we can pass it as a suitable property that QuickCheck can test. Well, almost...

We also need to be able to Show the randomized values that we selected so that QuickCheck can print any counter-examples it discovers. The solution, though, is pretty simple: we can use an intermediate representation that is just an enumeration, storing a placeholder for each action in our chain, and these placeholders are Showable:

-- Place holder for a single `Proxy` action
data ProxyStep
    = ProxyRequest
    | ProxyRespond
    | ProxyLog
    | ProxyInc
    deriving (Enum, Bounded)

instance Arbitrary ProxyStep where
    arbitrary = arbitraryBoundedEnum
    shrink _  = []

instance Show ProxyStep where
    show x = case x of
        ProxyRequest -> "request"
        ProxyRespond -> "respond"
        ProxyLog     -> "log"
        ProxyInc     -> "inc"

-- Place holder for a single `Client` action
data ClientStep
    = ClientRequest
    | ClientLog
    | ClientInc
    deriving (Enum, Bounded)

instance Arbitrary ClientStep where
    arbitrary = arbitraryBoundedEnum
    shrink _  = []

instance Show ClientStep where
    show x = case x of
        ClientRequest -> "request"
        ClientLog     -> "log"
        ClientInc     -> "inc"

-- Place holder for a single `Server` action
data ServerStep
    = ServerRespond
    | ServerLog
    | ServerInc
    deriving (Enum, Bounded)

instance Arbitrary ServerStep where
    arbitrary = arbitraryBoundedEnum
    shrink _  = []

instance Show ServerStep where
    show x = case x of
        ServerRespond -> "respond"
        ServerLog     -> "log"
        ServerInc     -> "inc"

Then we tell QuickCheck to generate random lists of these actions which will form our complete Proxys, Clients, or Servers:

correct :: String -> String
correct str = case str of
    [] -> "return"
    _  -> str

-- Place holder for a `Proxy`
newtype AProxy = AProxy { unAProxy :: [ProxyStep] }

instance Arbitrary AProxy where
    arbitrary = fmap AProxy arbitrary
    shrink = map AProxy . shrink . unAProxy

instance Show AProxy where
    show = correct . intercalate " >=> " . map show . unAProxy

-- Place holder for a `Client`
newtype AClient = AClient { unAClient :: [ClientStep] }

instance Arbitrary AClient where
    arbitrary = fmap AClient arbitrary
    shrink = map AClient . shrink . unAClient

instance Show AClient where
    show = correct . intercalate " >=> " . map show . unAClient

-- Place holder for a `Server`
newtype AServer = AServer { unAServer :: [ServerStep] }

instance Arbitrary AServer where
    arbitrary = fmap AServer arbitrary
    shrink = map AServer . shrink . unAServer

instance Show AServer where
    show = correct . intercalate " >=> " . map show . unAServer
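
For example, these instances render a placeholder chain the same way we would write the real one by hand:

>>> show (AProxy [ProxyLog, ProxyInc, ProxyRequest])
"log >=> inc >=> request"
>>> show (AProxy [])
"return"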

Once QuickCheck generates these randomized lists of actions we can convert them to honest-to-god Proxys/Servers/Clients for testing purposes using the following conversion functions:

aProxy
    :: AProxy -> Int -> Proxy Int Int Int Int (Writer [Int]) Int
aProxy = foldr (>=>) return . map f . unAProxy
  where
    f x = case x of
        ProxyRequest -> request
        ProxyRespond -> respond
        ProxyLog     -> log
        ProxyInc     -> inc

aClient :: AClient -> Int -> Client Int Int (Writer [Int]) Int
aClient = foldr (>=>) return . map f . unAClient
  where
    f x = case x of
        ClientRequest -> request
        ClientLog     -> log
        ClientInc     -> inc

aServer :: AServer -> Int -> Server Int Int (Writer [Int]) Int
aServer = foldr (>=>) return . map f . unAServer
  where
    f x = case x of
        ServerRespond -> respond
        ServerLog     -> log
        ServerInc     -> inc

Test-driving pipes

Combined with the previous (===) function for testing equality, I can now test high-level properties like associativity:

main = quickCheck $ \p1' p2' p3' ->
    let p1 = aServer p1'
        p2 = aProxy  p2'
        p3 = aClient p3'
    in  p1 >+> (p2 >+> p3) === (p1 >+> p2) >+> p3

QuickCheck then generates 100 random test cases and verifies that all of them obey the associativity law:

>>> main
++ OK, passed 100 tests.

However, this is still not enough. Perhaps my randomization scheme is simply not exercising corner cases sufficiently well. To really convince myself that I have a good randomization scheme I must try a few negative controls to see how effectively QuickCheck uncovers property violations.

For example, let's suppose that some enterprising young NSA employee were to try to commit a modification pull' of the identity pipe pull that logs the second value flowing upstream. We could set up a test case to warn us if the modified pull' function failed to obey the identity law:

main = quickCheck $ \p1' p2' ->
    let p1    = aServer p1'

        -- The maliciously modified `pull` function
        pull' = request >=> respond >=> log >=> pull'

        p2    = aClient p2'
    in  p1 >+> (pull' >+> p2) === p1 >+> p2

QuickCheck would then snoop this out in a hurry and catch the law violation:

*** Failed! Falsifiable (after 12 tests and 6 shrinks):    
respond
request >=> request

Not only does QuickCheck detect violations, but it also goes out of its way to minimize the violation to the minimum reproducing test case. In this case, the way you read the QuickCheck output is that the minimal code necessary to trigger the violation is when:

p1 = respond
p2 = request >=> request

In other words, QuickCheck detects a spurious log on the left-hand side of the equation if p2 requests two values and p1 responds with at least one value. Notice that if p1 did not respond with at least one value then the left pipeline would terminate before p2's second request and avoid triggering the log statement.

QuickCheck can do this kind of minimization because of purity. Since our test case is pure, QuickCheck can safely run it repeatedly as it tries to shrink the counter-example, without having to worry that repeated runs will interfere with each other because of side effects or statefulness.

Here's another example, where we accidentally wrote pull wrong and inserted one request too many:

main = quickCheck $ \p1' p3' ->
    let p1    = aServer p1'
        pull' = request >=> respond >=> request >=> pull
        p3    = aClient p3'
    in  p1 >+> (pull' >+> p3) === p1 >+> p3

QuickCheck immediately pulls out this clever counter-example:

*** Failed! Falsifiable (after 15 tests and 8 shrinks):    
respond >=> respond
request >=> request >=> inc

This instructs p1 and p3 to exchange information twice. This triggers pull' to accidentally request one value too many after the second exchange and terminate early before p3 can call inc.

Examples like these give me confidence that permutations on these four actions suffice to build most useful counter-examples. I could probably even narrow it down to three commands by eliminating log, but for now I will keep it.

Conclusion

I believe this is the first example of a useful Arbitrary instance for randomizing a stream programming data type. This allows pipes to test more powerful properties than most stream programming libraries, which would normally settle for randomizing input to the system instead of randomizing the flow of control.

I want to thank Csernik Flaviu Andrei who took the time to write up all the pipes laws into a QuickCheck suite integrated with cabal test. Thanks to him you can now have even greater confidence in the correctness of the pipes library.

The next step would be to machine-check the pipes laws using Agda, which would provide the greatest assurance of correctness.

Saturday, October 12, 2013

An all-atom protein search engine powered by Haskell

This post discusses a Haskell-based project that is the central component of my thesis: a fast, atomic-level structural search engine named "Suns". I will discuss what problem this search engine solves, why I chose Haskell for this project, and what were the pros and cons of my language choice.


The Challenge

Proteins are amazing molecules that control biology on the molecular level. If you are new to cellular biology, you can get an immediate sense for how diverse and sophisticated they are by watching The Inner Life of the Cell, an artistic rendition of several diverse processes controlled by proteins, including:

  • cellular adhesion (how cells recognize and stick to each other)
  • forming a skeleton for the cell
  • assembling and disassembling cellular highways
  • trafficking cargo along those highways

In addition, proteins also:

  • sense the molecular environment
  • perform chemistry
  • are cellular weapons
  • and much, much more!

I am a graduate student in a research lab that designs new proteins for both medical treatments and scientific study. Unfortunately, proteins are significantly harder to design than software. A typical protein can be very chemically complex and tightly integrated:

A protein structure in atomic detail

Proteins are tangled ropes of amino acids, so they are the epitome of spaghetti code. However, it's not as bad as it sounds. Within these tangled balls we see several patterns emerge on small, medium, and large scales.

On the large scale you have "protein domains". I liken these to high-level scripts: self-contained blobs of functionality with little coupling or state. They are often easy to connect together to generate new functionality:

Surface representation of two connected protein domains, colored green and cyan

On the medium scale you have "secondary structure", consisting primarily of alpha helices and beta strands, which form locally repetitive structures. I liken these to C code: low-level, but still reasonably portable. Our lab has historically had success transplanting patterns in secondary structure to new contexts, generating new functionality.

Secondary structure representation of a protein composed primarily of beta strands

On the small scale you have "motifs", small configurations of a few amino acids that frequently occur in natural proteins. I liken these to assembly code: they are very tightly integrated and pack a lot of utility into a small space, but they are not very portable because they have very precise chemical and geometric requirements and little room for customization:

An example motif, with a geometrically well-defined hydrogen-bonding pattern

Our lab needs to design proteins on this small and compact scale, so they use my search engine to discover and incorporate these small motifs into their designed proteins. The search engine overcomes the "portability issues" of these low-level motifs by finding existing motifs that are exactly geometrically and chemically compatible with partial blueprints.

Overview

The most popular interface to the search engine is through PyMOL, a molecular graphics system that lets you visualize protein structures three-dimensionally. You can find a PyMOL search plugin on the official Suns web site, which also includes an extended tutorial for how to use the search plugin.

The search plugin lets you point and click on multiple protein fragments and then click "Search". Typically within less than a second it will stream up to 100 matches to the fragment of interest, superimposed on your original search query. You can then expand on your original query by incorporating new fragments from the search results:

An example iterative design process that begins from a guanidinium group and which grows to an N-terminal helix-capping motif. Black dashes highlight fragments incorporated from search results.

The PyMOL plugin is a thin Python-based client that connects to a remote search backend written in Haskell.


Why Haskell?

I initially did not begin this project in Haskell. I actually began it in C. The reason was that at the time I did not have a solid background in data structures and algorithms, so I would reflexively turn to C for high-performance projects, trying to compensate by improving the constant factors of my code. However, all-atom searching is the kind of problem where the algorithm matters and the bare-metal performance does not. My initial implementation in C attempted to brute force the solution to no avail, taking hours to perform a single search.

The second problem that I encountered was that when I got the first slow draft working I immediately got additional feature requests and I was not prepared to handle them. C is a very brittle language to program in because of manual memory management and no standard library of data structures. However, like many inexperienced programmers I had trouble saying "no" and bit off more than I could chew trying (and failing) to satisfy these feature requests.

At that time I had been dabbling in Haskell in my free time and took interest in the language. Things that intrigued me about the language included:

  • expressive and strong types
  • an elegant mix of imperative and functional style (using monads)
  • high-level idioms and design patterns

Plus, Haskell is garbage collected and has a rich library of data structures, which solved my two main qualms with C, so I decided to rewrite my code base in Haskell.

This rewrite was not a simple translation of the original C code. One of the main reasons that I transitioned was that I wanted to familiarize myself with improved data structures and algorithms. Haskell made it very easy to prototype new code for several reasons:

  • You have a rich set of high-performance purely functional data structures in the containers, unordered-containers, and vector libraries
  • Garbage collection, type inference, and type classes automate common menial tasks
  • Strong static types make it a breeze to refactor code without having to write tests

Consequently, Haskell gave me much greater confidence to rewrite large swaths of code because the compiler caught more of my mistakes and automated more details. This encouraged me to try out new ideas and rapidly gravitate towards a good solution instead of getting complacent with my first working solution. These changes improved performance and now queries took milliseconds instead of hours. At this point, PyMOL became the bottleneck and could not load search results fast enough, so there was no point improving performance further.

Another major advantage of Haskell for my project is parsing. Haskell libraries excel at parsing by necessity: there is a huge pressure to convert untyped textual data to more strongly typed data structures in order to take advantage of the type system. I want to give a special shoutout to attoparsec and aeson, which are fantastic for this purpose.

Another big benefit of programming in Haskell was that my code was remarkably stable and devoid of bugs. This meant that I spent less time maintaining the code and more time adding features.


Problems with Haskell

The first issue I encountered is that if you program in Haskell you need to be prepared to fill in the missing gaps in the library ecosystem. I began at a time when the Haskell ecosystem was beginning to mature, but there were still several important gaps. The foremost of these was a stream programming library, which I needed to write myself because I was unsatisfied with existing iteratee-based solutions, and this eventually evolved into my pipes stream programming library. If you are considering starting a large project in Haskell you need to be prepared to do some library work of your own.

Another issue was numerical code. This is limited almost exclusively to hmatrix (Haskell bindings to the GNU scientific library) and repa (a fast and native Haskell library, but without a standard library of numerical algorithms). hmatrix was problematic because it is GPLv3 licensed, and my university, UCSF, forbids releasing code with a GPLv3 license because of the patent clause. This meant that I was forced to write my own LAPACK bindings for performance-critical code. Haskell's foreign-function interface (FFI) is nice, but not having to use it at all is even nicer.

Next, space leaks cause problems. Space leaks are like the Haskell analog to C's segmentation faults: they require programmer discipline to prevent and they are not easy to debug. With experience I learned how to avoid them, mainly by:

  • Defining data types with strict fields
  • Ensuring all folds are strict in the accumulator

... but this is still a problem because it bites beginners to the language really early. Also, space leaks are the biggest source of program instability, which is a shame because it compromises what would be an otherwise rock-solid stability story for the language.
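
For example, here is a minimal sketch of both disciplines (the Atom type is hypothetical, not from the actual code base):

import Data.List (foldl')

-- Hypothetical record, for illustration only. The `!` annotations force
-- each coordinate when an `Atom` is constructed, so thunks cannot
-- accumulate inside the record
data Atom = Atom
    { atomX :: !Double
    , atomY :: !Double
    , atomZ :: !Double
    }

-- A strict fold: `foldl'` forces the accumulator at every step, unlike
-- `foldl`, which would build up a long chain of unevaluated additions
sumX :: [Atom] -> Double
sumX = foldl' (\acc a -> acc + atomX a) 0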

The next problem is the lack of standardized error-handling idioms. I ended up writing my own library for this, too, called the errors library, but this did not completely solve the problem either. I'm currently collaborating with John Wiegley and Michael Snoyman (of Yesod fame) to try to solve this problem.

Another problem is the pre-existing ecosystem of functionality dependent on lazy IO. I'm not going to name any names, but one library that I depend on uses lazy IO internally and it is still a source of bugs. Lazy IO causes these sort of problems because it does not guarantee ordering of effects and also results in synchronous exceptions being thrown asynchronously in pure code (which is a big no-no in the Haskell world).


Conclusions

The code for this project is open source and hosted on Github, split across three separate projects.

You can also learn more about the project on the official Suns website.

People interested in learning Haskell may find the code for the search engine useful to see what a medium-sized Haskell application looks like. Haskell gets a bad reputation for being overly academic and I hope people can see past that to discover an excellent programming language for rapid development with high reliability.

Wednesday, October 9, 2013

How to reimplement the conduit parsing API in 50 lines of pipes code

Michael's recent blog posts highlighted several deficiencies of pipes-based parsing. Particularly, he emphasized that it was incompatible with idioms from the core pipes library, and I agree with that assessment. Programming with pipes-parse is a different beast from programming with vanilla pipes, and pipes-parse idioms more closely resemble conduit idioms.

Several comments in response to Michael's post asked if one could internally implement conduit on top of pipes, in order to simplify conduit internals. This post answers half that question by showing how to implement conduit sans finalization on top of pipes using the tools from pipes-parse.

This code is short, but very dense, so I will walk through the implementation step-by-step, explaining the underlying pipes-parse idioms that I'm using to reconstruct conduit functionality. If you just want to skip to the complete code then go straight to the Appendix at the end of this post.


The Conduit Type

The way you internally represent a conduit-like parser using pipes is the following data type:

import Pipes
import Control.Monad.Trans.State.Strict

newtype ConduitM i o m r = ConduitM
    { unConduitM ::
         forall x . Producer o (StateT (Producer i m x) m) r }

To recap, a ConduitM i o m r has an input of type i and an output of type o, and the output is distinct from the return value, just like pipes.

I model this as a Producer of os that reads from and writes to a Producer of is. The Producer of is is our conduit's upstream input end. awaiting a value pops an element off of this Producer, and adding back a leftover pushes an element back onto it.

This representation differs from conduit's implementation in one important way: it makes no distinction between leftovers and future input. Both are stored together within the inner Producer. This is one neat trick of reifying future input as a Producer: you now have an obvious place to store leftovers.


Primitives

The next step is to implement await, which is just a thin wrapper around draw from pipes-parse:

type Consumer i m r = forall o . ConduitM i o m r

await :: (Monad m) => Consumer i m (Maybe i)
await = ConduitM $ lift $ liftM f draw
  where
    f (Left  _) = Nothing
    f (Right r) = Just r

However, this doesn't explain what draw is actually doing, so let's refer to its implementation:

draw = do
    p <- get                 -- Retrieve our source of input
    x <- lift (next p)       -- Attempt to pop one element off
    case x of
        Right (a, p') -> do  -- Success: bind the element
            put p'           -- Save the remainder
            return (Right a)
        Left   r      -> do  -- Failure: No more input
            put (return r)
            return (Left  r)

If you are more comfortable with StateT you might prefer the following less verbose form which inlines all the state passing:

draw = StateT $ \p -> do
    x <- next p
    return $ case x of
        Right (a, p') -> (Right a, p'      )
        Left      r   -> (Left  r, return r)

If you think of a Producer a m () as isomorphic to ListT m a, then next is the equivalent of uncons for ListT.

Similarly, we can add elements back to the producer, using leftover, which is just a thin wrapper around unDraw from pipes-parse:

leftover :: (Monad m) => i -> ConduitM i o m ()
leftover i = ConduitM $ lift $ unDraw i

unDraw has a simple implementation:

unDraw a = modify (Pipes.yield a >>)

It just prepends a yield statement onto the Producer. This is the equivalent of cons for ListT.

What about yield? Well, that's exactly the same:

yield :: (Monad m) => o -> ConduitM i o m ()
yield o = ConduitM (Pipes.yield o)

Composition

Now for the interesting part: conduit composition, which has the following type:

(=$=) :: (Monad m)
      => ConduitM a b m ()
      -> ConduitM b c m r
      -> ConduitM a c m r

If we were to replace these ConduitMs with the underlying pipe type, we would get:

(=$=) :: (Monad m)
      => forall x . Producer b (StateT (Producer a m x) m) ()
      -> forall y . Producer c (StateT (Producer b m y) m) r
      -> forall z . Producer c (StateT (Producer a m z) m) r

How do we even begin to approach that?

The key is the runStateP function from Pipes.Lift, which has the following (simplified) type:

runStateP
    :: s -> Producer a (StateT s m) r -> Producer a m (r, s)

Compare this with the type for runStateT:

runStateT :: StateT s m r -> s -> m (r, s)

runStateP differs from runStateT in two ways:

  • runStateP unwraps a StateT buried inside of a pipe

  • runStateP takes arguments in the opposite order from runStateT

runStateP takes care to thread state as it wraps the internal StateT so it behaves just like runStateT. Once you familiarize yourself with how runStateP works, the solution is a matter of type-chasing. In fact, what you will discover is that if you restrict yourself to runStateP, there is only one solution that type-checks.

We begin with the two arguments to our operator:

ConduitM pL =$= ConduitM pR = ConduitM $ ...

pL has type:

pL :: forall x . Producer b (StateT (Producer a m x) m) ()

Let's look at what type we get when we unwrap pL using runStateP:

parseL
    :: (Monad m)
    => Producer a m x -> Producer b m ((), Producer a m x)
parseL as = runStateP as pL

This now looks just like a parser combinator. It takes an input stream of values of type a and generates an output stream of values of type b, returning unused input alongside the () return value. We're not interested in this () return value, so we'll use execStateP instead:

parseL
    :: (Monad m)
    => Producer a m x -> Producer b m (Producer a m x)
parseL as = execStateP as pL
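
For reference, execStateP relates to runStateP the same way execStateT relates to runStateT. Its type, simplified the same way as runStateP above, is:

execStateP :: s -> Producer a (StateT s m) r -> Producer a m s

-- equivalently (my own note): execStateP s p = fmap snd (runStateP s p)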

Similarly, we'll convert pR to a parser:

parseR
    :: (Monad m)
    => Producer b m y -> Producer c m (r, Producer b m y)
parseR bs = runStateP bs pR

Now what's our goal? We're trying to build a ConduitM a c m r, which is equivalent to the following parser:

parse
    :: (Monad m)
    => Producer a m z -> Producer c m (r, Producer a m z)

This means that we need to introduce a stream of as:

parse as = do
    -- as :: Producer a m x

We can now feed that stream to parseL:

parse as = do
    -- as        :: Producer a m x
    -- parseL as :: Producer b m (Producer a m x)

We can then feed that to parseR. This works because parseR is universally quantified in y, so y instantiates to Producer a m x:

parse as = do
    -- as  :: Producer a m x
    -- parseL as
    --     :: Producer b m (Producer a m x)
    -- parseR (parseL as)
    --     :: Producer c m (r, Producer b m (Producer a m x))

This is almost what we want. We just need to discard the unused stream of bs:

parse as = do
    (r, pp) <- parseR (parseL as)
    p'      <- lift $ drain pp
    return (r, p')
  where
    drain p = runEffect (for p discard)

If we inline all of that logic, we get the following 5-line implementation of conduit composition:

ConduitM pL =$= ConduitM pR = ConduitM $
    stateP $ \as -> do
          (r, pp) <- runStateP (execStateP as pL) pR
          p'      <- lift $ drain pp
          return (r, p')

This gives a bird's-eye view of how conduit composition works. When we compose two conduits, we:

  • Feed the input stream of as to the upstream conduit
  • Feed that to the downstream conduit
  • Discard all leftovers from their intermediate interface

Once we have this composition operator, the right and left fusion operators are just type specializations (the same as in conduit):

type Conduit  i m o = ConduitM i  o    m ()
type Source     m o = ConduitM () o    m ()
type Sink     i m r = ConduitM i  Void m r

($=) :: (Monad m) => Source m a -> Conduit a m b -> Source m b
($=) = (=$=)

(=$) :: (Monad m) => Conduit a m b -> Sink b m c -> Sink a m c
(=$) = (=$=)

What about ($$)? That is even simpler:

empty :: (Monad m) => Producer () m r
empty = return () >~ cat  -- equivalent to "forever $ yield ()"

($$) :: (Monad m) => Source m a -> Sink a m b -> m b
ConduitM pL $$ ConduitM pR =
    evalStateT (runEffect pR) (evalStateP empty pL)

This implementation says at a high-level:

  • Feed an unused leftover stream to pL (unused because it's a Source)
  • Feed that to pR
  • There is no step 3
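
To see all of the pieces so far in action, here is a hypothetical end-to-end example. This is my own illustration, not part of the API: it assumes the Monad instance from the Appendix, and numbers, takeC, and sinkList are made-up names:

numbers :: (Monad m) => Source m Int
numbers = mapM_ yield [1..10]

-- Forward at most n values downstream
takeC :: (Monad m) => Int -> Conduit a m a
takeC 0 = return ()
takeC n = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> do
            yield a
            takeC (n - 1)

-- Collect every value into a list
sinkList :: (Monad m) => Sink a m [a]
sinkList = do
    ma <- await
    case ma of
        Nothing -> return []
        Just a  -> liftM (a :) sinkList

-- (numbers $= takeC 3) $$ sinkList  =  return [1, 2, 3]

Note that when takeC 3 finishes, the composition drains the remaining seven elements of the intermediate stream, exactly as the third composition bullet above describes.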

Identity

If that is composition, what is the identity? Why, it's just input from pipes-parse:

idP :: (Monad m) => ConduitM a a m ()
idP = ConduitM (void input)

Neat how that works out. This is equivalent in behavior to:

idP = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> do
            yield a
            idP

Connect and Resume

Last but not least we need connect and resume. Like I said before, this will ignore finalization concerns, so I will only implement a variation on ($$+) that returns a new Source, rather than a ResumableSource (which is just a Source tagged with a finalizer).

($$+)
    :: (Monad m)
    => Source m a -> Sink a m b -> m (b, Source m a)
ConduitM pL $$+ ConduitM pR = do
    (b, as) <- runEffect $ runStateP (execStateP empty pL) pR
    let as' = ConduitM $ stateP $ \p -> ((), p) <$ as
    return (b, as')

This says:

  • Feed an unused input stream to pL (it's a Source)
  • Feed that to pR
  • Discard pR's nonexistent output (it's a Sink)
  • Create a new Source that also ignores its input stream
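
For example (hypothetical again, reusing numbers and sinkList from the example above), we can consume two elements and then pick up where we left off:

drawTwo :: (Monad m) => Sink a m (Maybe a, Maybe a)
drawTwo = do
    x <- await
    y <- await
    return (x, y)

-- do ((x, y), rest) <- numbers $$+ drawTwo
--    -- x = Just 1, y = Just 2
--    rest $$ sinkList
--
-- =  return [3, 4, 5, 6, 7, 8, 9, 10]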

Conclusion

The purpose of this post is not to suggest that Michael should necessarily implement conduit in terms of pipes, especially since this does not yet contain finalization code. Rather, I wanted to demonstrate that pipes is a powerful tool that you can use to build other abstractions concisely and with less room for error.

Appendix

The minimal test implementation is about 50 lines of code, which I've included here:

{-# LANGUAGE RankNTypes #-}

import Control.Applicative ((<$))
import Control.Monad (liftM, void)
import Control.Monad.Trans.State.Strict (StateT, evalStateT)
import Data.Void (Void)
import Pipes hiding (await, yield, Consumer)
import qualified Pipes
import Pipes.Lift
import Pipes.Parse

newtype ConduitM i o m r = ConduitM
    { unConduitM :: forall x .
        Producer o (StateT (Producer i m x) m) r }

instance (Monad m) => Monad (ConduitM i o m) where
    return r = ConduitM (return r)
    m >>= f  = ConduitM $ unConduitM m >>= \r -> unConduitM (f r)

instance MonadTrans (ConduitM i o) where
    lift m = ConduitM (lift (lift m))

type Consumer i m r = forall o . ConduitM i  o    m r
type Source     m o =            ConduitM () o    m ()
type Sink     i m r =            ConduitM i  Void m r
type Conduit  i m o =            ConduitM i  o    m ()

await :: (Monad m) => Consumer i m (Maybe i)
await = ConduitM $ lift $ liftM f draw
  where
    f (Left  _) = Nothing
    f (Right r) = Just r

yield :: (Monad m) => o -> ConduitM i o m ()
yield o = ConduitM (Pipes.yield o)

leftover :: (Monad m) => i -> ConduitM i o m ()
leftover i = ConduitM $ lift $ unDraw i

(=$=)
    :: (Monad m)
    => Conduit a m b
    -> ConduitM b c m r
    -> ConduitM a c m r
ConduitM pL =$= ConduitM pR = ConduitM $
    stateP $ \p -> do
          (r, pp) <- runStateP (execStateP p pL) pR
          p'      <- lift $ drain pp
          return (r, p')

drain :: (Monad m) => Producer a m r -> m r
drain p = runEffect (for p discard)

($=) :: (Monad m) => Source m a -> Conduit a m b -> Source m b
($=) = (=$=)

(=$) :: (Monad m) => Conduit a m b -> Sink b m c -> Sink a m c
(=$) = (=$=)

empty :: (Monad m) => Producer () m r
empty = return () >~ cat

($$) :: (Monad m) => Source m a -> Sink a m b -> m b
ConduitM pL $$ ConduitM pR =
    evalStateT (runEffect pR) (evalStateP empty pL)

idP :: (Monad m) => ConduitM a a m ()
idP = ConduitM (void input)

($$+)
    :: (Monad m)
    => Source m a -> Sink a m b -> m (b, Source m a)
ConduitM pL $$+ ConduitM pR = do
    (b, pa) <- runEffect $ runStateP (execStateP empty pL) pR
    let p' = ConduitM $ stateP $ \p -> ((), p) <$ pa
    return (b, p')

Sunday, October 6, 2013

Manual proofs for the `pipes` laws

Out of all of Haskell's streaming libraries, pipes is the only one that does not have a test suite. This is because I prefer to use equational reasoning to prove the correctness of pipes. For those new to Haskell, equational reasoning is the use of equations to prove code correct without ever running it. Haskell makes this possible by strictly enforcing purity.

Equational reasoning works well for a library like pipes because most properties of interest can easily be specified as category theory laws. If you are unfamiliar with category theory, you can read this short primer to learn how it relates to programming.

Every primitive in the pipes library is related to a category in some way:
  • yield and (~>) are the identity and composition operator of a category
  • await and (>~) are the identity and composition operator of a category
  • cat and (>->) are the identity and composition operator of a category
This means that we can state the expected behavior of all six primitives just by stating their category laws:
-- Left Identity
yield ~> f = f

-- Right Identity
f ~> yield = f

-- Associativity
(f ~> g) ~> h = f ~> (g ~> h)


-- Left Identity
await >~ f = f

-- Right Identity
f >~ await = f

-- Associativity
(f >~ g) >~ h = f >~ (g >~ h)


-- Left Identity
cat >-> f = f

-- Right Identity
f >-> cat = f

-- Associativity
(f >-> g) >-> h = f >-> (g >-> h)
Category theory laws are like tests: they are properties that we expect our code to obey if we wrote it correctly. These laws have the nice property that they uncannily summarize many useful properties we would like to know in a remarkably small set of equations. For example, (~>) is defined in terms of for like this:
(f ~> g) x = for (f x) g
When you restate the category laws for yield and (~>) in terms of for, you get the "for loop laws":
-- Looping over a single yield simplifies to function
-- application

for (yield x) $ \e -> do
    f e

= f x


-- Re-yielding every element of a generator returns the
-- original generator

for gen $ \e -> do
    yield e

= gen


-- Nested for loops can become sequential for loops if the
-- inner loop body ignores the outer loop variable

for gen0 $ \i -> do
    for (f i) $ \j -> do
        g j

= for gen1 $ \j -> do
    g j
  where
    gen1 = for gen0 $ \i -> do
        f i
These are common sense laws that we would expect any language with generators and for loops to obey. Amazingly, the for loop laws emerge naturally from category laws, almost as if they were handed down to us on stone tablets.
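
To make the first law concrete, here are two effects that the law equates; both print 1 exactly once (this snippet is my own illustration, not part of the proofs):

import Pipes

loopy, direct :: Effect IO ()
loopy  = for (yield (1 :: Int)) (lift . print)
direct = lift (print (1 :: Int))

main :: IO ()
main = do
    runEffect loopy   -- prints 1
    runEffect direct  -- prints 1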

For a while I've been writing these proofs out by hand informally in notebooks, but now that I'm writing up my thesis I needed to organize my notes and fill in all the little steps that I would previously skip over. You can find the full set of my notes, containing all my manual proofs of the pipes laws, as an organized markdown file that I've committed to the pipes repository.


Caveats


There are several caveats that I need to state about these proofs:

First, there is the obvious caveat that these proofs are still not machine-checked, but that's the next step. Paolo Capriotti has been patiently teaching me how to encode my reasoning into Agda. However, I believe this should be possible and that the proofs are very close to sound, taking into account the caveats below.

Second, these proofs do not account for bottom (i.e. _|_). My intuition is that the pipes laws do hold even in the presence of bottom, but I am still new to reasoning about bottom, so I omitted that for now. If anybody can point me to a good reference on this I would be grateful.

Third, these proofs use non-standard coinductive reasoning. To explain why, it's important to note that typical coinductive reasoning in Haskell requires the result to be productive and guards coinduction behind the production of a constructor. However, not all pipelines are productive, as the following examples illustrate:
infinite = forever $ yield ()

infinite >-> forever await = _|_

for infinite discard = _|_
So instead of using productivity to impose a total ordering on coinductive reasoning, I use "consumption" to order my coinductive proofs, meaning that all uses of coinduction are guarded behind the consumption of a constructor. In other words, my proofs do not guarantee that pipe composition makes progress in production, but they do guarantee that pipe composition makes progress in consumption.

Fourth, another departure from typical coinductive proofs is notation. Coinductive reasoning uses bisimilarity to prove "equality". I also use bisimilarity, but notationally I just write it down as a single chain of equations that is bisected by a reference to coinduction. The two halves of the equation chain, above and below the bisection, constitute the two bisimilar halves.

Fifth, proofs notationally discharge other proofs using headers as the namespace. If you see a reference like [Kleisli Category - Left Identity Law - Pointful], that means you should look under top-level header Kleisli Category for sub-header Left Identity Law and sub-sub-header Pointful.

Sixth, I've only proven the laws for the bidirectional API in the Pipes.Core module. However, the category laws for the unidirectional API are very easy to derive from the laws for the bidirectional API, so I leave those as an exercise until I have time to add them myself.


Conclusions


My goal is to eventually machine-check these proofs so that pipes will be a practical and instructive example of a library with a statically verified kernel. Moreover, I hope that these proofs will allow other people to equationally reason about even more sophisticated systems built on top of pipes and prove higher-level properties by discharging the proofs that I have assembled for them.

Friday, September 20, 2013

Perfect streaming using `pipes-bytestring`

pipes-bytestring-1.0.0 is complete, providing pipes utilities for reading and writing effectful streams of ByteStrings. Most people who have been waiting on this library were mainly interested in the following four pipes:

stdin :: MonadIO m => Producer' ByteString m ()

stdout :: MonadIO m => Consumer' ByteString m ()

fromHandle :: MonadIO m => Handle -> Producer' ByteString m ()

toHandle :: MonadIO m => Handle -> Consumer' ByteString m r

However, I delayed releasing pipes-bytestring for so long because there was a separate problem I wanted to solve first, which I like to call the Perfect Streaming Problem.

The Problem

Here is the statement of the Perfect Streaming Problem:

How do you partition an effectful stream into logical units while strictly enforcing an upper bound on memory use?

This problem will resonate with anybody who has ever built a streaming web service: how do you group your stream of bytes into logical units like files or messages? If you naively load each file or message into memory then a malicious input that is one extremely long file or message will take down your server by exhausting all available memory. Most web services work around this by imposing artificial limits on file or message length, or by writing brittle and non-compositional code or protocols that are difficult to understand, debug, and maintain.

However, we don't have to invoke web services to illustrate this problem. Just consider this simpler problem: how do you implement the Unix head utility in Haskell so that it runs in constant space on any input, including infinite inputs without any line boundaries like /dev/zero:

$ head < /dev/zero > /dev/null  # This must run in constant space

Let's keep it simple and assume that our head utility only needs to forward exactly 10 lines from stdin to stdout.

Before pipes-bytestring, the only way to do this was using lazy IO, like this:

takeLines :: Int -> String -> String
takeLines n = unlines . take n . lines

main = do
    str <- getContents
    putStr (takeLines 10 str)

However, pipes and other streaming libraries were built to replace lazy IO so that Haskell programmers could easily reason about the order of effects and decouple them from Haskell's evaluation order. Yet, this simple head challenge remains a perennial problem for streaming libraries. To illustrate why, let's consider how conduit does this, using Data.Conduit.Text.lines:

lines :: Monad m => Conduit Text m Text

This conduit receives a stream of Text chunks of arbitrary size and outputs Text chunks one line long. This sounds reasonable until you feed your program /dev/zero. The lines conduit would then attempt to load an infinitely large chunk of zeroes as the first output and exhaust all memory.

Michael Snoyman already realized this problem and added the linesBounded conduit as a temporary workaround:

linesBounded :: MonadThrow m => Int -> Conduit Text m Text

linesBounded throws an exception when given a line that exceeds the specified length. This is the same workaround that web services use: artificially limit your input stream. I wasn't satisfied with this solution, though. How can I claim that pipes is a mature replacement for lazy IO if lazy IO soundly beats pipes on something as simple as head?

The Solution

I will introduce the solution by beginning from the final code:

Note: There was a last-minute bug that I introduced in the lines function before release. Use pipes-bytestring-1.0.1, which fixes this bug, to run this example.
-- head.hs

import Pipes
import Pipes.ByteString
import Pipes.Parse (takeFree)
import Prelude hiding (lines, unlines)

takeLines
    :: (Monad m)
    => Int
    -> Producer ByteString m () -> Producer ByteString m ()
takeLines n = unlines . takeFree n . lines

main = runEffect $ takeLines 10 stdin >-> stdout

Compile and run this to verify that it takes the first ten lines of input for any file:

$ ./head < head.hs
-- head.hs

import Pipes
import Pipes.ByteString
import Pipes.Parse (takeFree)
import Prelude hiding (lines, unlines)

takeLines
    :: (Monad m)
    => Int

... while still handling infinitely long lines in constant space:

$ ./head < /dev/zero >/dev/null  # Runs forever in constant space

To see why this works, first take a look at the type of Pipes.ByteString.lines:

lines
    :: (Monad m)
    => Producer ByteString m ()
    -> FreeT (Producer ByteString m) m ()

Now compare that to the type of Data.ByteString.Lazy.Char8.lines:

lines :: ByteString -> [ByteString]

pipes treats a Producer of ByteStrings as the effectful analog of a lazy ByteString:

-- '~' means "is analogous to"
Producer ByteString m ()  ~  ByteString

Similarly, pipes also treats a FreeT of Producers as the effectful analog of a list of lazy ByteStrings:

FreeT (Producer ByteString m) m ()  ~  [ByteString]

You can think of FreeT as a "linked list" of zero or more Producers, where each Producer's return value contains either the next Producer or the final return value (() in this case). So if a Producer is analogous to a lazy ByteString then a FreeT-based "linked list" of Producers is analogous to a true linked list of lazy ByteStrings.
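
For reference, this is the shape of FreeT, as defined in Control.Monad.Trans.Free from the free package:

newtype FreeT f m r = FreeT { runFreeT :: m (FreeF f r (FreeT f m r)) }

data FreeF f r x = Pure r | Free (f x)

Running one layer of effects in m yields either the final return value (Pure) or one more functor layer wrapping the rest of the "list" (Free).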

Each layer of our FreeT is a Producer that streams one line's worth of chunks. This differs from a single chunk one line long because it's still in Producer form so we haven't actually read anything from the file yet. Also, FreeT is smart and statically enforces that we cannot read lines from the next Producer (i.e. the next line) until we finish the first line.

FreeT has a very important property which other solutions do not have: we can use FreeT to sub-divide the Producer into logical units while still keeping the original chunking scheme. Once we have these nice logical boundaries we can manipulate the FreeT using high-level list-like functions such as Pipes.Parse.takeFree:

takeFree
    :: (Functor f, Monad m)
    => Int -> FreeT f m () -> FreeT f m ()

takeFree is the FreeT analog of Prelude.take. We keep the first n f layers of the FreeT and discard the rest. In this case our f is (Producer ByteString m), so if each Producer represents one line then we can take a fixed number of lines.

This works because FreeT (Producer ByteString m) m () is just an ordinary Haskell data type. This data type doesn't contain any actual chunks. Instead, it is just a description of how we might traverse our input stream. When we call takeFree on it we are just saying: "Now that I think about it, I don't intend to traverse as much of the input as I had originally planned".
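
Given the FreeT definition above, here is a sketch of how takeFree can be implemented (this mirrors the idea, though not necessarily the exact pipes-parse source):

takeFree
    :: (Functor f, Monad m)
    => Int -> FreeT f m () -> FreeT f m ()
takeFree n ft
    | n <= 0    = return ()          -- Drop all remaining layers
    | otherwise = FreeT $ do
        x <- runFreeT ft             -- Run the effects for one layer
        case x of
            Pure () -> return (Pure ())
            Free w  -> return (Free (fmap (takeFree (n - 1)) w))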

unlines completes the analogy, collapsing our logical units back down into an unannotated stream of bytes:

unlines
    :: (Monad m)
    => FreeT (Producer ByteString m) m r
    -> Producer ByteString m r

We can then combine lines, takeFree and unlines to mirror the lazy IO approach:

-- Lazy IO:
takeLines n = unlines . take n . lines

-- pipes:
takeLines n = unlines . takeFree n . lines

The main difference is that the intermediate types are larger because we're moving more information into the type system and, more importantly, moving that same information out of implicit magic.

Conclusions

The next pipes library in development is pipes-text. The main development bottleneck (besides my own graduation) is that the text library does not export functions for partially decoding incomplete ByteString fragments, consuming as much input as possible without throwing errors. If somebody were to write that up it would speed up the release of pipes-text significantly.

As always, you can follow or contribute to pipes development or just ask questions by joining the haskell-pipes mailing list.

Saturday, September 7, 2013

pipes-4.0: Simpler types and API

I'm announcing pipes-4.0 which greatly simplifies the types and API of the pipes ecosystem. For people new to pipes, pipes is a compositional streaming library that decouples streaming programs into simpler, reusable components.

The purpose behind pipes is to simplify effectful stream programming for both library authors and application writers. Library authors can package streaming components into a highly reusable interface, and application writers can easily connect together streaming components to build correct, efficient, and low-memory streaming programs. For example, the following program connects three reusable components to stream lines from standard input to standard output until the user types "quit":

import Pipes
import qualified Pipes.Prelude as P

main = runEffect $
    P.stdinLn >-> P.takeWhile (/= "quit") >-> P.stdoutLn

pipes distinguishes itself from other stream programming libraries in three main ways:

  • Insistence on elegance, symmetry, and theoretical purity
  • Careful attention to correctness, documentation, and detail
  • Emphasis on a high power-to-weight ratio



Credits

This release was made possible due to the suggestions and contributions of many people, and I want to give special mention to several of them:

  • Renzo Carbonara, who is the largest contributor of downstream libraries, building and maintaining pipes-network, pipes-network-tls, pipes-zlib, pipes-binary, pipes-attoparsec and pipes-aeson. He also provided lots of useful feedback on design proposals because of his experience maintaining these libraries.

  • Ben Gamari, who contributed pipes-vector and pipes-interleave.

  • Oliver Charles, who contributed to the design of the new pipes-parse library in the process of developing the pipes-tar library.

  • Csernik Flaviu Andrei, who contributed the complete benchmark suite for pipes.

  • Merijn Verstraaten, who contributed the new mtl instances for pipes.

  • Gergely Risko, who fixed a concurrency bug in pipes-concurrency.

  • Mihaly Barasz, who contributed the complete test suite for pipes-concurrency.

  • Tony Day, who helped automate the pipes ecosystem and contributed lots of useful feedback on documentation.

  • Aleksey Khudyakov first proposed the idea to remove the old proxy transformer system and outsource the same functionality to monad transformers in the base monad. This change alone accounted for an almost 60% reduction in the library size and the greatest simplification of the types.

  • Johan Tibell proposed the initial idea to provide a simpler unidirectional subset of the API by default. This removed the warts that bidirectionality introduced.

  • Florian Hofmann, whose work on pipes-eep led to the discovery of an Arrow instance for push-based pipes.

  • Aristid Breitkreuz and Pierre Radermecker, who both caught important bugs in pipes-parse and pipes-safe.

  • Oliver Batchelor, whose work on integrating pipes with Cloud Haskell improved the design of pipes-safe.

I would also like to thank everybody who provided feedback on the library and its documentation or contributed code.


Change Log

People familiar with pipes will notice that the biggest change to the library is the elimination of the proxy transformer system. This was made possible by an insight of Aleksey Khudyakov that the proxy transformers were isomorphic to monad transformers in the base monad, if you ignored their ability to be unwrapped before the Proxy layer. I later discovered how to unwrap these base monad transformers while preserving the Proxy layer, which made possible the complete elimination of the proxy transformer system.

This had the largest impact on simplifying the API:

  • The number of exported functions dropped to approximately 1/3 of the original size (from about 300+ to 100+)

  • The number of modules dropped to 1/3 of the original size (from 18 to 6)

  • The p type parameter in type signatures disappeared, along with the Proxy type class, which became the concrete Proxy type (which was the old ProxyFast implementation).

  • No need for runIdentityP any longer

The next most important change was a simplification of the API to a unidirectional subset which is the new default. This fixed several warts of the previous API:

  • No more gratuitous () parameters

  • The pipe monad and category now overlap

  • Polymorphic type synonyms can now be used to simplify the types

The original bidirectional functionality still remains intact within the Pipes.Core module. The only difference is that it is not exported by default.

The next important change was the realization that bind in the respond Category (i.e. (//>)) was exactly equivalent to a for loop, so the unidirectional API now uses for as a synonym for (//>) and produces really beautiful for loops.
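
Concretely, the synonym is:

for :: (Monad m)
    => Proxy x' x b' b m a'
    -> (b -> Proxy x' x c' c m b')
    -> Proxy x' x c' c m a'
for = (//>)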

Other important syntactic changes:

  • The unidirectional API uses yield instead of respond like it was back in pipes-1.0

  • The unidirectional API uses await instead of request like it was back in pipes-1.0

  • runProxy is now runEffect

  • (>->) is now the unidirectional pull-based composition operator, instead of bidirectional composition (see the specialized type below)
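
For reference, here is the new (>->) specialized to Pipes (the actual type is more polymorphic):

(>->) :: (Monad m)
      => Pipe a b m r
      -> Pipe b c m r
      -> Pipe a c m r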

Pipes.Prelude has also changed to remove the suffixes from all utilities, but it is no longer re-exported from the main Pipes module.

The downstream libraries have been updated as well to use the pipes-4.0 API and several of these now have much simpler APIs, too, particularly pipes-safe. I will discuss these libraries in separate library-specific posts later on.


Future Goals

This release is intended to be the last major version bump. The next development priorities are:

  • Stabilize the core pipes library

  • Improve distribution by packaging up pipes for several package managers

  • Continue to build out the pipes ecosystem, particularly dedicated Text and ByteString libraries

The long-term goal is to get pipes into the Haskell platform once the API has proven itself stable and the ecosystem matures.

People interested in learning more about pipes or contributing to development can join the official mailing list.