Sunday, December 30, 2012

The Continuation Monad

The continuation monad is one of the least appreciated monads and in this post I hope to motivate when to use it. This post will first motivate continuations in general and then motivate them in their specific capacity as monads.


Continuations


A Haskell continuation has the following type:
newtype Cont r a = Cont { runCont :: (a -> r) -> r }
A continuation takes a function of type (a -> r) and generates an r, where r can sometimes be a fixed value like Int or IO ().

For example, I might write a long-running process that spawns an action every time the user enters a line of input:
onInput :: (String -> IO ()) -> IO ()
        -- i.e. Cont (IO ()) String
onInput f = forever $ do
    str <- getLine
    f str
You will recognize this idiom if you've ever used frameworks with callbacks. We supply the framework with a function (i.e. a continuation) and the framework uses that function to do its job.


"Complete me Later"


You generally use continuations when you are programming something, but you want somebody else to complete it. Common reasons include:
  • You are programming a framework with callbacks that users supply
  • You are defining a custom map engine for game players to program
  • You are lazy
I'll use the following hypothetical code segment as an example:
unitAttack :: Target -> IO ()
unitAttack target = do
    swingAxeBack 60
    valid <- isTargetValid target
    if valid
    then ??? target
    else sayUhOh
Let's imagine you have to package up and compile this code for somebody else (say, a fellow colleague) to use later, but it won't compile yet because you still have the unspecified ??? function. What do you do?

Like all good programming, the best solution is the laziest one. We punt and take the incomplete behavior as a parameter so that whoever finishes the function later on can complete the function by passing the specified behavior in:
unitAttack :: Target -> (Target -> IO ()) -> IO ()
unitAttack target todo = do
    swingAxeBack 60
    valid <- isTargetValid target
    if valid
    then todo target
    else sayUhOh
Problem solved! Notice how the right hand side of the type signature resembles the shape of our Cont type. If we just add a newtype, we can wrap it in Cont ourselves:
unitAttack :: Target -> Cont (IO ()) Target
unitAttack target = Cont $ \todo -> do
    swingAxeBack 60
    valid <- isTargetValid target
    if valid
    then todo target
    else sayUhOh
... or, even better, we can use ContT instead. The benefit of ContT is that it is also a monad transformer, which comes in handy. ContT has the exact same Monad instance as Cont, so they are otherwise interchangeable:
newtype ContT r m a = ContT { runContT :: (a -> m r) -> m r }

unitAttack :: Target -> ContT () IO Target
unitAttack target = ContT $ \todo -> do
    swingAxeBack 60
    valid <- isTargetValid target
    if valid
    then todo target
    else sayUhOh
This is great because now somebody else can "continue" where we left off (thus the name: continuations). They would just define the missing function:
damageTarget :: Target -> IO ()
... and then supply it to our continuation to complete it:
runContT (unitAttack target) damageTarget :: IO ()

Variable Arguments


Our strategy works well if we have exactly one hole in our function, but what if we have two holes in our function, each of which takes a different argument?
unitAttack :: Target -> IO ()
unitAttack target = do
    ???_1 60
    valid <- isTargetValid target
    if valid
    then ???_2 target
    else sayUhOh
Well, we might try to accept two continuations:
unitAttack
 :: Target -> (Int -> IO ()) -> (Target -> IO ()) -> IO ()
unitAttack target todo1 todo2 = do
    todo1 60
    valid <- isTargetValid target
    if valid
    then todo2 target
    else sayUhOh
... but that no longer cleanly fits into our Cont type, which expects exactly one continuation.

Fortunately, there is a clean and general solution. Just define a data type that wraps both possible arguments in a sum type, and just define a single continuation that accepts this sum type:
data Hole = Swing Int | Attack Target

unitAttack :: Target -> ContT () IO Hole
unitAttack target = ContT $ \k -> do
    k (Swing 60)
    valid <- isTargetValid target
    if valid
    then k (Attack target)
    else sayUhOh
Each constructor acts as a place-holder that signals to the continuation which hole it is currently filling. Then somebody else can continue where we left off and just write:
damage    :: Target -> IO ()
swingBack :: Int -> IO ()

continue :: Hole -> IO ()
continue (Swing  n) = swingBack n
continue (Attack t) = damage t

runCont (unitAttack target) continue :: IO ()
This trick generalizes to n holes with variable arguments per hole. Just define a type with n constructors, one for each hole, where each constructor stores whatever arguments that particular continuation will need:
data Hole = Hole1 Arg1 Arg2 | Hole2 | Hole3 Arg3 | Hole4

Algebraic Data Types


I want to digress for a moment to talk about algebraic data types. If you are not interested, skip to the next section.

It turns out we can elegantly derive the above trick for multiple holes. Type algebra says that if we squint then we can translate the following type constructors to algebraic operators and derive equivalent types from simple algebraic manipulations:
Either a b  <=>  a + b
(a, b)      <=>  a * b
a -> b      <=>  b ^ a
That means that if we have a function with two continuations:
(a1 -> r) -> ((a2 -> r) -> r)
... we just translate it to the equivalent algebraic expression:
(r ^ (r ^ a2)) ^ (r ^ a1)
... and then we can derive equivalent representations just by using the rules of algebra:
  (r ^ (r ^ a2)) ^ (r ^ a1)
= r ^ ((r ^ a2) * (r ^ a1))
= r ^ (r ^ (a2 + a1))
... then if we translate that back to the equivalent type, we get:
(Either a2 a1 -> r) -> r
... which is exactly the trick described in the previous section.

Similarly, if we have more than one argument to a continuation:
(a -> b -> r) -> r
... we can find an equivalent single-argument form using type algebra:
  r ^ ((r ^ a) ^ b)
= r ^ (r ^ (a * b))
... which transforms back to:
((a, b) -> r) -> r
So type algebra tells us the obvious: uncurry the continuation if it needs a single argument.


The Continuation Monad


So far that explains what continuations are useful for, but it does not explain what the continuation Monad is useful for.

I firmly believe that the way to a Monads heart is through its Kleisli arrows, and if you want to study a Monads "purpose" or "motivation" you study what its Kleisli arrows do.

So rather than study the Monad instance for Cont, let's instead just study the shape of the Cont Kleisli arrow and infer what it does from its type alone:
  a -> Cont r b
~ a -> (b -> r) -> r    -- Expand the definition of Cont
~ (b -> r) -> (a -> r)  -- Flip the arguments
In other words, we take a function that handles bs and transform it into a function that handles as.

This suggests a basic starting intuition for the continuation monad: we transform handlers.

Let's build on that intuition by revisiting our previous example:
unitAttack :: Target -> ContT () IO Target
unitAttack target = ContT $ \todo -> do
    swingBack 60
    valid <- isTargetValid target
    if valid
    then todo target
    else sayUhOh
We need to supply a completion function of type:
handler :: Target -> IO ()
We could complete this function ... or we could half-ass it and leave our work incomplete:
halfAssedCompletion :: Target -> IO ()
halfAssedCompletion target = do
    registerUnitBeingAttacked
    playDamageSound
    ??? 40  -- So close...
This means we essentially created a new continuation with a slightly smaller hole:
halfAssedCompletion :: Target -> ContT () IO Int
halfAssedCompletion target = ContT $ \todo -> do
    registerUnitBeingAttacked
    playDamageSound
    todo 40
This is a Kleisli arrow! That means we can compose it with our previous Kleisli arrow:
unitAttack >=> halfAssedCompletion :: Target -> ContT () IO Int
This composition substitutes in halfAssedCompletion for each hole we left in the unitAttack function. However, halfAssedCompletion left smaller Int holes of its own that somebody else now has to finish up.

Notice how now we originally needed a handler of type:
handler :: Target -> IO ()
... but now we only need a smaller handler of type:
newHandler :: Int -> IO ()
... in other words, halfAssedCompletion acts as an intermediary that transforms handlers of type (Int -> IO ()) into handlers of type (Target -> IO ()).

The Cont monad is all about chaining these kinds of partial completions together until all the holes are finally filled. You could use this abstraction to complete a project in stages and seamlessly hand off work from person to person whenever circumstances require a change in maintainer before completing the project. Alternative, you can use this to condense the callback API of a framework into a single point of entry.


The Kleisli Category


Earlier I said that the key to a monad is its Kleisli arrows. The reason why is that Kleisli arrows are morphisms in the Kleisli category, where (>=>) is Kleisli arrow composition:
(>=>) :: (Monad m) => (a -> m b) -> (b -> m c) -> (a -> m c)
(f >=> g) x = f x >>= g
.. and return is the identity:
return :: (Monad m) => a -> m a
Like all categories, the Kleisli category must obey the category laws:
return >=> f = f                   -- Left identity

f >=> return = f                   -- Right identity

(f >=> g) >=> h = f >=> (g >=> h)  -- Associativity
Things that obey these laws have nice properties. For example, it guarantees that you can reason about each Kleisli arrow in a composition chain in isolation. Each Kleisli arrow's behavior is completely determined by its input (i.e. domain) and output (i.e. codomain). So let's think about how that modularity translates to the Cont Kleisli category.

When you switch maintainers, you don't have to give the next maintainer a bunch of holes sprawled out over a large code base like this:
largeProgram = do
    ...
    x <- ???_1 y
    ...
    ???_2 horseTheyRodeInOn
    ...
    spawn ???_29 foo
Instead you can unify all the holes using a single callback that accepts a single type (the "codomain") unifying all the holes you left:
largeProgram :: () -> ContT () IO Hole
largeProgram () = ContT $ \k -> do
    ...
    x <- k (Hole1 y)
    ...
    k Hole2
    ...
    k (Hole29 spawn foo)
This give the next person a single point of entry to continue from, because now they only have to write a Kleisli arrow that handles a single Hole input which encompasses all the previous holes:
nextContribution :: Hole -> ContT () IO NextHole
nextContribution currHole = ConT $ \nextHole -> case currHole of
    Hole1 y -> ... -- Fill first hole
    Hole2   -> ... -- Fill second hole
    ...
    Hole29 spawn foo -> ... -- File 29th hole
Then you just use Kleisli composition to connect your code contribution:
largeProgram >=> nextContribution
This cleanly modularizes the first person's contribution so that you can hermetically seal it off from subsequent contributions. By repeating this process, each subsequent contribution to the code base becomes its own modular and composable Kleisli arrow, cleanly separated from other contributions:
alice'sWork :: a -> ContT r m b 
bob'sWork   :: b -> ContT r m c
carlo'sWork :: c -> ContT r m d 

engine = alice'sWork >=> bob'sWork >=> carlo'sWork
 :: a -> ContT r m d

customMap :: d -> ContT r m e

completeGame = engine >=> customMap
 :: a -> ContT r m e
This is why frameworks and game custom map makers all use continuations to delimit the interface between the company's code and the user's code. The continuation monad is all about establishing air-tight code boundaries, both internally within a project, and externally for user facing APIs. This lets you isolate responsibilities if you can separate each code contribution into its own Kleisli arrow.


Callback Hell


Frameworks are the canonical example of separating responsibilities, where the framework writer provides some code, but the user is expected to fill in the gap with callbacks of their own. This often results in callback hell in frameworks that take this principle to the extreme, like Node.js.

But it doesn't have to be that way. The continuation monad teaches us that we can always condense a sprawling API filled with callbacks into a single callback that takes a single argument. Even better, we get monadic syntactic sugar for composing multiple layers of callbacks.

I'll use the GLUT package as an example, which requires several callbacks like:
type ReshapeCallback = Size -> IO ()

type VisibilityCallback = Visibility -> IO ()

type WindowStateCallback = WindowState -> IO ()

type CloseCallback = IO ()

-- there are more, but I'll stop here
Instead, we can wrap GLUT's multiple callbacks into a uniform ContT API:
glut :: () -> ContT () IO Hole

data Hole
   = Reshape Size
   | Visible Visibility
   | Window WindowState
   | Close
   ...
Now the end user has a single entry point to the GLUT monad, so they can now complete the framework in a single function:
userCallbacks :: Hole -> ContT () IO a
userCallbacks hole = ContT $ \_ -> case hole of
    Reshape size -> ... -- Handle reshapes
    Visibility v -> ... -- Handle visibility switches
    Window ws    -> ... -- Handle changes to window state
    Close        -> ... -- Handle window closing
    ...
Moreover, they can now just compose their code with the glut framework:
glut >=> userCallbacks :: () -> ContT () IO a

The Buck Stops Here


How do we know when we are done and there are no continuations left? Well, let's see what type the compiler infers if we have no more holes and never use the continuation:
>>> let done = ContT $ \_ -> return ()
>>> :t done
done :: Monad m => ContT () m a
It says the return type is polymorphic, meaning that there is no hole left to fill. The above function just inserts return () in all holes and calls it a day. We can even prove a chain of continuations is done if its final return value type-checks as Void, the empty type:
absurd :: Void -> a  -- from the "void" package

run :: (Monad m) => ContT r m Void -> m r
run c = runContT c absurd
run only accepts completed programs that have no holes left. We can use run for our previous GLUT example, since the final user callback handler leaves no unfinished holes:
run ((glut >=> userCallbacks) ()) :: IO ()

Conclusion


I hope this post inspires people to use the continuation monad to structure and modularize code completion boundaries. The continuation monad naturally arises at the boundaries between programmers and cleanly abstracts away callback hell into a simple and uniform interface with a single entry point.

Wednesday, December 12, 2012

pipes-3.0 - A simpler, unified API

Introduction


I'm releasing pipes-3.0, which significantly simplifies the entire library. This release began as the misnamed 2.6 branch of my Github repository, but then I finally cleanly solved the polymorphic constraints issue and this solution unlocked several features that I could finally implement.

The large change log includes:
  • Type-classing the entire Proxy API
  • Offering both fast and correct base implementations
  • Fixing type synonyms
  • Unifying the Pipes and Proxy APIs and dropping Frames
  • Adding the PFunctor type class (functors over proxies)
  • Performance improvements
  • Expanded laws and guarantees
  • Smaller dependencies
As always, if you want to learn how to use pipes, just consult the tutorial.


Code


I can demonstrate a lot of the new improvements just by taking the take' function from the pipes-2.4 announcement post and showing the difference before and after the new changes.

Here is the previous version:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

import Control.Monad.Trans.Class (MonadTrans(lift))
import Control.Proxy
import Control.Proxy.Trans.Either as E
import Control.Proxy.Trans.State as S
import Data.Monoid ((<>))
import Data.Text as T

newtype ParseP p a' a b' b m r = ParseP {
    unParseP :: StateP Text (EitherP Text p) a' a b' b m r }
    deriving (Monad, MonadTrans, Channel)

take' :: (Monad m, Monad (p () Text b' b m), Interact p,
          Channel p)
      => Int -> () -> ParseP p () Text b' b m Text
take' n () = ParseP go where
    go = do
        s <- S.get
        if (T.length s < n)
        then do
            s' <- liftP $ liftP $ request ()
            S.put (s <> s')
            go
        else do
            let (h, t) = T.splitAt n s
            S.put t
            return h
Here is the new version:
{-# LANGUAGE GeneralizedNewtypeDeriving #-}

-- Smaller import list
import Control.Proxy
import Control.Proxy.Trans.Either as E
import Control.Proxy.Trans.State as S
import Data.Monoid ((<>))
import Data.Text as T

newtype ParseP p a' a b' b m r = ParseP {
    unParseP :: StateP Text (EitherP Text p) a' a b' b m r }
    deriving (Monad, MonadTrans, Proxy)

instance ProxyTrans ParseP where
    liftP = liftP . liftP

--                 +-- Cleaner constraints
--                 |
--                 v
take' :: (Monad m, Proxy p)
      => Int -> () -> Consumer (ParseP p) Text m Text
--                    ^         ^
--                    |         |
--                    +-- Type synonyms work
--                        with extensions!

take' n () = ParseP go where
    go = do
        s <- S.get
        if (T.length s < n)
        then do
            s' <- request () -- No more liftP!
            S.put (s <> s')
            go
        else do
            let (h, t) = T.splitAt n s
            S.put t
            return h
You can already see several differences:
  • Type synonyms work with everything, so you can always use them now.
  • Constraints are MUCH simpler and significantly more polymorphic
  • Proxy transformers can now request and respond natively
  • Control.Proxy now imports useful things from other libraries (like lift).
However, this release includes several more great changes as well.


Polymorphic Constraints


The big issue that held back the library in the wake of the 2.4 release was that I could not type class the request and respond operations. The pipes library really only has three fundamental operations:
  • (>->)
  • request
  • respond
Unfortunately, I could not type class request and respond to work with proxy transformers, meaning any standard libraries would be substantially crippled.

The problem boiled down to my inability to write polymorphic type class contexts like this:
instance  -- Not valid Haskell
    (forall a' a b' b m .
        (Monad m) => Monad (p a' a b' b m), Proxy p)
    => Proxy (EitherP e p) where
    ...
The solution (in hindsight) is simple. You define a higher-kinded type class and you copy the lower-kinded class's functions:
class MonadP p where
    (?>=) :: (Monad m)
          => p a' a b' b m r1
          -> (r1 -> p a' a b' b m r2)
          -> p a' a b' b m r2
    return_P :: (Monad m) => r -> p a 'a b' b m r

-- In practice, 'MonadP' is part of 'Proxy'
Then you can write Haskell98 contexts like:
instance (MonadP p, Proxy p) => Proxy (EitherP e p) where ...
... but at the same time you can still keep the original lower-kinded type class for the user API:
instance (MonadP p, Monad m)
    => Monad (EitherP e p a' a b' b m) where ...
This method has one drawback, which is that sometimes you need to embed the code in a newtype if all the type variables are polymorphic. The IdentityP proxy transformer fits this purpose perfectly, meaning that if you don't use any proxy transformers, you just write:
myPipe () = runIdentityP $ pipeCode
The () and runIdentityP are the only syntactic noise now. The () is essential, as I will describe below, and the runIdentityP avoids OverlappingInstances when all the type variables are polymorphic.

With this change I could then lift request and respond over proxy transformers and consolidate the entire Proxy API into the single Proxy type class. this solution is entirely Haskell98 and no longer requires FlexibleContexts. If you use pipes already you will discover that your new type signatures will become gorgeous now.


Type-classed API


pipes-3.0 type classes the entire Proxy API using the Proxy type class, which defines the following three essential functions:
  • (>->)
  • request
  • respond
The entire library implement EVERYTHING using these three functions and the upcoming prompt finalization library also is built entirely on these three functions. I can safely say that you don't need to use any other functions to write fully featured pipes libraries, which dramatically simplifies the API.

I type-classed all the utilities which means that I can now offer two base proxy implementations:
  • ProxyFast The fast proxy implementation from pipes-2.5
  • ProxyCorrect The correct proxy implementation from pipes-2.4
So now you can choose which implementation you prefer. Control.Proxy exports ProxyFast implementation by default, but you can easily switch to the correct implementation. Because all pipes code builds on the Proxy type class you are completely free to pick whichever implementation you prefer and they will both work transparently with all standard libraries.

However, these utilities don't just work with both base proxy implementations; they also now work as proxy transformers, too! This means you can now use these utilities seamlessly within any feature set without any lifting.


Laws


I briefly touched on the proxy laws in my previous post: Concurrency = Lists of Kleisli Arrows. However, there were two mistakes with those sets of laws:
  • The double request law was incorrect
  • The laws were insufficiently general
Fixing the double request law was an interesting challenge, but it was ultimately very rewarding and led to the correct formulation of the laws in terms of two symmetric proxy composition categories:
  • Pull-based category: (>->) and idT (the current one)
  • Push-based category: (>~>) and coidT (the new one)
The pull-based category composes proxies blocked on respond and returns a new proxy blocked on respond. The push-based category composes proxies blocked on request and returns a new proxy blocked on request.

The proxy laws say that these two categories are perfectly symmetric and that your code's behavior never changes under the following functor, which is an isomorphism between the two categories:
  • (>~>) to (>->)
  • (>->) to (>~>)
  • request to respond
  • respond to request
This means that the choice of composition operator is arbitrary, and I standardize on using the pull-based (>->) operator because the library already uses it and also matches the lazy demand-based expectations of Haskell programmers. Therefore, you never have to use the (>~>) operator, but I include in the library as a theoretical curiosity.

I've already spoken about the practical benefit of centralizing on proxies (duplex channels), but I also standardized on proxies because of their elegance, even more elegant than even pipes. For example:
  • The Proxy laws have a purely categorical formulation. Pipes do not permit such a formulation.
  • Proxy Kleisli arrows form morphisms in at least 5 categories, all of which have the same shape. Pipes are morphisms in just two categories (composition and Kleisli categories), and their morphisms do not overlap.
  • You can assemble most useful proxies purely using composition operators: i.e. (>=>) and (>->), unlike Pipes, which have an impedance mismatch between the two categories.
  • You cannot implement a push-based category using Pipes.

Type synonyms


Now type synonyms work incredibly well after I parametrized them to take the Proxy instance as a type parameter. This means that you never have to give them up when things get hairy and if you loved your good old-fashioned and simple Pipe, Consumer and Producer types, you can now use them transparently with the entire proxy ecosystem and they mesh perfectly with everything.

There is another reason I really like proxies: The Producer type synonym forbids requests. You can't do that with the unidirectional pipe implementation. I special case the Producer type synonym to close the upstream end to forbid communication upstream, so now you have a stronger guarantee.

I also went back to using () and C for type synonyms instead of Rank2Types to universally quantify unused ends. Rank2Types caused all sorts of problems when you tried to write proxy combinators that accepted pipes as arguments, whereas the simple approach always works and gives clearer types. The only disadvantage is that if you want to insert a pipe with a blocked end within an open one (like a Producer within a Pipe), you must explicitly reopen the end using the unitD and unitU helper functions.

Also, I chose not to include the initial parameter of proxies in the type synonym. The main reason is that you then lose the type synonym any time you define anything other than a composable proxy, which defeats all the other gains I just mentioned. Also, proxies lie along 5 categories, and special-casing the type synonyms for just one of those categories is a bad idea, especially when at least one of those other categories (the Kleisli category) is commonly used as well.

I also empirically experimented with both approaches in my own projects, and I can pretty confidently state that the type synonyms should never include the initial parameter. It does mean that your type signatures will be slightly longer, but it's worth it. Also, it makes it much easier for people less familiar with the pipes library to consume pipes utilities because they don't need to remember what the initial argument for any given type synonym is supposed to be.


API Consolidation


I went to a great deal of trouble to clean up the fragmentation that I began when I released Frames. Frames are gone and I've merged Pipe functionality into Proxys, so the library now only has three operations you ever need to use:
  • (>->)
  • request
  • respond
There is now just one way to compose things, something which a lot of users have requested.

I've deprecated the old Pipe API and included a "Control.Proxy.Pipe" module which helps users transition from Pipes to Proxys. This module not only provides the transition API but also has detailed instructions for how to upgrade your library while still keeping the niceties of the Pipes API.

In addition, I've consolidated all the tutorials into a single coherent tutorial in Control.Proxy.Tutorial with a single logical progression. The tutorial is long, but describes a great deal of the pipe idioms I've collected through my extensive usage of pipes in my own projects. More importantly, the tutorial collects all of the documentation for the library into a single location again so that users don't need to hunt over several tutorial modules or several blog posts of mine to learn about some key concept.

I also spent a lot of effort into guiding people towards a single natural coding style for pipes to make it easier for pipes users to read each other's code. The API exported from Control.Proxy has very little redundancy, and many old redundant API features (like the old await and yield) require you to pay an "import tax" to the tune of one extra import if you want to deviate from the new pipes "style guide". The tutorial also spends a lot of time talking about common idioms to encourage a uniform coding style.


Performance


The pipes library now includes rewrite rules that rewrite unoptimized user code into the equivalent hand-tuned code. Despite type-classing the entire standard library these rewrite rules fire very robustly without any assistance on your part. Just enjoy the extra performance. There are no performance regressions in this release and all the standard library utilities perform just as damn fast as they did in pipes-2.5, despite now being fully polymorphic and written using ordinary do notation.

Like always, "Control.Proxy.Prelude.Base" provides tons of examples of how to write idiomatic and high-performance pipes code, so feel free to consult that if you are learning how to write your own utilities.


Lightweight


I've dropped Frames, which means that I can now drop the index-core dependency. This means that the library only has one dependency: transformers-0.2. pipes is now the lightest dependency for any streaming library.

I will not add any extra dependencies to the main package and I will release all additional features in separate packages that build on top of pipes. I would like to see the core pipes package go as viral as possible. This is quite easy because in addition to having only a single dependency, pipes is:
  • Quick to compile
  • Safe Haskell
  • Light on extensions: Only KindSignatures and Rank2Types (for MFunctor and PFunctor)

Convenience imports


I added some extra imports from other modules to Control.Proxy, like lift and forever. I was pretty conservative in what I added and stuck to things that I considered essential for using pipes idiomatically.


hoist and lift


I renamed MFunctor's mapT to hoist. Sorry for the inconvenience, but mapT was a terrible name for several reasons and instead went with Edward's naming convention which I felt was more tasteful.

One of the really nice things to come out of pipes is a clean mechanism for mixing arbitrary monad transformer stacks and arbitrary proxy transformer stacks. It turns out that if two transformer stacks have the same base monad, then you can always interleave them in any way using the right combination of hoist and lift.

The exact same trick also applies to proxy transformer stacks. You can always interleave them using the right combination of hoistP (from the new PFunctor class) and liftP, which are just the higher-kinded versions of hoist and lift (and they also obey more laws, too).

The tutorial describes this trick in more detail and I find that this works phenomenally well and is very easy to reason about. For this reason I think MFunctor needs to be somewhere in the standard libraries as soon as possible because it is incredibly useful for this purpose.


Stability


I can state confidently that the pipes API can now handle any problem you throw at it and that the core is feature complete. The tutorial shows how you can combine the incredibly small pipes core into all the functionality you need. The only exception is prompt finalization, which I will describe in a separate upcoming library, but finalization similarly builds entirely on this core and does not add any new concepts or primitives.

There are only a few changes I foresee making to the main library in the future, which are:
  • Moving MFunctor to a better home
  • Adding a base proxy implementation with strictness annotations (never leaks, but half the performance of the fast implementation)
Note that this doesn't mean that pipes frequently leaks space. I've only found one case of leaking, which is the following code segment:
runProxy $ enumFromS 1
Everything else I have tried does not leak and I currently use pipes in long-running server applications of my own that also do not leak without any strictness annotations. However, if users do encounter leaks then I will go ahead and add the strict base implementation.

I will also be pretty conservative about expanding the core library because I want to stabilize it now. Almost all new functions will go in separate libraries now.


Conclusion


At this point I am very happy with the state of the main pipes library and I'm currently devoting my next efforts to standard libraries. The next libraries I will release will be (in this order):
  • Finalization support
  • bytestring / text support
  • Parsing extensions

Wednesday, October 31, 2012

pipes-2.5: Faster and slimmer

Introduction


I optimized the entire pipes library very aggressively for version 2.5, and now the library runs faster than conduit on my micro-benchmarks. I'll begin with the purest benchmark which gives the greatest difference in speed since it only measure the efficiency of each library's implementation without any IO bottlenecks:
import Control.Proxy
import Data.Conduit
import Data.Conduit.List as L

n = 1000000 :: Int

-- Pipes
main = runProxy $ discard <-< enumFromToS 1 n

-- Conduit
main = L.enumFromTo 1 n $$ L.sinkNull
Note some differences from last time. This time I'm using conduit's built-in optimized discard equivalent: sinkNull. Also, I've multiplied n by 10 to more accurately measure throughput. I compile both implementations with -O2.

pipes now spends about 112 ns per round-trip:
real    0m0.112s
user    0m0.104s
sys     0m0.004s
... while conduit spends about 167 ns per round-trip:
real    0m0.167s
user    0m0.156s
sys     0m0.008s
I achieved this speed increase by reverting to the pipes-1.0 trick of making the base monad optional, at the expense of breaking the monad transformer laws. I spent a considerable amount of effort trying to get the correct version to work, but I was led inexorably to the same conclusion that Michael already reached, which was that the original approach was best and that the gain in performance is worth bending the monad transformer laws.

Note that the above benchmark exaggerates the differences and is not indicative of real-world performance differences. For typical code you will not observe measurable differences between pipes and conduit when IO is the bottle-neck.

There is also one area in which conduit may still give (very slightly) better performance, which is in speeding up user-defined pipes. One goal I did not complete for this release was copying Michael's trick of using rewrite RULES to inline the Monad instance for user-defined pipes. I plan to copy this same trick in a separate release because I want to take the time to ensure that I can get the rewrite RULES to always fire without interfering with other optimizations.


Light-weight


The big focus of this release was to make pipes a very light-weight dependency, both in terms of performance and transitive dependencies. In the rewrite I dropped the free dependency so now the package only has two non-base dependencies:
  • transformers >= 0.2.0.0
  • index-core
... and I plan on dropping index-core along with Frames once I complete my resource management solution, leaving just a transformers dependency, which is about as light-weight as it could possibly get.

I also stole a page from Michael's book, by removing the -O2 flag from the pipes.cabal file. This flag no longer has any effect on performance after the rewrite, so you should see quicker compile times, making the pipes dependency even lighter.

There is still one other way I could make the pipes dependency even lighter, which is to remove the MFunctor class. The Control.MFunctor module requires Rank2Types, which might rule out pipes for projects that use non-GHC compilers, so if this is an issue for you, just let me know and I will try to migrate MFunctor to a separate library. Frames use a lot of extensions, but those will be on the way out, leaving behind just FlexibleContexts and KindSignatures, which are very mild extensions.


Resource management


I also wanted to use this update to point out that you can get deterministic resource management with pipes today if you use Michael's ResourceT in the base monad. So if you want to use pipes and all you care about is resource determinism then you can switch over already.

However, that alone will NOT give you prompt finalization and if you want promptness you will have to wait until I complete my own resource management extension. The extension I have in mind will be released as a proxy transformer that you can layer in any proxy transformer stack, so any proxy code you currently write can be transparently upgraded to work with resource management later on when I release the extension.

Another thing I want to mention is that while I will release the tools to manage resources promptly and deterministically, I do not plan on using these tools in the proxy standard libraries that I will release. The main reason for this is that:
  • There is no one true solution to finalization and I don't want people to have to buy in to my finalization approach to use the standard libraries I provide.
  • Most people I've talked to who care about finalization usually take the initiative write their own higher-level abstractions on top of whatever finalization primitives I provide them.
So my plan is that the standard libraries I will write will purely focus on the streaming part of the problem and leave initialization/finalization to the end user, which they can optionally implement using my resource management solution or whatever other approach they prefer (such as ResourceT, for example).

If you want to know what I personally use in my own projects at the moment, I just use the following pattern:
do withResource $ \h -> 
   runProxy $ ... <-< streamFromResource h
This gives "good enough" behavior for my purposes and out of all the finalization alternatives I've tried, it is by far the easiest one to understand and use.

The other reason I'm trying out this agnostic approach to finalization is due to discussion with Gregory Collins about his upcoming io-streams library, where he takes a very similar approach to the one I just described of leaving initialization/finalization to the end user to avoid cross-talk between abstractions and to emphasize handling the streaming aspect correctly.


Goals for the near future


I focused on improving the performance of pure code because I plan to release bytestring/text standard libraries and their corresponding parsing proxy transformers very soon, which demand exceptional pure performance. The goal of the upcoming proxy-based parsing libraries is not to beat attoparsec in speed (which I'm reasonably sure is impossible), but rather to:
  • Interleave parsing with with effects
  • Provide a low-memory streaming parser by allowing the user to selectively control backtracking
  • Still be really fast
The first two features are sorely missing from attoparsec, which can't interleave effects and always backtracks so that the file isn't cleared from memory until the parsing completes. For my own projects I need the second feature the most because I get a lot of requests to parse huge files (i.e. 20 GB) that do not fit in my computer's memory. More generally, I want pipes to be the fallback parser of choice for all problems that attoparsec does not solve.

Saturday, October 20, 2012

"Hello, core!"

Haskell optimization is very opaque and getting tight inner loops is something of a black art for people who don't take the time to learn GHC's core output. If you like fine-grained optimization, this post will walk you through a very simple example to help you learn to read core.

I'm planning to write some posts in the future discussing optimizing Haskell code and I would like to use this post as a foundation for those latter discussions.


ghc-core


When you practice reading core, stick to really simple programs. In fact, let's study the core generated by the simplest possible Haskell program:
main :: IO ()
main = return ()
The ghc-core tool produces a (slightly) readable core output. To install it, just type:
cabal install ghc-core
... and cabal should install it to ~/.cabal/bin/ghc-core.

We generate core output by running:
~/.cabal/bin/ghc-core --no-cast --no-asm program.hs
The --no-cast flag improves the readability by removing type casts and the --no-asm flag says not to include the final assembly code.

The above command automatically enables all optimizations and outputs to a pager the following two sections:
  • A list of optimizations that fired
  • The core section, which contains the intermediate core representation
The optimization section should look something like this:
0 Lets floated to top level; 0 Lets floated elsewhere; from ...

0 Lets floated to top level; 0 Lets floated elsewhere; from ...

Total ticks:     15

2 PreInlineUnconditionally
  1 x_abD
  1 s_abE
5 UnfoldingDone
  1 returnIO
  1 runMainIO
  1 main
  1 returnIO1
  1 $fMonadIO_$creturn
1 RuleFired 1 Class op return
2 LetFloatFromLet 2
2 EtaExpansion
  1 :main
  1 main
3 BetaReduction
  1 a_abC
  1 x_abD
  1 s_abE
10 SimplifierDone 10
... and the core should look something like this:
Result size = 20

main1
  :: State# RealWorld
     -> (# State# RealWorld, () #)
[GblId,
 Arity=1,

 Unf=Unf{Src=, TopLvl=True, Arity=1, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=True)}]
main1 =
  \ (eta_B1 :: State# RealWorld) ->
    (# eta_B1, () #)

main :: IO ()
[GblId,
 Arity=1,

 Unf=Unf{Src=, TopLvl=True, Arity=0, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=True)}]
main =
  main1
  

main2
  :: State# RealWorld
     -> (# State# RealWorld, () #)
[GblId,
 Arity=1,

 Unf=Unf{Src=, TopLvl=True, Arity=1, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=IF_ARGS [0] 30 0}]
main2 =
  \ (eta_X7 :: State# RealWorld) ->
    runMainIO1
      @ ()
      (main1
       )
      eta_X7

:main :: IO ()
[GblId,
 Arity=1,

 Unf=Unf{Src=, TopLvl=True, Arity=0, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=True)}]
:main =
  main2
That still looks pretty ugly, so let's filter it a bit.


Core


You'll notice that every function has a type signature, without exception. For example, main1's type is:
main1
  :: State# RealWorld
     -> (# State# RealWorld, () #)
However, each function also has a set of attributes that ghc uses to guide various optimization passes. main1's attributes are:
[GblId,
 Arity=1,

 Unf=Unf{Src=, TopLvl=True, Arity=0, Value=True,
         ConLike=True, Cheap=True, Expandable=True,
         Guidance=ALWAYS_IF(unsat_ok=True,boring_ok=True)}]
For example, Cheap=True says that it is cheap to inline main1. Similarly, ConLike=True says it is okay if a rewrite rule requires duplicating main1's code.

However, for the purposes of this post, we will just completely ignore these annotations and just focus on the type signatures and code. I'll remove the attributes to leave behind something easier on the eyes:
main1
  :: State# RealWorld
     -> (# State# RealWorld, () #)
main1 =
  \ (eta_B1 :: State# RealWorld) ->
    (# eta_B1, () #)

main :: IO ()
main =
  main1

main2
  :: State# RealWorld
     -> (# State# RealWorld, () #)
main2 =
  \ (eta_X7 :: State# RealWorld) ->
    runMainIO1
      @ ()
      (main1
       )
      eta_X7

:main :: IO ()
:main =
  main2
Now we see that there are four functions total, but a lot of noise still remain.


Primitive types and values


Let's walk through this core, beginning with the type signature of main1:
main1
  :: State# RealWorld
     -> (# State# RealWorld, () #)
Somebody sprinkled #s all over our type signature. These #s denote primitive types and values exposed by the compiler that are not defined within the Haskell language. You can check out the GHC.Prim module which declares these values and type. Notice how it cheats and declares all types empty and the values are all set to let x = x in x just so that everything type-checks.

You can learn more about these primitive types and operations by reading the section on Unboxed types and primitive operations in the GHC user guide.


IO is a state monad


GHC implements IO as a newtype wrapper around a state monad:
newtype IO a = IO (State# RealWorld -> (# State# RealWorld, a #))
Notice that this is just the right type to wrap main1, which uses the same stateful pattern:
main1 :: State# RealWorld -> (# State# RealWorld, () #)
IOs state-based implementation ensures proper sequencing of computations and the State# RealWorld token is an empty token that exists solely to ensure that each sequenced computation depends on the previous one so that the compiler does not rearrange, merge or duplicate them.


The chain of command


Now, let's study the definition of main1:
main1 =
  \ (eta_B1 :: State# RealWorld) ->
    (# eta_B1, () #)
If we tidy this up and remove the type annotation, we get:
main1 = \s -> (s, ())
This is just return () in the State (State# RealWorld) monad. The compiler translated our original return () statement into the equivalent stateful function under hood.

However, the next function seems to accomplish nothing:
main :: IO ()
main =
  main1
... but that's only because the --no-casts flag removed type casts. If we add them back in, we get:
main =
  main1
  `cast` (Sym (NTCo:IO <()>)
          :: (State# RealWorld
              -> (# State# RealWorld, () #))
               ~#
             IO ())
That `cast` statement corresponds to the IO constructor from our IO newtype:
IO :: (State# RealWorld -> (# State# RealWorld, a #)) -> IO a
This constructor hides the underlying stateful representation behind the opaque IO newtype. However, unlike data constructors, newtype constructors cost nothing and don't translate into functions with a run-time overhead. Instead, they translate into free compile-time cast statements which the compiler erases once compilation is complete.

This explains why main has a different type from main1:
main :: IO ()
main packages our impure code inside the IO newtype, whereas main1 contains the raw underlying stateful representation.

In fact, main corresponds exactly to the main function we wrote in our original code (which is why it has the exact same name):
-- The original program
main :: IO ()
main = return ()
The next function appears to run our code:
main2 =
  \ (eta_X7 :: State# RealWorld) ->
    runMainIO1
      @ ()
      (main1
       )
      eta_X7
With a little tidying up, this becomes:
main2 = \s -> runMainIO1 main1 s
But wait, it runs main1 and not main! I'm guessing that the main token exists solely to establish the correspondence to the user-written code, but perhaps ghc doesn't actually care about it and instead uses the raw main1 directly.

Now, to be honest, I don't actually know what runMainIO1 does, but according to Edward Yang, it initializes some things like interrupt handlers before running the program.

Finally, we reach the true entry point of our program:
:main :: IO ()
:main =
  main2 -- There is a hidden cast here
This passes our program to the Haskell runtime to be executed. That's it! We're done.

Monday, October 15, 2012

Parsing chemical substructures

This is the second in a series of coding examples from my own work. /r/programming asked for non-trivial, yet digestible, Haskell examples, so I hope this fits the bill.

In this post I will show how learning Haskell changes the way you think. I will take a common Haskell idiom (monadic parsing) and apply it in a new way to solve a structural bioinformatics problem. To make this post more accessible to non-Haskell programmers, I'm going to gloss over implementation details and instead try to discuss at a high-level how I apply Haskell idioms to solve my problem.


The Parser type


When we approach parsing problems we normally focus on the text as the central element of our algorithm and design our program around combining and manipulating text. Monadic parsing turns this approach on its head and places the parser front and center, where we instead combine and manipulate parsers.

This means we need to define a Parser type:
type Parser a = String -> [(a, String)]
A Parser a takes a starting String as its input and parses it to return a single value of type a and the remaining unconsumed String. There might be multiple valid parses, so we actually return a list of possible parsings instead of a single one. If the parse fails, we return an empty list signifying no valid parsings.

We can also use Haskell's newtype feature to encapsulate the implementation and hide it from the user:
newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }
This defines the Parser constructor which we use to wrap our parsing functions:
Parser :: (String -> [(a, String)]) -> Parser a
.. and the runParser function which unwraps the Parser to retrieve the underlying function:
runParser :: Parser a -> (String -> [(a, String)])
Additionally, this encapsulation enables some "magic" later on.


Parsers


Haskell has a Bool data type, defined as:
data Bool = True | False
... so let's define a Parser that parses a True value:
import Data.List

parseTrue :: Parser Bool
parseTrue = Parser (\str ->
    if (isPrefixOf "True" str)
    then [(True, drop 4 str)] -- Parse succeeds: 1 result
    else []                   -- Parse fails   : 0 results
    )
Similarly, we can define a Parser for False:
parseFalse :: Parser Bool
parseFalse = Parser (\str ->
    if (isPrefixOf "False" str)
    then [(False, drop 5 str)] -- Parse succeeds: 1 result
    else []                    -- Parse fails   : 0 results
    )
Let's test out our parsers using ghci:
>>> runParser parseTrue "True Story"
[(True, " Story")]
>>> runParser parseFalse "True Story" -- Fails: not False
[]
>>> runParser parseFalse "Falsehood"
[(False, "hood")]
>>> runParser parseFalse " Falsehood" -- Fails: leading space
[]
What if we want to also skip initial spaces? We can define a parser that always succeeds and returns the string trimmed of all leading spaces:
skipSpaces :: Parser ()
skipSpaces = Parser (\str ->
    [((), dropWhile (== ' ') str)] -- Always succeeds: 1 result
    )
Let's confirm it works:
>>> runParser skipSpaces "          Falsehood"
[((), "Falsehood")]
>>> runParser skipSpaces "Apple"
[((), "Apple")]

Monads


Now we want to combine our Parsers so we can parse both a True and a False with optional spaces in between. This means we need some elegant way to take the unconsumed input from each Parser and feed it directly into the next Parser in the chain.

Fortunately, Haskell solves this problem cleanly using monads. A Monad defines an interface to two functions:
class Monad m where
    return :: a -> m a
    (>>=)  :: m a -> (a -> m b) -> m b
Like interfaces in any other language, we can program generically to that interface. Haskell's do notation works with this generic Monad interface, so we can use the imperative do syntax to manipulate anything that implements Monad.

Our Parser implements the Monad interface quite nicely:
instance Monad Parser where
    return a = Parser (\str  -> [(a, str)])
    m >>= f  = Parser (\str1 ->
        -- This is a list comprehension, basically
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2]
        )
This is not a monad tutorial, so I'm glossing over why that is the correct definition or what it even means, but if you want to learn more about monads, I highly recommend: You could have invented monads.

When we make Parser a Monad, we gain the ability to assemble Parsers using do notation, so let's use do notation to combine multiple parsers in an imperative style:
trueThenFalse :: Parser (Bool, Bool)
trueThenFalse = do
    t <- parseTrue
    skipSpaces
    f <- parseFalse
    return (t, f)
That reads just like imperative code: parse a True, skip some spaces, then parse a False. Finally, return the two values you parsed. This seems straightforward enough until you realize we haven't actually parsed any text, yet! All we've done is combine our smaller parsers into a larger parser poised to be run on as many inputs as we please.

Let's make sure it works as advertised:
>>> runParser trueThenFalse "True   False leftovers"
[((True, False), " leftovers")]
>>> runParser trueThenFalse "False   True"
[]

Alternatives


Sometimes we want to try multiple parsing alternatives. For example, what if I want to parse a True or a False? I can define a (<|>) operator that tries both parsers and then returns the union of their results:
(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser (\str ->
    runParser p1 str ++ runParser p2 str
    )
Now I can parse a Bool value without specifying which one and the parser will return which one it parsed:
parseBool :: Parser Bool
parseBool = parseFalse <|> parseTrue
>>> runParser parseBool "True Story"
[(True, " Story")]
>>> runParser parseBool "Falsehood"
[(False, "hood")]


Parsing chemistry


Parsers have one more feature that might surprise you: There is nothing String-specific about them! With one tiny modification, we can generalize them to accept any type of input:
newtype Parser s a
  = Parser { runParser :: s -> [(a, s)] }

instance Monad (Parser s) where
    <EXACT same code as before>

(<|>) :: Parser s a -> Parser s a -> Parser s a
(<|>) = <EXACT same code as before>
Since s is "polymorphic", we can set it to any conceivable type and the above code still works. The only String-specific behavior lies within the specific parser definitions, and their new compiler-inferred types reflect that:
-- These parsers only accept Strings as input
parseFalse    :: Parser String Bool
parseTrue     :: Parser String Bool
trueThenFalse :: Parser String (Bool, Bool)
But there's no reason we can't define entirely different Parsers that accept completely different non-textual input, such as chemical structures. So I'll switch gears and define parsers for chemical Structures, where a Structure is some sort of a labeled graph:
data Structure = Structure {
    graph :: Graph          , -- Adjacency list
    atoms :: Vector AtomName} -- The node labels
... with some convenience functions I've defined for manipulating the Graph:
-- Return the edges of the graph
bonds :: Graph -> [Edge]

-- Remove an edge from the graph
deleteBond :: Edge -> Graph -> Graph
Now, I can define new Parsers that operate on Structures instead of Strings.


Parsing bonds


The most primitive parser I'm interested in parses a single bond. It requires two AtomNames which specify what kind of bond to look for (i.e. a carbon-carbon bond, except it can be even more specific). Then, it outputs which two indices in the graph matched that bond-specification:
parseBond :: AtomName -> AtomName -> Parser Structure (Int, Int)
I can use the list monad (i.e. a list comprehension) to define this primitive parser (and don't worry if you can't precisely follow this code):
parseBond name1 name2
  = Parser $ \(Structure oldGraph atoms) -> do
    -- The first atom must match "name1"
    i1 <- toList (findIndices (== name1) atoms)

    -- Some neighboring atom must match "name2"
    i2 <- filter (\i -> atoms ! i == name2) (oldGraph ! i1)

    -- Remove our matched bond from the graph
    let newGraph = deleteBond (i1, i2) oldGraph

    -- .. and return the matched indices
    return ((i1, i2), Structure newGraph atoms)
Haskell strongly encourages a pure functional style, which keeps me from "cheating" and using side effects or mutation to do the parsing. By sticking to a pure implementation, I gain several bonus features for free:
  • If our bond occurs more than once, this correctly matches each occurrence, even if some matches share an atom
  • If both AtomNames are identical, this correctly returns both orientations for each matched bond
  • This handles backtracking with (<|>) correctly
  • I can parallelize the search easily since every search branch is pure
I got all of that for just 6 lines of code!


Parsing substructures


Now I can build more sophisticated parsers on top of this simple bond parsers. For example, I can build a generic substructure parser which takes a sub-Structure to match and returns a list of matched indices:
parseSubstructure :: Structure -> Parse Structure [Int]
Again, if you don't precisely understand the code, that's okay:
parseSubstructure (Structure graph as)
    -- Use the State monad to keep track of matches
  = (`evalStateT` (V.replicate (V.length as) Nothing)) $ do

        -- foreach (i1, i2) in (bonds graph):
        forM_ (bonds graph) $ \(i1, i2) -> do

            -- Match the bond
            (i1', i2') <- lift $ parseBond (as ! i1) (as ! i2)

            -- The match must be consistent with other matches
            matches    <- get
            let consistent i1 i1' = case (matches ! i1) of
                    Nothing   -> True
                    Just iOld -> iOld == i1'
            guard (consistent i1 i1' && consistent i2 i2')

            -- Update the match list
            put (matches // [(i1, Just i1'), (i2, Just i2')])

        -- Return the final list of matches
        matchesFinal <- get
        justZ . sequence . toList $ matchesFinal
Like before, the code detects all matching permutations and backtracks if any step fails.


Reusable abstractions


I find it pretty amazing that you can build a substructure parser in just 18 lines of Haskell code. You might say I'm cheating because I'm not counting the amount of lines of code I took to define the Parser type, the Monad implementation, and the (<|>) type. However, the truth is I can actually get all of those features using 1 line of Haskell:
type Parser s = StateT s []
-- and rename all 'Parser' constructors to 'StateT'
So I lied: it's actually 19 lines of code.

I don't expect the reader to know what StateT or [] are, but what you should take away from this is that both of them are part of every Haskell programmer's standard repertoire of abstractions.

Moreover, when I combine them I automatically get a correct Monad implementation (i.e. do notation) and a correct Alternative implementation (which provides the (<|>) function), both for free!


Conclusions


This is just one of many abstractions I used to complete a structural search engine for proteins. Now that it's done, I'll be blogging more frequently about various aspects of the engine's design to give people ideas for how they could use Haskell in their own projects. I hope these kinds of code examples pique people's interest in learning Haskell.


Appendix


I've included the full code for the String-based Parsers. The Structure-based Parsers depend on several project-specific data types, so I will just release them later as part of my protein search engine and perhaps factor them out into their own library.

Also, as a stylistic note, I prefer to use ($) to remove dangling final parentheses like so:
Parse (\str ->        =>  Parse $ \str ->
    someCode          =>      someCode
    )                 =>
... but I didn't want to digress from the post's topic by explaining how the ($) operator behaves.
import Data.List

newtype Parser a
  = Parser { runParser :: String -> [(a, String)] }

parseTrue :: Parser Bool
parseTrue = Parser (\str ->
    if (isPrefixOf "True" str)
    then [(True, drop 4 str)]
    else []
    )

parseFalse :: Parser Bool
parseFalse = Parser (\str ->
    if (isPrefixOf "False" str)
    then [(False, drop 5 str)]
    else []
    )

skipSpaces :: Parser ()
skipSpaces = Parser (\str ->
    [((), dropWhile (== ' ') str)]
    )

instance Monad Parser where
    return a = Parser (\str  -> [(a, str)])
    m >>= f  = Parser (\str1 ->
        [(b, str3) | (a, str2) <- runParser m     str1,
                     (b, str3) <- runParser (f a) str2]
        )

trueThenFalse :: Parser (Bool, Bool)
trueThenFalse = do
    t <- parseTrue
    skipSpaces
    f <- parseFalse
    return (t, f)

(<|>) :: Parser a -> Parser a -> Parser a
p1 <|> p2 = Parser (\str ->
    runParser p1 str ++ runParser p2 str
    )

parseBool :: Parser Bool
parseBool = parseFalse <|> parseTrue

Saturday, October 6, 2012

pipes-2.4: Proxy transformers, extra categories, utilities, and benchmarks

This release packs a LOT of new features, so I will begin with the most significant feature: proxy transformers. The proxy transformer pattern provides a very simple extension framework that cleanly solves many problems that iteratee library authors face.

Users of the library should read Control.Proxy.Trans.Tutorial, which explains how proxy transformers work. However, this post also provides a decent introduction to them, too.

Introduction


Wouldn't it be nice if you could catch and handle errors within a proxy? Now you can! It's as simple as:
import Control.Monad (forever)
import Control.Monad.Trans (lift)
import Control.Proxy
import Control.Proxy.Trans.Either as E
import Safe (readMay)

promptInts :: () -> EitherP String Proxy C () () Int IO r
promptInts () = recover $ forever $ do
    str <- lift getLine
    case readMay str of
        Nothing -> E.throw "Could not parse an integer"
        Just n  -> liftP $ respond n

recover p =
    p `E.catch` (\str -> lift (putStrLn str) >> recover p)

main = runProxy $ runEitherK $ mapP printD <-< promptInts
>>> main
1<Enter>
1
Test<Enter>
Could not parse an integer
Apple<Enter>
Could not parse an integer
5<Enter>
5
The above program condenses many new features of this release into a nice compact example and I'll use it to show-case each feature.


Proxy transformers


The above program uses the EitherP proxy transformer. To access this feature, you just import the transformer you wish to use:
import Control.Proxy.Trans.Either as E
Control.Proxy imports all the remaining machinery you need.

EitherP extends any proxy-like type with the ability to throw and catch errors locally, as if it lived inside a native EitherT block. It does so in such a way that preserves composition (and the category laws!), so you can directly compose the result without unwrapping the EitherP.

When you are done composing, just use runEitherK to convert it back to the underlying proxy:
runEitherK
  :: (q -> EitherP e p a' a b' b m r )
  -> (q -> p a' a b' b m (Either e r))

Utilities


This release introduces the "proxy prelude", a set of convenience functions for users of the library. Control.Proxy automatically exports these and they don't clash with the Prelude or any common libraries.

Our old friend printer got a name-change and now goes by printD. This utility function prints all values bound downstream:
printD :: Show a => x -> Proxy x a x a IO r
I provide many more utility functions under the Control.Proxy.Prelude hierarchy, and people who enjoyed my functor design pattern post will also enjoy the abundance of cute trivial examples of the functor pattern in the documentation for Control.Proxy.Prelude.Base.


Proxy transformers are functors


However, this release includes a far more sophisticated set of functors: the proxy transformers themselves. Each proxy transformer implements the ProxyTrans class which defines two functions: mapP and liftP, related by the equation:
mapP = (liftP .)
mapP defines two separate functors.

The first functor behaves like a traditional monad transformer, converting the base Kleisli category to the extended Kleisli category:
mapP return = return

mapP (f >=> g) = mapP f >=> mapP g
You can write these laws using liftP to see that our proxy transformers behave like ordinary monad transformers:
liftP $ return x = return x

do x <- liftP m
   liftP $ f x
= liftP $ do x <- m
             f x
The above program uses this capacity of liftP to lift operations from the Proxy monad to the EitherP String Proxy monad.

The second functor lifts the base proxy composition to the extended proxy composition:
mapP idT = idT

mapP (p1 >-> p2) = mapP p1 >-> mapP p2
This latter functor lets you compose simpler proxies with extended proxies. The above program uses mapP in this capacity to promote printD for composition with promptInts:
mapP printD <-< promptInts
This demonstrates a concrete application of the functor design pattern, allowing seamless interoperability between proxies written to varying feature sets. The proxy transformers lift both the monad instance and the composition instance correctly so that simpler proxies play nicely with extended proxies.


Type signatures


The above program demos the new replacement for Void: C. This will shorten type signatures and also removes the dependency on void.

Also, now the Proxy type is a newtype around the underlying FreeT implementation. This gives nicer type errors when things go wrong.


Proxy Transformer Stacks


Just like monad transformers, you can stack proxy transformers to automatically combine their effects. By combining the StateP and EitherP proxy transformers, you can implement non-backtracking parsers for free:
{-# LANGUAGE GeneralizedNewtypeDeriving, OverloadedStrings #-}

import Control.Monad.Trans
import Control.Proxy
import Control.Proxy.Trans.Either as E
import Control.Proxy.Trans.State
import Data.Text as T hiding (take)

newtype ParseP p a' a b' b m r =
    ParseP { unParseP ::
        StateP Text (EitherP Text p) a' a b' b m r }
    deriving (Monad, MonadTrans, Channel)

instance ProxyTrans ParseP where
    liftP = ParseP . liftP . liftP

runParseK
  :: (q -> ParseP p a' a b' b m r)
  -> (q -> p a' a b' b m (Either Text (r, Text)))
runParseK = runEitherK . runStateK T.empty . (unParseP .)
The Channel type class defines proxy composition, so we can compose our parsing proxies seamlessly.

Let's write a few parsing primitives:
import Data.Monoid
import Data.Text.IO as T
import Prelude hiding (take)

take n = ParseP go where
    go = do
        s <- get
        if (T.length s < n)
        then do
            s' <- liftP $ liftP $ request ()
            put (s <> s')
            go
        else do
            let (h, t) = T.splitAt n s
            put t
            return h

parseFail str = ParseP $ liftP $ E.throw str

string str = do
    str' <- take (T.length str)
    if (str' == str)
    then return str
    else parseFail $
        "Expected: " <> str <> " -- Found: " <> str'
You wouldn't even know those were proxies if it were not for that single request statement.

Let's write a contrived parser based off of those primitives:
parser () = do
    string "Hello"
    str <- take 5
    lift $ T.putStrLn str
... and supply it with some input:
source () = do
    respond "Hell"
    respond "o, world!"
Now compose!
>>> runProxy $ runParserK $ parser <-< mapP source
, wor
Right ((),"ld!")
Let's see how failed parses turn out:
invalid () = do
    respond "A"
    respond "AAAAAAAA"
>>> runProxy $ runParseK $ parser <-< mapP invalid
Left "Expected: Hello -- Found: AAAAA"
I didn't include parsers in the library because I didn't want to add a bytestring or text dependency to the main pipes package. Instead, I will release the parsing extension as a separate library. This library will provide you with the streaming benefits of attoparsec with the ability to interleave effects.


Pushback


The above parsing example suggests my solution to push-back, which is to give each proxy its own local state using the StateP proxy transformer. You can then use the local state to keep track of unused input, as the above parsing example did.

Like all proxy transformers, this extension requires no special integration with the underlying proxy type and you can layer it anywhere within a proxy transformer stack with no special considerations.


Extra categories


The library now provides two additional categories for interacting with the Proxy type. These are term-rewriting categories (I believe the technical term is "sesquicategory", but I may be mistaken).

The first category's composition operator replaces all request statements within a Proxy with a suitably typed replacement:
f /</ g -- Replace all occurrences of 'request' in 'f' with 'g'
request is the identity of this category, so we expect that:
-- Replacing 'request' with 'request' changes nothing
f /</ request = f

-- Replacing 'request' with 'f' gives 'f'
request /</ f = f
Also, this substitution is associative:
(f /</ g) /</ h = f /</ (g /</ h)
Similarly, the respond command has its own substition operator, (\<\), and they form their own category:
f \<\ g  -- Replaces all 'respond's in 'g' with 'f'

f \<\ respond = f

respond \<\ f = f

(f \<\ g) \<\ h = f \<\ (g \<\ h)
Each category distributes in one direction over the Kleisli category:
-- Distributivity
r \<\ (f <=< g) = (r \<\ f) <=< (r \<\ g)

-- Zero
r \<\ return = return

-- Distributivity
(f <=< g) /</ r = (f /</ r) <=< (g /</ r)

-- Zero
return /</ r = return

Lifting request and respond


I originally envisioned that proxy transformers would also automatically lift request and respond statements. The laws for this lifting are quite simple:
mapP request = request

mapP respond = respond
In other words, the functor laws, applied to the identity of the two new categories I just introduced. However, unfortunately Haskell's type class system severely got in my way and I could not solve the issue before the release. I have a tentative plan for how to solve this using Edward's constraint package but it will take time. Until then, you will have to manually lift request and respond statements from the base Proxy type.

Overall, I was pretty disappointed with Haskell's type class system (more so than usual). This library really exercised it considerably and I even had to drop an additional proxy transformer because it was unimplementable due to the broken constraint system.


Performance


Raw proxies give performance comparable to conduit when doing simple IO:
import Control.Monad
import Control.Monad.Trans
import Control.Proxy hiding (await)
import Data.Conduit
import Data.Conduit.List as L
import Data.Maybe (fromJust) -- You did not see this

n = 100000 :: Int

-- Choose your poison
main = runProxy $ printD <-< enumFromToS 1 n
main = L.enumFromTo 1 n
    $$ forever (await >>= lift . print . fromJust)
Using pipes:
real    0m1.761s
user    0m0.384s
sys     0m0.712s
Using conduit:
real    0m1.528s
user    0m0.224s
sys     0m0.660s
Conduit is 15% faster.

The margin is substantially larger for entirely pure code:
import Control.Monad
import Control.Monad.Trans
import Control.Proxy hiding (await)
import Data.Conduit
import Data.Conduit.List as L

n = 100000 :: Int

main = runProxy $ discard <-< enumFromToS 1 n

discard' = do
    a <- await
    case a of
        Nothing -> return ()
        Just _  -> discard'

main = L.enumFromTo 1 n $$ discard'
Using pipes:
real    0m0.085s
user    0m0.088s
sys     0m0.000s
Using conduit:
real    0m0.011s
user    0m0.004s
sys     0m0.004s
Conduit is almost 8(!) times faster.

Conduit dramatically improves for entirely pure code since it bends the monad transformer laws to skip binds in the base monad. This is one reason that this pipes release type-classes all the Proxy operations. If people request that I copy conduit's approach, I will release a separate library that copies conduit's optional monad bind and have it implement all the same type-classes. Then all the proxy transformers are guaranteed to work transparently with it because they abstract completely over the type classes.

Additionally, I want to note that the pipes library currently has only one optimization PRAGMA in the entire library:
{-# INLINABLE mapK #-} -- An obscure utility function
... whereas conduit uses a considerable number of rewrite rules and INLINABLE statements. I don't know how much these contribute to conduit's speed, but I will copy Michael's optimizations in the next few releases and benchmark how much they contribute to performance.

Additionally, I've also benchmarked the overhead of proxy transformers. First, comparing performance for some trivial IO:
import Control.Monad
import Control.Monad.Trans
import Control.Proxy
import Control.Proxy.Trans.Writer
import Data.Monoid

n = 100000 :: Int

main = runProxy $ without <-< enumFromToS 1 n

main :: IO ((), Sum Int)
main = runProxy $ runWriterK $ with <-< mapP (enumFromToS 1 n)

with
 :: (Monoid w, Show a)
 => () -> WriterP w Proxy () a () C IO r
with () = forever $ do
    n <- liftP $ request ()
    lift $ print n

without :: (Show a) => () -> Proxy () a () C IO r
without () = forever $ do
    n <- request ()
    lift $ print n
Using the bind in the WriterP w Proxy monad (i.e. with):
real    0m1.739s
user    0m0.396s
sys     0m0.680s
Using the bind in the Proxy monad (i.e. without):
real    0m1.704s
user    0m0.368s
sys     0m0.668s
A difference of 2%(!).

Again, the difference widens if you switch to pure code:
import Control.Monad
import Control.Monad.Trans
import Control.Proxy
import Control.Proxy.Trans.Writer
import Data.Monoid

n = 100000 :: Int

main = runProxy $ without <-< enumFromToS 1 n

main :: IO ((), Sum Int)
main = runProxy $ runWriterK $ with <-< mapP (enumFromToS 1 n)

with
 :: (Monoid w, Show a)
 => () -> WriterP w Proxy () a () C IO r
with () = forever $ liftP $ request ()

without :: (Show a) => () -> Proxy () a () C IO r
without () = forever $ request ()
Using WriterP w Proxy's bind:
real    0m0.134s
user    0m0.124s
sys     0m0.008s
Using Proxy's bind:
real    0m0.084s
user    0m0.076s
sys     0m0.004s
Now it's about a factor of 2.

So I can summarize these benchmarks by saying that if you are doing even a little bit of IO, the performance differences are pretty small, and as I aggressively optimize the library, they should get even smaller.


Switch to free


Edward was kind enough to migrate my transformers-free functionality into his free package, so now pipes uses free for its free monad transformer dependency.


Resource management


I plan on releasing a Proxy-like type that implements resource management that will replace the Frame type. This type will include functions to promote existing Proxy code to this resource-managed version. Until then, you will have to manually manage resources by opening all file handles before composition, and closing them all afterwards, like so:
import Control.Proxy
import System.IO

main = do
    h <- openFile "test.txt" WriteMode
    runProxy $ hPrintD h <-< enumFromToS 1 10
    hClose h
... or you can use Michael's ResourceT in the base monad, if that is your thing.

You won't get the benefit of conserving handles, but you will still get predictable streaming performance.


Library writers


If you are considering building off the pipes library, I recommend implementing any functionality using the Proxy type, which I guarantee will be promotable to any future extensions, and I plan on personally writing several Proxy-based libraries over the next few months.

While I still preserve the Pipe type, I fully endorse the Proxy type as the type to standardize on as it has many more nice theoretical properties than the Pipe type and also supports greater functionality.


Conclusions


This release is very close to the final state I envisioned for the core pipes library. Most existing features won't disappear, with the exception of Control.Frame, which I will phase out once I release a suitable replacement in a separate library.

Most additional features that I plan on implementing will go into separate libraries that build on top of this one. I only plan on adding functionality to the core library if I discover additional interesting structure for the Proxy type.

Wednesday, September 19, 2012

The MonadTrans class is missing a method

My work on pipes-2.4 leads me to the inescapable conclusion that the MonadTrans class is incomplete. In an ideal world, this is what it should actually look like:
{-# LANGUAGE Rank2Types #-}

class MonadTrans t where
    lift  :: (Monad m, Monad (t m)) => m a -> t m a
    embed :: (Monad m, Monad (t m), Monad (t n))
          => (forall a .   m a -> t n a)
          -> (forall b . t m b -> t n b)
          -- This last forall is optional

(>|>)
 :: (Monad f, Monad g, Monad (t g), Monad (t h),
     MonadTrans t)
 => (forall a . f a -> t g a)
 -> (forall b . g b -> t h b)
 -> (forall c . f c -> t h c) -- This last forall is optional
(f >|> g) x = embed g (f x)

squash :: (Monad (t (t m)), Monad (t m), MonadTrans t)
       => t (t m) a -> t m a
squash = embed id

mapT
 :: (Monad m, Monad n, Monad (t m), Monad (t n), MonadTrans t)
 => (forall a . m a -> n a) -> t m b -> t n b
mapT morph = embed (lift . morph)
I can justify this additional method just by changing the names around and using a type operator:
{-# LANGUAGE Rank2Types, TypeOperators #-}

type a :~> b = forall r . a r -> b r

class MonadM m where
    returnM :: (Monad a, Monad (m a))
            =>  a :~> m a
    bindM   :: (Monad a, Monad (m a), Monad (m b))
            => (a :~> m b) -> (m a :~> m b)

(>|>) :: (Monad a, Monad b, Monad (m b), Monad (m c),
          MonadTrans m)
      => (a :~> m b) -> (b :~> m c) -> (a :~> m c)
(f >|> g) x = bindM g (f x)

joinM :: (Monad (m (m a)), Monad (m a), MonadTrans m)
       => m (m a) :~> m a
joinM = bindM id

fmapM
 :: (Monad a, Monad b, Monad (m a), Monad (m b), MonadTrans m)
 => (a :~> b) -> (m a :~> m b)
fmapM f = bindM (returnM . f)

In otherwords, I've stolen a page from Conor McBride's notebook and defined lift and embed as a higher-order monad in the category of monad morphisms. Going back to the previous names, we can establish that certain laws must hold:
-- Categorical version
lift >|> f = f
f >|> lift = f
(f >|> g) >|> h = f >|> (g >|> h)

-- bind/return version
embed lift m = m
embed f (lift m) = f m
embed g (embed f m) = embed (\x -> embed g (f x)) m

-- join/return version
squash (lift m) = m
squash (mapT lift m) = m
squash (squash m) = squash (mapT squash m)
Obviously, I won't suggest we break the existing MonadTrans class by adding an additional method. All we have to do is simply define a new MonadM class and make all existing monad transformers instances of it and possibly make MonadTrans a super-class of it.

I'll bet more experienced Haskell programmers have wanted mapT or squash in one form or another. The above type-class provides a uniform interface to these operations, so that you don't have to rely on transformer-specific functions like mapStateT or mapMaybeT.

Note that all monad transformers have a sensible instance for MonadM that obeys the above laws. Usually the easiest route is to first define squash (i.e. joinM) and mapT (i.e. fmapM). mapT is usually very straight-forward to write and simply involves type-chasing. squash simply takes the inner monad transformer and combines its operations with the outer monad transformer. Once you define these two, then you can easily define embed:
-- i.e.: (bindM f) = joinM . fmapM f
embed f = squash . mapT f
In the near future I will release a package containing this type-class and appropriate instances for the standard monad transformers. Additionally, the pipes-2.4 release will include an extra motivation for defining the above type-class, besides the obvious utility of having mapT and squash functions.

Saturday, September 15, 2012

The functor design pattern

This post builds on my previous post on the category design pattern and this time I will discuss the functor design pattern. If you are an intermediate Haskell programmer and you think you already understand functors, then think again, because I promise you this post will turn most of your preconceptions about functors on their head and show you how functors are much more powerful and generally applicable than you might realize.


Mixing features


So let's pretend that my previous post on categories inflamed your interest in compositional programming. The first thing you might ask is "Which category do I use?". This seems like a perfectly reasonable question, since you'd like to pick a category that:
  • attracts a lot of mindshare,
  • contains a large library of reusable components,
  • boasts many features, and
  • is simple to use.
Unfortunately, reality says that we seldom get all of the above qualities and they often conflict with one another. For example, let's compare two categories I previously discussed:
  • Ordinary functions of type a -> b that you compose using: (.)
  • Monadic functions of type a -> m b that you compose using: (<=<)
Ordinary functions are simpler to read, write, use, and you can reason about their behavior more easily. However, monadic functions boast more useful features, some of which are indispensable (such as side effects when we use the IO monad). We really need some way to mix these two categories together to get the best of both worlds.

Fortunately, programmers solve compatibility problems like this all the time. We often have tools written in different languages or different frameworks and if we want to mix features from multiple frameworks we write code to bridge between them. So let's solve our category mixing problem by writing an adapter layer between the two categories. We write some function that transforms all the components in one category into components in the other category, so that we can then freely mix components from the two categories.

Typically, one category will be more featureful than the other, so the transformation is unidirectional. Using the above example, monadic functions are strictly more featureful and powerful than ordinary functions. Fortunately, we can promote all ordinary functions to monadic functions using the following map function:
-- This "map"s an ordinary function to a monadic function
map :: (Monad m) => (a -> b) -> (a -> m b)
map f = return . f
... but we cannot write the reverse function and automatically map every monadic function to an ordinary function.

We use map to combine a pure function f and a monadic function g. To do this, we promote f using map and then combine both of them using Kleisli composition:
f     ::              a ->   b
map f :: (Monad m) => a -> m b

g     :: (Monad m) => b -> m c


g <=< map f :: (Monad m) => a -> m c
Perfect! Now we can reuse all of our ordinary functions within this Kleisli category and not have to rewrite anything!

However, there's still a problem. Monad binds are not free and sometimes they get in the way of compiler optimization, so you can imagine that it would be wasteful if we lifted two pure functions in a row:
f     ::              a ->   b
map f :: (Monad m) => a -> m b

g     ::              b ->   c
map g :: (Monad m) => b -> m c

h     :: (Monad m) => c -> m d

-- Wasteful!
h <=< map g <=< map f :: (Monad m) => a -> m d
However, we're smart and we know that we can just optimize those two ordinary functions by using ordinary function composition first before lifting them with map:
-- Smarter!
h <=< map (g . f)
In other words, we assumed that the following transformation should be safe:
map g <=< map f = map (g . f)
Similarly, we expect that if we lift an identity function into a chain of Kleisli compositions:
g <=< map id <=< f
... then it should have no effect. Well, we can easily prove that because:
map id = return . id = return
.. and return is the identity of Kleisli composition, therefore:
f :: (Monad m) => a -> m b
g :: (Monad m) => b -> m c

map id :: (Monad m) => b -> m b

g <=< map id <=< f
= g <=< return <=< f
= g <=< f  :: (Monad m) => a -> m c
Well, we just unwittingly defined our first functor! But where is the functor?


Functors


A functor transforms one category into another category. In the previous section we transformed the category of Haskell functions into the category of monadic functions and that transformation is our functor.

I will notationally distinguish between the two categories in question so I can be crystal clear about the mathematical definition of a functor. I will denote our "source" category's identity as idA and its composition as (._A), and these must obey the category laws:
-- Yes, "._A" is ugly, I know
idA ._A f = f                      -- Left identity

f ._A idA = f                      -- Right identity

(f ._A g) ._A h = f ._A (g ._A h)  -- Associativity
Similarly, I denote the "destination" category's identity as idB and its composition as (._B), which also must obey the category laws:
idB ._B f = f                      -- Left identity

f ._B idB = f                      -- Right identity

(f ._B g) ._B h = f ._B (g ._B h)  -- Associativity
Then a functor uses a function that we will call map to convert every component in the source category into a component in the destination category.

We expect this map function to satisfy two rules:

Rule #1: map must transform the composition operator in the source category to the composition operator in the destination category:
map (f ._A g) = map f ._B map g
This is the "composition law".

Rule #2: map must transform the identity in the source category to the identity in the destination category:
map idA = idB
This is the "identity law".

Together these two rules are the "functor laws" (technically, the covariant functor laws).

In the last section, our source category "A" was the category of ordinary functions:
idA   = id
(._A) = (.)
... and our destination category "B" was the Kleisli category:
idB   = return
(._B) = (<=<)
... and our map function obeyed the functor laws:
map id = return
map (f . g) = map f <=< map g
In other words, functors serve as adapters between categories that promote code written for the source category to be automatically compatible with the destination category. Functors arise every time we write compatibility layers and adapters between different pieces of software.


Functors hidden everywhere


I'll provide a few more examples of functors to tickle people's brains and show how functors arise all the time in your code without you even realizing it. For example, consider the length function:
length :: [a] -> Int
We can treat list concatenation as a category, where:
(.) = (++)
id  = []

[] ++ x = x                    -- Left  identity
x ++ [] = x                    -- Right identity
(x ++ y) ++ z = x ++ (y ++ z)  -- Associativity
Similarly, we can treat addition as a category, where:
(.) = (+)
id  = 0

0 + x = x                  -- Left  identity
x + 0 = x                  -- Right identity
(x + y) + z = x + (y + z)  -- Associativity
Then length is a functor from the category of list concatentation to the category of integer addition:
-- Composition law
length (xs ++ ys) = length xs + length ys

-- Identity law
length [] = 0
Or consider the pipe function from Control.Pipe:
pipe :: (Monad m) => (a -> b) -> Pipe a b m r

-- Composition law
pipe (f . g) = pipe f <+< pipe g

-- Identity law
pipe id = idP
Also, concat defines a functor from one list concatenation to another:
-- Composition
concat (x ++ y) = concat x ++ concat y

-- Identity
concat [] = []
So don't assume that the Functor class is in any way representative of the full breadth of functors.


The Functor class


So far I've deliberately picked examples that do not fit within the mold of Haskell's Functor class to open people's minds about functors. A lot of new Haskell programmers mistakenly believe that functors only encompass "container-ish" things and I hope the previous examples dispel that notion.

However, the Functor class still behaves the same way as the functors I've already discussed. The only restriction is that the Functor class only encompass the narrow case where the source and target categories are both categories of ordinary functions:
class Functor f where
    fmap :: (a -> b) -> (f a -> f b)

fmap (f . g) = fmap f . fmap g  -- Composition law

fmap id = id                    -- Identity law
Haskell Functors recapitulate the themes of compatibility between categories and component reuse. For example, we might have several ordinary functions lying around in our toolbox:
f :: a -> b
g :: b -> c
.. but we need to manipulate lists using functions of type:
h :: [a] -> [c]
Rather than rewrite all our old functions to work on lists, we can instead automatically promote all of them to work on lists using the map function from the Prelude:
map :: (a -> b) -> ([a] -> [b])

map f :: [a] -> [b]
map g :: [b] -> [c]

h = map f . map g :: [a] -> [c]
We know that we can combine two passes over a list into a single pass:
h = map (f . g) :: [a] -> [c]
.. and doing nothing to each element does nothing to the list:
map id = id
Once again, we've just stated the functor laws:
map (f . g) = map f . map g  -- Composition law

map id = id                  -- Identity law
Notice that functors free us from having to write code that targets some monolithic category. Instead, we write all our code using whatever category we deem most appropriate and then promote it as necessary to whatever other categories might need the code we just wrote. This lets us work within focused and specialized categories suitable for their respective tasks rather than waste our time arguing over what category to standardize on.

Another benefit of functors is that they make our code automatically future-proof. We write our components using whatever category we have at our disposal and then as new categories arise we just define new functors to promote our existing code to work within those new categories.


Monad morphisms


Compatibility issues arise all the time between various Haskell frameworks. For example, let's assume I have a sizeable code-base written using the iteratee library, but then I find a really useful library on hackage using enumerator. I would rather not rewrite the enumerator-based library to use iteratee so I instead choose to write an adapter function that allows me to mix the two. I have to define some function, morph, that transforms Iteratees from the iteratee library into Iteratees from the enumerator library:
import qualified Data.Enumerator as E
import qualified Data.Iteratee.Base as I

morph :: I.Iteratee a m b -> E.Iteratee a m b
However, I might suspect that the iteratee library has a faster Monad instance since it uses continuation-passing style (disclaimer: I have no idea if this is true, it's just a hypothetical example). This means that I would like to be able to factor code to use the iteratee library's monad whenever possible:
f :: a -> I.Iteratee s m b

g :: b -> I.Iteratee s m c

h :: c -> E.Iteratee s m d

-- Hypothetically slower, since it uses E.Iteratee's bind
code1 :: a -> E.Iteratee s m d
code1 a = do b <- morph $ f a
             c <- morph $ g b
             h c

-- Hypothetically faster, since it uses I.Iteratee's bind
code2 :: a -> E.Iteratee s m d
code2 a = do c <- morph $ do b <- f a
                             g b
             h c
I would also expect that if I do nothing using enumerator, that it's equivalent to doing nothing using iteratee:
morph $ return x
= return x
Interestingly, we encounter a pattern when we write the above functions using a point-free style:
code1 = h <=< (morph . g) <=< (morph . f)

code2 = h <=< (morph . (g <=< f))

morph . return = return
This pattern seems so familiar...
map :: (a -> I.Iteratee s m b) -> (a -> E.Iteratee s m b)
map = (morph .)

map (f <=< g) = map f <=< map g  -- Composition law

map return = return              -- Identity law
Oh, I've accidentally defined a functor! This time both the source and destination categories are Kleisli categories and the functor preserves both the composition and identity correctly.

Category theorists have a very specific name for the above pattern: a monad morphism. Specifically, a monad morphism is any function:
morph :: (Monad m, Monad n) => forall r . m r -> n r
... such that map = (morph .) defines a functor between two Kleisli categories:
map :: (Monad m, Monad n) => (a -> m b) -> (a -> n b)
map = (morph .)
Also, intermediate Haskell programmers will recognize a subtle variation on this pattern:
lift :: (Monad m, MonadTrans t) => m r -> t m r

(lift .) :: (Monad m, MonadTrans t) => (a -> m b) -> (a -> t m b)

-- Identity law
(lift .) return = return

-- Composition law
(lift .) (f >=> g) = (lift .) f >=> (lift .) g
These are just the monad transformer laws! However, they are usually written in this form:
lift $ return x = return x

lift $ do y <- f x
          g y
= do y <- lift $ f x
     lift $ g y
In other words, monad transformers are a special subset of monad morphisms and the monad transformer laws are just the functor laws in disguise!

Now, every time you use a monad transformer you can appreciate that you are using a functor as an adapter layer between two categories: the base monad's Kleisli category and the transformed monad's Kleisli category.


Conclusion


The functor design pattern embodies a philosophy of programming that emphasizes:
  • compatibility over standardization,
  • specialization over monolithic frameworks, and
  • short-term completion over future-proofing.
However, the above tenets, while popular, haven't completely taken hold because we associate:
  • compatibility with cruft,
  • specialization with fragmentation, and
  • short-term completion with lack of foresight.
In a future post I will discuss how the functor laws mitigate these problems and allow you to layer on as many abstractions over as many tools as you wish without the abstractions collapsing under their own weight.

Functors don't even necessarily need to be within a single programming language. A programmer could even use the category design pattern in completely separate programming languages and then use the functor design pattern to bridge components written in one language to another. Please don't limit your imagination to just the examples I gave!

However, the functor design pattern doesn't work at all if you aren't using categories in the first place. This is why you should structure your tools using the compositional category design pattern so that you can take advantage of functors to easily mix your tools together. This is true whether you are programming in one language or several languages. As long as each tool forms its own category and you obey the functor laws when switching between them, you can be very confident that all your tools will mix correctly.