Friday, April 4, 2014

Scalable program architectures

Haskell design patterns differ from mainstream design patterns in one important way:

  • Conventional architecture: Combine several components of type A together to generate a "network" or "topology" of type B

  • Haskell architecture: Combine several components of type A together to generate a new component of the same type A, indistinguishable in character from its substituent parts

This distinction affects how the two architectural styles evolve as code bases grow.

The conventional architecture requires layering abstraction on top of abstraction:

Oh no, these Bs are not connectable, so let's make a network of Bs and call that a C.

Well, I want to assemble several Cs, so let's make a network of Cs and call that a D

...

Wash, rinse, and repeat until you have an unmanageable tower of abstractions.

With a Haskell-style architecture, you don't need to keep layering on abstractions to preserve combinability. When you combine things together the result is still itself combinable. You don't distinguish between components and networks of components.

In fact, this principle should be familiar to anybody who knows basic arithmetic. When you combine a bunch of numbers together you get back a number:

3 + 4 + 9 = 16

Zero or more numbers go in and exactly one number comes out. The resulting number is itself combinable. You don't have to learn about "web"s of numbers or "web"s of "web"s of numbers.

If elementary school children can master this principle, then perhaps we can, too. How can we make programming more like addition?

Well, addition is simple because we have (+) and 0. (+) ensures that we can always convert more than one number into exactly one number:

(+) :: Int -> Int -> Int

... and 0 ensures that we can always convert less than one number into exactly one number by providing a suitable default:

0 :: Int

This will look familiar to Haskell programmers: these type signatures resemble the methods of the Monoid type class:

class Monoid m where
    -- `mappend` is analogous to `(+)`
    mappend :: m -> m -> m

    -- `mempty` is analogous to `0`
    mempty  :: m

In other words, the Monoid type class is the canonical example of this Haskell architectural style. We use mappend and mempty to combine 0 or more ms into exactly 1 m. The resulting m is still combinable.
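
To make the analogy concrete, here is a small sketch using the Sum newtype from Data.Monoid, which selects the additive Monoid for numbers:

import Data.Monoid (Sum(..))

-- 3 + 4 + 9 = 16, phrased through the `Monoid` interface
total :: Sum Int
total = Sum 3 `mappend` Sum 4 `mappend` Sum 9
-- Sum {getSum = 16}

-- `mconcat` combines zero or more values; with no inputs you just get `mempty`
none :: Sum Int
none = mconcat []
-- Sum {getSum = 0}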

Not every Haskell abstraction implements Monoid, nor does it have to, because category theory takes this basic Monoid idea and generalizes it to more powerful domains. Each generalization retains the same basic principle of preserving combinability.

For example, a Category is just a typed monoid, where not all combinations type-check:

class Category cat where
    -- `(.)` is analogous to `(+)`
    (.) :: cat b c -> cat a b -> cat a c

    -- `id` is analogous to `0`
    id  :: cat a a

... a Monad is like a monoid where we combine functors "vertically":

-- Slightly modified from the original type class
class Functor m => Monad m where
    -- `join` is analogous to `(+)`
    join :: m (m a) -> m a

    -- `return` is analogous to `0`
    return :: a -> m a

... and an Applicative is like a monoid where we combine functors "horizontally":

-- Greatly modified, but equivalent to, the original type class
class Functor f => Applicative f where
    -- `mult` is analogous to `(+)`
    mult :: f a -> f b -> f (a, b)

    -- `unit` is analogous to `0`
    unit :: f ()

Category theory is full of generalized patterns like these, all of which try to preserve that basic intuition we had for addition. We convert more than one thing into exactly one thing using something that resembles addition and we convert less than one thing into exactly one thing using something that resembles zero. Once you learn to think in terms of these patterns, programming becomes as simple as basic arithmetic: combinable components go in and exactly one combinable component comes out.
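
For instance, here is a small sketch of the Category instance for ordinary functions, where composing components just yields another component of the same shape:

import Prelude hiding ((.), id)
import Control.Category

increment :: Int -> Int
increment = (+ 1)

double :: Int -> Int
double = (* 2)

-- Composing functions gives back another function, still ready to compose
incrementThenDouble :: Int -> Int
incrementThenDouble = double . increment . id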

These abstractions scale limitlessly because they always preserve combinability, so we never need to layer further abstractions on top. This is one reason why you should learn Haskell: you learn how to build flat architectures.

Tuesday, April 1, 2014

Worst practices are viral for the wrong reasons

This short post describes my hypothesis for why poor software engineering practices are more viral than good ones. Simply stated:

Worst practices spread more quickly because they require more attention

The following two sections outline specific manifestations of this problem and how they promote the dissemination of worst practices.

Management

Corollary 1: Teams with the greatest technical debt mentor the most employees.

Scenario 1: You manage a team that ships an important feature but with a poor execution. You can make a compelling case to management that your team needs more head count to continue supporting this feature. More employees reporting to you increases your chances of promotion and you enjoy rewarding opportunities to mentor new software engineers in your specific brand of programming.

Scenario 2: You manage a team that ships an important feature in the exact same amount of time but with an amazing execution. Your code requires little maintenance, so little in fact that your team is now obsolete. If you're lucky you are retasked to work on a new project; if you're unlucky you are fired because your skills are no longer relevant. Training others how to write software of excellent quality is out of the question either way.

In other words, software engineers that produce excellent code cannot easily pass on their wisdom to the next generation of programmers unless they go into academia.

Open source projects

Corollary 2: Poorly implemented libraries or programming languages generate more buzz.

Scenario 1: You write a useful library or programming language, but it is incredibly buggy and problematic. This generates several scathing blog posts criticizing your work, prompting others to clarify in response how to work around those issues. Don't forget to check Stack Overflow, which is full of people asking for help regarding how to use your project. Did you know that people measure popularity in terms of Stack Overflow questions?

Scenario 2: You release a useful library or programming language at the same time, except that through some miracle you get it correct on the first try. Version 1.0 is your first and last release. That means you can't advertise your library through announcement posts for subsequent releases and there are only so many ways that you can say "it works" before people accuse you of shameless advertising.

In other words, well-written open source projects can become victims of their own success, with fewer opportunities to market themselves.

Conclusion

Obviously the above examples are hyperbole and reality is somewhere in the middle. There is no perfect software project or team, and there is a limit to how much crap a company or user will tolerate.

However, I write this because I value simple and understandable software and I want to start a discussion on how software engineering can culturally reward correctness and reliability. We need to think of ways to encourage programming excellence to spread.

Tuesday, March 25, 2014

Introductions to advanced Haskell topics

Many people bemoan the sharp divide between experts and beginning Haskell programmers. One thing I've noticed is that "advanced" Haskell topics all have one thing in common: there exists only one good tutorial on that topic and only the experts have found it. This post is a collection of links to what I consider to be the definitive tutorials on these topics.

Monads

Monads are technically not an advanced Haskell topic, but I include this topic in the list because the vast majority of monad tutorials are terrible and/or misleading. In my opinion, the best monad tutorial (Titled: "You could have invented monads") was written 8 years ago by Dan Piponi and nobody has ever improved on it since.

Monad Transformers

One thing that perpetually boggles me is that the blogosphere has almost no posts explaining how to use monad transformers. In fact, the only good explanation I know of is a very simple paper by Martin Grabmuller (Titled: "Monad Transformers - Step by Step"). Most beginners will skip over an academic paper like this when searching for tutorials since most people don't associate the word "academic" with beginner-friendly. However, don't let that fool you: this paper is very well written and easy to read.

Parsing

Outsiders comment that monads only exist to paper over Haskell's weaknesses until they discover monadic parsers. The truly mind-blowing moment is when you learn how simple they are to implement when you read Hutton and Meijer's functional pearl on monadic parsing (Titled: "Monadic Parsing in Haskell"). They walk through how to implement a backtracking parser, how to implement the Monad type class for this parser, and how to define several parsing primitives and chain them together into more sophisticated functionality. After you read this functional pearl you will never doubt the significance of monads ever again.

You also want to understand parser combinators for another reason: Haskell parser combinator libraries are light-years ahead of Haskell regular expression libraries. This confuses many people who come to Haskell from a language that uses regular expressions heavily (such as Python or Perl) and they wonder why regular expression support is so bad in Haskell. The reason is that nearly all Haskell programmers use parser combinator libraries such as parsec or attoparsec.
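
To give a flavor of the Hutton-and-Meijer style, here is a rough sketch of such a parser (simplified, and not the exact code from the paper):

newtype Parser a = Parser { runParser :: String -> [(a, String)] }

instance Functor Parser where
    fmap f (Parser p) = Parser (\s -> [ (f a, rest) | (a, rest) <- p s ])

instance Applicative Parser where
    pure a = Parser (\s -> [(a, s)])
    Parser pf <*> Parser px =
        Parser (\s -> [ (f x, s'') | (f, s') <- pf s, (x, s'') <- px s' ])

instance Monad Parser where
    return = pure
    Parser p >>= f =
        Parser (\s -> concat [ runParser (f a) s' | (a, s') <- p s ])

-- A primitive parser that consumes a single character
item :: Parser Char
item = Parser (\s -> case s of
    []     -> []
    (c:cs) -> [(c, cs)])

-- `do` notation now sequences parsers, threading the unconsumed input
two :: Parser (Char, Char)
two = do
    a <- item
    b <- item
    return (a, b)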

Free Monads

This is the only case where I will plug a post of my own. I wrote a tutorial on free monads (Titled: "Why free monads matter") because back when I was trying to understand free monads I could not find any beginner-friendly explanation. Everything on this subject was written for a target audience of mathematicians and would motivate their use cases using mathematical examples.

I wanted to explain the motivation behind free monads in terms that a programmer would understand and my post explains the topic in terms of creating a syntax tree and interpreting that syntax tree. Understanding free monads will greatly improve your understanding for how to build monadic domain-specific languages within Haskell. Also, many people comment that understanding free monads greatly improves their mental model for Haskell IO as well.
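
For reference, the core data type is tiny; roughly (a sketch that glosses over the exact class hierarchy that the free package uses):

data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
    fmap g (Pure a)  = Pure (g a)
    fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
    pure = Pure
    Pure g  <*> x = fmap g x
    Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
    return = pure
    Pure a  >>= g = g a
    Free fa >>= g = Free (fmap (>>= g) fa)

-- `Pure` ends the syntax tree and `Free` adds one more layer of instructions,
-- so a `Free f a` value is just a tree that an interpreter can later walk.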

Coroutines

Haskell coroutine libraries (like pipes/conduit/machines) have what appear to be impenetrable implementations ... until you read Mario Blazevic's article on coroutines in issue 19 of The Monad Reader (Titled: "Coroutine Pipelines").

In that article he introduces the Coroutine type (which is now more commonly known as the free monad transformer) which unifies every coroutine implementation into a single pattern. Once you understand his article, you will understand the internals of coroutine libraries and you will be able to easily write a streaming library of your own.
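
The central type from that article is roughly the following (a sketch; the monad-coroutine package spells it in essentially this form):

-- A computation in `m` that either finishes with a result `r` or suspends,
-- exposing the rest of the coroutine through the suspension functor `s`
newtype Coroutine s m r = Coroutine
    { resume :: m (Either (s (Coroutine s m r)) r) }

-- Different suspension functors give different coroutine flavors, e.g.:
data    Yield a x = Yield a x          -- a generator that emits an `a` and continues
newtype Await a x = Await (a -> x)     -- a consumer that waits for an `a` to continue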

Lenses

Lens tutorials are the new monad tutorials, and the one mistake that all lens tutorials make is that none of them clearly explain how lenses work. The only post that ever got this right was the original post by Twan van Laarhoven that first introduced the modern formulation of lenses (Titled: "CPS Functional references"). The problem is that the post is incredibly hard to find, even through a dedicated Google search for lens tutorials, because the title is obscure and the post only mentions the word "lens" once. I actually had trouble finding this post even by browsing his blog archive because I didn't recognize the title.

Honorable mention: The lens-family-core library is the only lens library that actually hews closely to Twan's original presentation, so I highly recommend that you study its source code to further improve your understanding.
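
For reference, the heart of the van Laarhoven formulation fits in a few lines (a sketch of the simple, type-preserving case):

{-# LANGUAGE RankNTypes #-}

import Control.Applicative (Const(..))
import Data.Functor.Identity (Identity(..))

-- A lens is a single function, polymorphic in the choice of Functor
type Lens s a = forall f. Functor f => (a -> f a) -> s -> f s

-- Choosing `Const` turns the lens into a getter ...
view :: Lens s a -> s -> a
view l s = getConst (l Const s)

-- ... and choosing `Identity` turns the same lens into a setter
over :: Lens s a -> (a -> a) -> s -> s
over l f s = runIdentity (l (Identity . f) s)

-- Example: a lens focusing on the first component of a pair
_1 :: Lens (a, b) a
_1 f (a, b) = fmap (\a' -> (a', b)) (f a)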

Church encoding

I still can't find a good blog post on Church encoding, so I will not link to anything. However, once you learn how to Church encode a data type you will suddenly see this pattern everywhere.

Equally important, many high-performance Haskell libraries (like attoparsec) use Church encoding for efficiency reasons and understanding Church encoding will allow you to understand their internal implementations.
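
In lieu of a link, here is the basic idea in miniature, sketched with Maybe as the example:

{-# LANGUAGE RankNTypes #-}

-- A Church-encoded `Maybe`: the value is represented by how you consume it
newtype MaybeC a = MaybeC { runMaybeC :: forall r. r -> (a -> r) -> r }

nothingC :: MaybeC a
nothingC = MaybeC (\nothing _just -> nothing)

justC :: a -> MaybeC a
justC a = MaybeC (\_nothing just -> just a)

-- Converting back to the ordinary data type
toMaybe :: MaybeC a -> Maybe a
toMaybe m = runMaybeC m Nothing Just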

Conclusion

I'm really disappointed that so many critical Haskell topics only have a single source of truth that is difficult to find. I hypothesize that the problem is that Haskell is rooted in computer science academia, which discourages duplicate publications on the same topic. If you want to prove me wrong, then please take the time to increase the number of high quality Haskell tutorials on these and other key topics.

Another troubling trend is the tendency for people to bandwagon on particular tutorial topics that are sexy (first monads, now lenses). I wish people would spend more time diversifying coverage of more mundane topics, like how to apply well-established libraries to architect various types of applications.

Monday, March 3, 2014

How to model handles with pipes

I keep receiving questions about how to implement a Handle-like API similar to io-streams using pipes, even though I outlined how to do this about a year ago. I don't blame these people, because that tutorial uses an out-of-date (and uglier) pipes-3.2 API and takes an approach that I now feel was a bit inconsistent and missed the mark. This post is a rewrite of the original, both to upgrade the code to the latest pipes-4.1 API and to make some key changes to the original idioms to improve conceptual consistency.

I usually find that the best way to introduce this Handle-like intuition for pipes is to make an analogy to io-streams, so this post assumes that you are familiar with both pipes and io-streams. If not, then you can read the pipes tutorial or the io-streams tutorial, both of which I wrote! I want to really emphasize that I mainly discuss io-streams as a stepping stone for understanding the equivalent pipes abstractions, but the purpose is not to reimplement io-streams in pipes. Rather I want to introduce a more general IO-free abstraction that just happens to overlap a bit with io-streams.


Types


I'll begin with the correspondence between io-streams and pipes types:
import Pipes

type InputStream a = Effect IO (Maybe a)

type OutputStream a = Maybe a -> Effect IO ()

makeInputStream :: IO (Maybe a) -> InputStream a
makeInputStream = lift

makeOutputStream :: (Maybe a -> IO ()) -> OutputStream a
makeOutputStream f = lift . f
For example, the pipes analogs of stdin and stdout from io-streams would be:
import qualified Data.ByteString as B
import qualified System.IO as IO
import qualified Data.Foldable as F

stdin :: InputStream B.ByteString
stdin = makeInputStream $ do
    bs <- B.hGetSome IO.stdin 4096
    return (if B.null bs then Nothing else Just bs)

stdout :: OutputStream B.ByteString
stdout = makeOutputStream (F.mapM_ B.putStr)
Notice how this does not convert the read and write actions to repetitive streams. Instead, you just trivially wrap them using lift, promoting them to Effect. However, for the rest of this post I will use Effect instead of InputStream and OutputStream to avoid confusion between the two libraries.

Also, the pipes implementations of makeInputStream and makeOutputStream are pure, whereas their io-streams counterparts require IO. In fact, io-streams fudges a little bit and implements stdin and stdout in terms of unsafePerformIO, otherwise they would have these types:
stdin  :: IO (InputStream  ByteString)
stdout :: IO (OutputStream ByteString)
io-streams is tightly integrated with IO (thus the name), but I believe we can do better.


Reading/Writing


The pipes analogs of the io-streams read and write commands are await and yield:
import Prelude hiding (read)

read :: Monad m => Consumer (Maybe a) m (Maybe a)
read = await

write :: Monad m => Maybe a -> Producer (Maybe a) m ()
write = yield
Here are examples of how you might use them:
import Control.Applicative
import Data.Monoid

-- `Consumer (Maybe a) m (Maybe b)` is the analog of
-- `InputStream a -> IO (InputStream b)`
combine
    :: Monad m
    => Consumer (Maybe B.ByteString) m (Maybe B.ByteString)
combine = do
    mstr1 <- read
    mstr2 <- read
    return $ liftA2 (<>) mstr1 mstr2

-- `Maybe a -> Producer (Maybe b) m ()` is the analog of
-- `OutputStream b -> IO (OutputStream a)`
duplicate :: Monad m => Maybe a -> Producer (Maybe a) m ()
duplicate ma = do
    write ma
    write ma
Compare to io-streams:
combine :: InputStream B.ByteString -> IO (Maybe B.ByteString)
combine is = do
    mbs1 <- read is
    mbs2 <- read is
    return $ liftA2 (<>) mbs1 mbs2

duplicate :: OutputStream a -> a -> IO ()
duplicate os a = do
    write a os
    write a os
Unlike io-streams, pipes does not need to specify the upstream source for read commands. This is because we can supply read with an input stream using (>~):
>>> runEffect $ stdin >~ read
Test<Enter>
Just "Test\n"
>>> runEffect $ stdin >~ combine
Hello<Enter>
world!<Enter>
Just "Hello\nworld!\n"
In fact, the read command in the first example is totally optional, because read is just a synonym for await, and one of the pipes laws states that:
f >~ await = f
Therefore, we could instead just write:
>>> runEffect stdin
Test<Enter>
Just "Test\n"
Dually, we don't need to specify the downstream destination for write. This is because you connect write with an output stream using (~>):
>>> :set -XOverloadedStrings
>>> runEffect $ (write ~> stdout) (Just "Test\n")
Test
>>> runEffect $ (duplicate ~> stdout) (Just "Test\n")
Test
Test
Again, the write in the first example is optional, because write is just a synonym for yield and one of the pipes laws states that:
yield ~> f = f
Therefore, we can simplify it to:
>>> runEffect $ stdout (Just "Test")
Test

Mixed reads and writes


Like io-streams, we do not need to specify the entire streaming computation up front. We can step these input and output streams one value at a time using runEffect instead of consuming them all in one go. However, you lose no power and gain a lot of elegance by writing everything entirely within pipes.

For example, just like io-streams, you can freely mix reads and writes from multiple diverse sources:
inputStream1 :: Effect IO (Maybe A)
inputStream2 :: Effect IO (Maybe B)

mixRead :: Effect IO (Maybe A, Maybe B)
mixRead = do
    ma <- inputStream1 >~ read
    mb <- inputStream2 >~ read
    return (ma, mb)

outputStream1 :: Maybe a -> Effect IO ()
outputStream2 :: Maybe a -> Effect IO ()

mixWrite :: Maybe a -> Effect IO ()
mixWrite ma = do
    (write ~> outputStream1) ma
    (write ~> outputStream2) ma
Oh, wait! Didn't we just get through saying that you can simplify these?
-- Reminder:
-- f >~  read = f
-- write ~> f = f

mixRead :: Effect IO (Maybe A, Maybe B)
mixRead = do
    ma <- inputStream1
    mb <- inputStream2
    return (ma, mb)

mixWrite :: Maybe a -> Effect IO ()
mixWrite ma = do
    outputStream1 ma
    outputStream2 ma
In other words, if we stay within pipes we can read and write from input and output streams just by directly calling them. Unlike io-streams, we don't need to use read and write when we select a specific stream. We only require read and write if we wish to inject the input or output stream later on.


Transformations


What's neat about the pipes idioms is that you use the exact same API to implement and connect transformations on streams. For example, here's how you implement the map function from io-streams using pipes:
map :: Monad m => (a -> b) -> Consumer (Maybe a) m (Maybe b)
map f = do
    ma <- read
    return (fmap f ma)
Compare that to the io-streams implementation of map:
map :: (a -> b) -> InputStream a -> IO (InputStream b)
map f is = makeInputStream $ do
    ma <- read is
    return (fmap f ma)
Some differences stand out:
  • The pipes version does not need to explicitly thread the input stream
  • The pipes version does not require makeInputStream
  • The pipes transformation is pure (meaning no IO)
Now watch how we apply this transformation:
>>> runEffect $ stdin >~ map (<> "!")
Test<Enter>
Just "Test!\n"
The syntax for implementing and connecting map is completely indistinguishable from the syntax we used in the previous section to implement and connect combine. And why should there be a difference? Both of them read from a source and return a derived value. Yet, io-streams distinguishes between things that transform input streams and things that read from input streams, both in the types and implementation:
iReadFromAnInputStream :: InputStream a -> IO b
iReadFromAnInputStream is = do
    ma <- read is
    ...

iTransformAnInputStream :: InputStream a -> IO (InputStream b)
iTransformAnInputStream is = makeInputStream $ do
    ma <- read is
    ...
This means that io-streams code cannot reuse readers as input stream transformations or, vice versa, reuse input stream transformations as readers.

As you might have guessed, the dual property holds for writers and output stream transformations. In pipes, we model both of these using a Producer Kleisli arrow:
contramap
    :: Monad m => (a -> b) -> Maybe a -> Producer (Maybe b) m ()
contramap f ma = yield (fmap f ma)
... and we connect them the same way:
>>> runEffect $ (contramap (<> "!\n") ~> stdout) (Just "Test")
Test!
... whereas io-streams distinguishes these two concepts:
iWriteToAnOutputStream :: OutputStream b -> Maybe a -> IO ()
iWriteToAnOutputStream os ma = do
    write os ma
    ...
    
iTransformAnOutputStream :: OutputStream b -> IO (OutputStream a)
iTransformAnOutputStream os = makeOutputStream $ \ma -> do
    write os ma
    ...

End of input


None of this machinery is specific to the Maybe type. I only used Maybe for analogy with io-streams, which uses Maybe to signal end of input. For the rest of this post I will just drop the Maybes entirely to simplify the type signatures, meaning that I will model the relevant types as:
inputStream :: Effect m a

outputStream :: a -> Effect m ()

inputStreamTransformation :: Consumer a m b

outputStreamTransformation :: a -> Producer b m ()
This is also the approach that my upcoming pipes-streams library will take, because pipes does not use Maybe to signal end of input. Instead, if you have a finite input you represent that input as a Producer:
finiteInput :: Producer a m ()
What's great is that you can still connect this to an output stream, using for:
for finiteInput outputStream :: Effect m ()
Dually, you model a finite output as a Consumer:
finiteOutput :: Consumer a m ()
... and you can connect that to an input stream using (>~):
inputStream >~ finiteOutput :: Effect m ()
These provide a convenient bridge between classic pipes abstractions and handle-like streams.
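
As a tiny, runnable illustration of both connections (the function names here are just for the example):

import Control.Monad (replicateM_)
import Pipes

-- A finite input is a `Producer`; `for` feeds it to an output stream
printFiniteInput :: IO ()
printFiniteInput = runEffect $ for (each [1 :: Int, 2, 3]) (lift . print)

-- A finite output is a `Consumer`; `(>~)` feeds it from an input stream
printFiniteOutput :: IO ()
printFiniteOutput = runEffect $ lift readLn >~ consumer
  where
    consumer :: Consumer Int IO ()
    consumer = replicateM_ 3 (await >>= lift . print)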


Polymorphism


Interestingly, we can use the same (>~) operator to:
  • connect input streams to input stream transformations, and:
  • connect input stream transformations to each other.
This is because (>~) has a highly polymorphic type which you can specialize down to two separate types:
-- Connect an input stream to an input stream transformation,
-- returning a new input stream
(>~) :: Monad m
     => Effect     m a
     -> Consumer a m b
     -> Effect     m b

-- Connect two input stream transformations together,
-- returning a new input stream transformation
(>~) :: Monad m
     => Consumer a m b
     -> Consumer b m c
     -> Consumer a m c
Dually, we can use the same (~>) operator to:
  • connect output stream transformations to output streams, and:
  • connect output stream transformations to each other.
Again, this is because (~>) has a highly polymorphic type which you can specialize down to two separate types:
-- Connect an output stream transformation to an output stream,
-- returning a new output stream
(~>) :: Monad m
     => (a -> Producer b m ())
     -> (b -> Effect     m ())
     -> (a -> Effect     m ())
     
-- Connect two output stream transformations together,
-- returning a new output stream transformation
(~>) :: Monad m
     => (a -> Producer b m ())
     -> (b -> Producer c m ())
     -> (a -> Producer c m ())
In fact, you will see these specialized type signatures in the documentation for (>~) and (~>). Hopefully, the analogy to input streams and output streams will help you see those types in a new light.


Categories


People familiar with pipes know that await/read and (>~) form a category, so we get three equations for free from the three category laws:
-- Reading something = calling it
read >~ f = f

-- `read` is the identity input stream transformation
f >~ read = f

-- It doesn't matter how you group sinks/transformations
(f >~ g) >~ h = f >~ (g >~ h)
Dually, yield/write and (~>) form a category, so we get another three equations for free:
-- Writing something = calling it
f ~> write = f

-- `write` is the identity output stream transformation
write ~> f = f

-- It doesn't matter how you group sources/transformations
(f ~> g) ~> h = f ~> (g ~> h)
Those are really nice properties to use when equationally reasoning about handles and as we've already seen they allow you to simplify your code in unexpected ways.


Bidirectionality


But wait, there's more! pipes actually has a fully bidirectional core in the Pipes.Core module. You can generalize everything above to use this bidirectional interface with surprising results. For example, in the bidirectional version of pipes-based streams, you can have streams that both accept input and produce output upon each step:
inputOutputStream :: a -> Effect m b
You also get the following generalized versions of all commands and operators:
  • request generalizes read, accepting an argument for upstream
  • respond generalizes write, returning a value from downstream
  • (\>\) generalizes (>~)
  • (/>/) generalizes (~>)
I will leave that as a teaser for now and I will discuss this in greater detail once I finish writing a new tutorial for the underlying bidirectional implementation.


Conclusion


pipes provides an elegant way to model Handle-like abstractions in a very compact way that requires no new extensions to the API. I will soon provide a pipes-streams library that implements utilities along these lines to further solidify these concepts.

One thing I hope people take away from this is that a Handle-based API need not be deeply intertwined with IO. All the commands and connectives from pipes-based handles are completely independent of the base monad and you can freely generalize the pipes notion of handles to pure monads.

Saturday, February 22, 2014

Reasoning about stream programming

This post answers a question people sometimes ask me about pipes, which I will paraphrase here:

If resource management is not a core focus of pipes, why should I use pipes instead of lazy IO?

Many people who ask this question discovered stream programming through Oleg, who framed the lazy IO problem in terms of resource management. However, I never found this argument compelling in isolation; you can solve most resource management issues simply by separating resource acquisition from the lazy IO, like this:

import System.IO

main = withFile "example.txt" ReadMode $ \handle -> do
    str <- hGetContents handle
    doSomethingWith str

You can now easily reason about resource management: the handle lifetime matches the duration of the callback provided to withFile. This solves 95% of resource management scenarios, and libraries like resourcet/conduit/pipes-safe handle the more exotic cases.

If lazy IO were only a problem for 5% of all use cases, I wouldn't object to its use. Rather, resource management issues are symptoms of a larger problem: lazy IO conflicts with equational reasoning about ordering of effects.

The issue with lazy IO

Lazy IO is teaching an entire generation of Haskell programmers to "not think very hard" about ordering of side effects. Some consider this a feature, but I believe it is a disadvantage because it conflicts with the original spirit of Haskell IO: preserving equational reasoning. Lazy IO ties the effect order to evaluation order, and equational reasoning does not usually preserve evaluation order.

In my eyes, Haskell's treatment of IO was one of the major revolutions in programming language theory, because for the first time ever one could formally reason about side effects within the language rather than building an external formal model. This empowered lay programmers to reason about side effects using simple substitution and greatly lowered the entry barrier to formal reasoning in general.

Consequently, I was very disappointed to see Haskell squander this immense progress by teaching new programmers to abandon reasoning about ordering of side effects through the use of lazy IO. We should be democratizing computer science by helping the average programmer be precise, rather than encouraging imprecision and instructing newcomers to defer precision to the experts.
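
To make the problem concrete, here is a deliberately contrived sketch where the order of effects depends on when the string is evaluated rather than on the order in which the statements appear:

import System.IO

main :: IO ()
main = do
    handle   <- openFile "example.txt" ReadMode
    contents <- hGetContents handle  -- no file contents are actually read yet
    hClose handle                    -- closing truncates whatever was not yet read
    putStrLn contents                -- usually prints nothing, despite the file having content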

Pipes

I built pipes to restore equational reasoning to stream programming. Besides eschewing lazy IO, pipes also promotes equational reasoning in other ways:

  • pipes obeys several laws inspired by category theory that simplify equational rearrangements
  • The core pipes implementation is minimal
  • pipes extensions are pipes-independent, so you can reason about them orthogonally

pipes does not come with every bell and whistle pre-installed because pipes is a minimal substrate upon which other streaming abstractions can be built. This reflects the Haskell ethos of only using the minimum power necessary for the task at hand, which improves correctness and minimizes surprise.

Therefore, you can think of pipes as a gateway drug to equationally reasoning about stream processing. pipes empowers you to be correct and rigorous so that you can spend less time chasing bugs and more time releasing features.

Those who are interested in equationally reasoning about stream programming can read the pipes tutorial, which illustrates how many common streaming abstractions that we take for granted possess beautiful equational properties that we can harness for real applications.

Saturday, February 8, 2014

pipes-http-1.0: Streaming HTTP/HTTPS clients

The pipes-http package now provides an HTTP client for pipes. This was made possible by Michael Snoyman, who released http-client and http-client-tls, which are a conduit-independent subset of http-conduit. I wanted to thank Michael for releasing those packages, which greatly simplified my job. I even get TLS support for free thanks to him, which is incredible.

I've chosen to write the thinnest layer possible, only providing functions to stream in request bodies and stream out response bodies. Everything else is re-exported from http-client and http-client-tls. Hopefully this will encourage shared contributions to those libraries from both pipes and conduit users.
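
For example, a minimal sketch of streaming a response body straight to standard output would look something like this (the URL is just a placeholder):

import Pipes
import Pipes.HTTP
import qualified Pipes.ByteString as PB

main :: IO ()
main = do
    req <- parseUrl "http://www.example.com"
    withManager defaultManagerSettings $ \m ->
        withHTTP req m $ \resp ->
            -- `responseBody resp` is a `Producer ByteString IO ()`
            runEffect $ responseBody resp >-> PB.stdout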

I'll show the library in action with an example of how you would query and format the titles of the top 10 posts on /r/haskell:

-- haskell-reddit.hs

{-# LANGUAGE OverloadedStrings #-}

import Control.Lens (_Right, (^..))
import Control.Lens.Aeson (key, values, _String)
import Control.Monad.Trans.State.Strict (evalStateT)
import Data.Aeson.Parser (json')
import qualified Data.Text    as T
import qualified Data.Text.IO as T
import Pipes.Attoparsec (parse)
import Pipes.HTTP

main = do
    req <- parseUrl "http://www.reddit.com/r/haskell.json"
    withManager defaultManagerSettings $ \m ->
        withHTTP req m $ \resp -> do
            json <- evalStateT (parse json') (responseBody resp)

            let titles :: [T.Text]
                titles = json  ^.. _Right
                                 . key "data"
                                 . key "children"
                                 . values
                                 . key "data"
                                 . key "title"
                                 . _String

            mapM_ (T.putStrLn . format) (take 10 titles)

format :: T.Text -> T.Text
format txt =
    if   T.length txt <= columns
    then T.concat [bullet,                txt          ]
    else T.concat [bullet, T.take columns txt, ellipsis]
  where
    bullet   = "[*] "
    ellipsis = "..."
    columns = 60 - (T.length bullet + T.length ellipsis)

This example uses pipes-attoparsec to parse the response body as JSON and lens-aeson to query the JSON value for titles:

$ ./haskell-reddit
[*] My Haskell will
[*] What are some common "design patterns" when it comes ...
[*] The Resumption and Reactive Resumption Monads
[*] `cabal install` failing in Vagrant
[*] A Lazy Evaluation - Why I Find Learning Haskell Hard
[*] The Essence Of Reynolds
[*] A test driven haskell course
[*] Wheb -- A WAI framework.
[*] I wrote a simple <200 line Forth interpreter. Does...
[*] GHC iOS 7.8 RC1
$

You can find pipes-http on Hackage or on Github.

Wednesday, February 5, 2014

pipes-parse-3.0: Lens-based parsing

pipes-parse-3.0.0 introduces a new lens-based parsing mechanism. These lenses improve the library in two ways:

  • They greatly simplify the API

  • They correctly propagate leftovers further upstream

Overview

The new parsing API consists of three main types, which roughly parallel the three pipes abstractions:

  • Producers, which are unchanged from pipes

    Producer a m x
  • Parsers, which are the parsing analog of Consumers:

    type Parser a m r = forall x . StateT (Producer a m x) m r
  • Lens'es between Producer's, which are the parsing analog of Pipes:

    Lens' (Producer a m x) (Producer b m y)

What's neat is that pipes-parse does not need to provide any operators to connect these three abstractions. All of the tools you need already exist in transformers and lens (or lens-family-core, if you prefer a simpler lens alternative).

For example, you connect a Parser to a Producer using either runStateT, evalStateT, or execStateT:

                                            +- Result
                                            |
runStateT                                   v
    :: Parser a m r -> Producer a m x -> m (r, Producer a m x)
evalStateT
    :: Parser a m r -> Producer a m x -> m  r
execStateT
    :: Parser a m r -> Producer a m x -> m (   Producer a m x)
                       ^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^
                        Parser Input              Leftovers

These correspond to the three possible ways you might want to run a parser:

  • runStateT: Return the result and leftovers
  • evalStateT: Return only the result
  • execStateT: Return only the leftovers

In fact, two of these functions closely parallel conduit operators:

  • runStateT parallels ($$+)
  • evalStateT parallels ($$)

You can also connect Producers to Lens'es using (^.) or view:

(^.) :: Producer a m x
     -> Lens' (Producer a m x) (Producer b m y)
     -> Producer b m y

(^.) parallels conduit's ($=) operator.

If you want to connect Lens'es to Parsers, you use zoom:

zoom :: Lens' (Producer a m x) (Producer b m y)
     -> Parser b m r
     -> Parser a m r

zoom parallels conduit's (=$) operator.

Finally, you connect Lens'es to each other using (.) (i.e. function composition):

(.) :: Lens' (Producer a m x) (Producer b m y)
    -> Lens' (Producer b m y) (Producer c m z)
    -> Lens' (Producer a m x) (Producer c m z)

(.) parallels conduit's (=$=) operator.

Here's a worked example showing off the new API, including newly-added foldl integration:

import qualified Control.Foldl as L
import Lens.Family ((^.))
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Prelude hiding (span, splitAt)

parser :: Parser Int IO ()
parser = do
    -- Attempt to draw a single element
    a <- draw
    lift (print a)

    -- Sum the next 10 input elements
    b <- zoom (splitAt 10) (L.purely foldAll L.sum)
    lift (print b)

    -- Collect all elements less than 15 into a list
    c <- zoom (span (< 15)) drawAll
    lift (print c)

    -- You can nest `zoom`s
    zoom (span (< 20)) $ do
        d <- zoom (splitAt 3) (L.purely foldAll L.product)
        lift (print d)

        e <- peek
        lift (print e)

    -- ... or compose lenses:
    f <- zoom (span (< 20) . splitAt 3) drawAll
    lift (print f)

    -- Print the remaining elements
    g <- drawAll
    lift (print g)

-- Lenses can modify `Producer`s, too
producer :: Monad m => Producer Int m (Producer Int m ())
producer = each [1..] ^. span (< 25)

We get the following output if we connect our parser to our producer:

>>> evalStateT parser producer
Just 1
65
[12,13,14]
4080
Just 18
[18,19]
[20,21,22,23,24]
>>>

Leftovers

There is a subtle difference between pipes-parse and conduit: pipes-parse correctly propagates leftovers further upstream whereas conduit does not. To illustrate this, let's begin from the following implementation of peek for conduit:

import Control.Monad.Trans.Class (lift)
import Data.Conduit
import Data.Conduit.List (isolate, sourceList)

peek :: Monad m => Sink a m (Maybe a)
peek = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> leftover a
    return ma

peek attempts to draw a value using await and then undraws the value using leftover before returning the result. peek will work correctly when used like this:

source :: Monad m => Source m Int
source = sourceList [1, 2]

sink1 :: Show a => Sink a IO ()
sink1 = do
    ma1 <- peek
    ma2 <- peek
    lift $ print (ma1, ma2)

If we feed source to sink1, both peeks will return Just 1, since the first peek undraws the 1 to make it available for the second peek:

>>> source $$ sink1
(Just 1,Just 1)

But what happens if we delimit the first peek using isolate, which only allows a fixed number of values to flow through? This is a common parsing request: run a parser within a subset of the input.

sink2 :: Show a => Sink a IO ()
sink2 = do
    ma1 <- isolate 10 =$ peek
    ma2 <- peek
    lift $ print (ma1, ma2)

However, when you compose two conduits, the downstream conduit discards all leftovers when done. Michael is up front about this in the documentation for (=$):

"Leftover data returned from the Sink will be discarded."

There are similar warnings for ($=) and (=$=), all of which discard leftovers from the right component. We can run sink2 to trigger this behavior:

>>> source $$ sink2
(Just 1,Just 2)

The undrawn 1 is irreversibly lost when (=$) completes, which is why the second peek reads a 2 instead of a 1.

The analogous pipes code gets this correct:

import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Prelude hiding (splitAt)

parser :: Show a => Parser a IO ()
parser = do
    ma1 <- zoom (splitAt 10) peek
    ma2 <- peek
    lift $ print (ma1, ma2)

producer :: Monad m => Producer Int m ()
producer = each [1, 2]

The pipes-parse version correctly restores the undrawn 1 so that the second peek also draws a 1:

>>> evalStateT parser producer
(Just 1,Just 1)

The magic is in the splitAt function, which is the pipes-parse analog of conduit's isolate. Compare the source of isolate (slightly rewritten):

isolate :: Monad m => Int -> Conduit a m a
isolate = loop
  where
    loop 0 = return ()
    loop n = do
        ma <- await
        case ma of
            Nothing -> return ()
            Just a  -> do
                yield a
                loop (n - 1)

... to the source of splitAt (also slightly changed to resemble isolate):

splitAt
    :: Monad m
    => Int
    -> Lens' (Producer a m x) (Producer a m (Producer a m x))
splitAt n0 k p0 = fmap join (k (loop n0 p0))
  where
    loop 0 p = return p
    loop n p = do
        x <- lift (next p)
        case x of
            Left   r      -> return (return r)
            Right (a, p') -> do
                yield a
                loop (n - 1) p'

The internal loop function in splitAt corresponds to the internal loop from isolate. The extra magic lies within the first line:

splitAt n0 k p0 = fmap join (k (loop n0 p0))

Lens aficionados will recognize this as the dependency-free version of:

splitAt n0 = iso (loop n0) join

This not only instructs the Lens in how to isolate out the first n0 elements (using loop), but also how to reverse the process and merge the remaining elements back in (using join). This latter information is what makes leftover propagation work.

Getters

Note that not all transformations are reversible, and irreversible transformations cannot propagate leftovers upstream. Fortunately, the lens model handles this perfectly: just define a Getter instead of a Lens' for transformations that are not reversible:

getter :: Getter (Producer a m x) (Producer b m y)

A Getter will only type-check as an argument to (^.) and not zoom, so you cannot propagate unDraw through a Getter like you can with a Lens':

zoom getter unDraw -- Type error

However, you can still use the Getter to transform Producers:

producer ^. getter -- Type checks!

This provides a type-safe way to distinguish transformations that can propagate leftovers from those that cannot.

In practice, pipes-based parsing libraries just provide functions between Producers instead of Getters for simplicity:

getter :: Producer a m x -> Producer b m y

... but you can keep using lens-like syntax if you promote them to Getters using the to function from lens:

to :: (a -> b) -> Getter a b

producer ^. to getter

Lens Support

Some people may worry about the cost of using lens in conjunction with pipes-parse, because lens is not beginner-friendly and has a large dependency graph, so I'd like to take a moment to advertise Russell O'Connor's lens-family-core library. lens-family-core is a beginner-friendly alternative that is (mostly) lens-compatible: it provides much simpler types and has a really tiny dependency profile.

Note that pipes-parse does not depend on either lens library. This is one of the beautiful aspects about lenses: you can write a lens-compatible library using nothing more than stock components from the Prelude and the transformers library. You can therefore combine pipes-parse with either lens-family-core or lens without any conflicts. This provides a smooth transition path from the beginner-friendly lens-family-core library to the expert-friendly lens library.

Laws

The lens approach is nice because you get many laws for free. For example, you get several associativity laws, like the fact that (^.) associates with (.):

(producer ^. lens1) ^. lens2 = producer ^. (lens1 . lens2)

Similarly, (.) associates with zoom:

zoom lens1 (zoom lens2 parser) = zoom (lens1 . lens2) parser

... and the trio of evalStateT/(^.)/zoom all associate:

evalStateT parser (producer ^. lens)
    = evalStateT (zoom lens parser) producer

Also, lens composition associates, because it's just function composition:

(lens1 . lens2) . lens3 = lens1 . (lens2 . lens3)

You even get the following identity laws for free:

producer ^. id = producer

zoom id producer = producer

f . id = f

id . f = f

However, there is one caveat: many of the lenses in pipes-parse do not satisfy certain lens laws. Specifically, they do not satisfy these laws:

-- Law violation #1
view lens (set lens x a) /= x

-- Law violation #2
zoom lens $ do x <- m  /=  do x <- zoom lens m
               f x            zoom lens (f x)

Law violation #1 arises because I don't know of a lens-like abstraction that type-checks as a Getter and a Focusing, but not a Setter.

However, law #2 directly conflicts with a core pipes-parse feature, specifically lenses like splitAt that delimit parsers. Here's why:

zoom (splitAt n) $ do x <- m  /=  do x <- zoom (splitAt n) m
                      f x            zoom (splitAt n) (f x)

Limiting one parser to n elements is not the same as limiting its two sub-parsers to n elements each. So if you use pipes-parse lenses you cannot rely on zoom being a monad morphism when doing equational reasoning. This was a tough call for me to make, but I felt that delimiting parsers were more important than the monad morphism laws for zoom. Perhaps there is a more elegant solution that I missed that resolves this conflict, but I'm still pleased with the current solution.

Generality

Notice how all of these functions and laws are completely pipes-independent. Any streaming abstraction that has some Producer-like type can implement lens-based parsing, too, and also get all of these laws for free, including pure streams (such as strict or lazy Text) or even lazy IO. Other streaming libraries can therefore benefit from this exact same trick.

Batteries included

Downstream libraries have been upgraded to use the pipes-parse API, including pipes-bytestring, pipes-binary, and pipes-attoparsec. This means that right now you can do cool things like:

import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Pipes.Binary
import qualified Pipes.ByteString as ByteString

parser :: Parser ByteString IO ()
parser = zoom (ByteString.splitAt 100 . decoded) $ do
    x <- draw  -- Draw a decoded Int
    lift $ print (x :: Maybe Int)
    unDraw 99  -- This undraws the encoded bytes

producer :: Monad m => Producer ByteString m ()
producer = for (each [(1::Int)..]) encode

The above parser composes two lenses so that it zooms in on a stream of decoded ints that consume no more than 100 bytes. This will transmute draw to now receive decoded elements and unDraw will magically re-encode elements when pushing back leftovers:

>>> producer' <- execStateT parser producer
Just 1
>>> evalStateT parser producer'
Just 99

Also, Michael Thompson has released a draft of pipes-text on Hackage. This means you can parse a byte stream through a UTF-8 lens and any undrawn input will be encoded back into the original byte stream as bytes. Here is an example program showcasing this neat feature:

-- decode.hs

import Data.ByteString (ByteString)
import Data.Text       (Text)
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import qualified Pipes.ByteString as ByteString
import qualified Pipes.Text       as Text

-- Retrieve all `Text` chunks up to 10 characters
parser :: Monad m => Parser ByteString m [Text]
parser = zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll

main = do
    (textChunks, leftovers) <- runStateT parser ByteString.stdin
    print textChunks

    -- Now print the remaining `ByteString` chunks
    byteChunks <- evalStateT drawAll leftovers
    print byteChunks

The unused bytes from the decoded stream get correctly undrawn to the original byte stream!

$ ./decode
Hello, 世界!!!<Enter>
["Hello, \19990\30028!"]
abcdefg<Enter>
<Ctrl-D>
["!!\n","abcdefg\n"]
$

The remainder of the first line is undrawn by the decoder and restored back as the original encoded bytes.

Note that the above example is line-buffered, which is why the program does not output the Text chunks immediately after the 10th input character. However, if you disable line buffering then every chunk contains just a single character and the example wouldn't illustrate how leftovers work.

The above example could have also been written as a single Parser:

parser :: Parser ByteString IO ()
parser = do
    texts <- zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll
    lift (print texts)
    bytes <- drawAll
    lift (print bytes)

main = evalStateT parser ByteString.stdin

... but I wanted to make the leftover passing explicit to emphasize that the leftover behavior holds correctly whether or not you exit and re-enter pipes.

Conclusion

The pipes-parse API lets you propagate leftovers upstream, encode leftover support in the type system, and equationally reason about code with several theoretical laws. Additionally, pipes-parse reuses existing functions and concepts from lens and StateT rather than introducing a new set of abstractions to learn.

Note that pipes-parse used to have some FreeT-based operations as well. These have been moved to a separate pipes-group library (and upgraded to use lenses) since they are conceptually orthogonal to parsing. I will blog about this library in a separate post.

You can find pipes-parse on Hackage or Github, and it comes with an extended tutorial.