Wednesday, February 5, 2014

pipes-parse-3.0: Lens-based parsing

pipes-parse-3.0.0 introduces a new lens-based parsing mechanism. These lenses improve the library in two ways:

  • They greatly simplify the API

  • They correctly propagate leftovers further upstream

Overview

The new parsing API consists of three main types, which roughly parallel the three pipes abstractions:

  • Producers, which are unchanged from pipes

    Producer a m x
  • Parsers, which are the parsing analog of Consumers:

    type Parser a m r = forall x . StateT (Producer a m x) m r
  • Lens'es between Producer's, which are the parsing analog of Pipes:

    Lens' (Producer a m x) (Producer b m y)

What's neat is that pipes-parse does not need to provide any operators to connect these three abstractions. All of the tools you need already exist in either transformers and lens (or lens-family-core, if you prefer a simpler lens alternative).

For example, you connect a Parser to a Producer using either runStateT, evalStateT, or execStateT:

                                            +- Result
                                            |
runStateT                                   v
    :: Parser a m r -> Producer a m x -> m (r, Producer a m x)
evalStateT
    :: Parser a m r -> Producer a m x -> m  r
execStateT
    :: Parser a m r -> Producer a m x -> m (   Producer a m x)
                       ^^^^^^^^^^^^^^          ^^^^^^^^^^^^^^
                        Parser Input              Leftovers

These correspond to the three possible ways you might want to run a parser:

  • runStateT: Return the result and leftovers
  • evalStateT: Return only the result
  • execStateT: Return only the leftovers

In fact, two of these functions closely parallel conduit operators:

  • runStateT parallels ($$+)
  • evalStateT parallels ($$)

You can also connect Producers to Lens'es using (^.) or view:

(^.) :: Producer a m x
     -> Lens' (Producer a m x) (Producer b m y)
     -> Producer b m y

(^.) parallels conduit's ($=) operator.

If you want to connect Lens'es to Parsers, you use zoom:

zoom :: Lens' (Producer a m x) (Producer b m y)
     -> Parser b m r
     -> Parser a m r

zoom parallels conduit's (=$) operator.

Finally, you connect Lens'es to each other using (.) (i.e. function composition):

(.) :: Lens' (Producer a m x) (Producer b m y)
    -> Lens' (Producer b m y) (Producer c m z)
    -> Lens' (Producer a m y) (Producer c m z)

(.) parallels conduit's (=$=) operator.

Here's a worked example showing off the new API, including newly-added foldl integration:

import qualified Control.Foldl as L
import Lens.Family ((^.))
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Prelude hiding (span, splitAt)

parser :: Parser Int IO ()
parser = do
    -- Attempt to draw a single element
    a <- draw
    lift (print a)

    -- Sum the next 10 input elements
    b <- zoom (splitAt 10) (L.purely foldAll L.sum)
    lift (print b)

    -- Collect all elements less than 15 into a list
    c <- zoom (span (< 15)) drawAll
    lift (print c)

    -- You can nest `zoom`s
    zoom (span (< 20)) $ do
        d <- zoom (splitAt 3) (L.purely foldAll L.product)
        lift (print d)

        e <- peek
        lift (print e)

    -- ... or compose lenses:
    f <- zoom (span (< 20) . splitAt 3) drawAll
    lift (print f)

    -- Print the remaining elements
    g <- drawAll
    lift (print g)

-- Lenses can modify `Producer`s, too
producer :: Monad m => Producer Int m (Producer Int m ())
producer = each [1..] ^. span (< 25)

We get the following output if we connect our parser to our producer:

>>> evalStateT parser producer
Just 1
65
[12,13,14]
4080
Just 18
[18,19]
[20,21,22,23,24]
>>>

Leftovers

There is a subtle difference between pipes-parse and conduit: pipes-parse correctly propagates leftovers further upstream whereas conduit does not. To illustrate this, let's begin from the following implementation of peek for conduit:

import Control.Monad.Trans.Class (lift)
import Data.Conduit
import Data.Conduit.List (isolate, sourceList)

peek :: Monad m => Sink a m (Maybe a)
peek = do
    ma <- await
    case ma of
        Nothing -> return ()
        Just a  -> leftover a
    return ma

peek attempts to draw a value using await and then undraws the value using leftover before returning the result. peek will work correctly when used like this:

source :: Monad m => Source m Int
source = sourceList [1, 2]

sink1 :: Show a => Sink a IO ()
sink1 = do
    ma1 <- peek
    ma2 <- peek
    lift $ print (ma1, ma2)

If we feed source to sink1, both peeks will return Just 1, since the first peek undraws the 1 to make it available for the second peek:

>>> source $$ sink1
(Just 1,Just 1)

But what happens if we delimit the first peek using isolate, which only allows a fixed number of values to flow through? This is a common parsing request: run a parser within a subset of the input.

sink2 :: Show a => Sink a IO ()
sink2 = do
    ma1 <- isolate 10 =$ peek
    ma2 <- peek
    lift $ print (ma1, ma2)

However, when you compose two conduits, the downstream conduit discards all leftovers when done. Michael is up front about this in the documentation for (=$):

"Leftover data returned from the Sink will be discarded."

There are similar warnings for ($=) and (=$=), all of which discard leftovers from the right component. We can run sink2 to trigger this behavior:

>>> source $$ sink2
(Just 1,Just 2)

The undrawn 1 is irreversibly lost when (=$) completes, which is why the second peek reads a 2 instead of a 1.

The analogous pipes code gets this correct:

import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Prelude hiding (splitAt)

parser :: Show a => Parser a IO ()
parser = do
    ma1 <- zoom (splitAt 10) peek
    ma2 <- peek
    lift $ print (ma1, ma2)

producer :: Monad m => Producer Int m ()
producer = each [1, 2]

The pipes-parse version correctly restores the undrawn 1 so that the second peek also draws a 1:

>>> evalStateT parser producer
(Just 1,Just 1)

The magic is in the splitAt function, which is the pipes-parse analog of conduit's isolate. Compare the source of isolate (slightly rewritten):

isolate :: Monad m => Int -> Conduit a m a
isolate = loop
  where
    loop 0 = return ()
    loop n = do
        ma <- await
        case ma of
            Nothing -> return ()
            Just a  -> do
                yield a
                loop (n - 1)

... to the source of splitAt (also slightly changed to resemble isolate):

splitAt
    :: Monad m
    => Int
    -> Lens' (Producer a m x) (Producer a m (Producer a m x))
splitAt n0 k p0 = fmap join (k (loop n0 p0))
  where
    loop 0 p = return p
    loop n p = do
        x <- lift (next p)
        case x of
            Left   r      -> return (return r)
            Right (a, p') -> do
                yield a
                loop (n - 1) p'

The internal loop function in splitAt corresponds to the internal loop from isolate. The extra magic lies within the first line:

splitAt n0 k p0 = fmap join (k (loop n0 p0))

Lens aficionados will recognize this as the dependency-free version of:

splitAt n0 = iso (loop n0) join

This not only instructs the Lens in how to isolate out the first ten elements (using loop), but also how to reverse the process and merge the remaining elements back in (using join). This latter information is what makes leftover propagation work.

Getters

Note that not all transformations are reversible and therefore cannot propagate leftovers upstream. Fortunately, the lens model handles this perfectly: just define a Getter instead of a Lens' for transformations that are not reversible:

getter :: Getter (Producer a m x) (Producer b m y)

A Getter will only type-check as an argument to (^.) and not zoom, so you cannot propagate unDraw through a Getter like you can with a Lens':

zoom getter unDraw -- Type error

However, you can still use the Getter to transform Producers:

producer ^. getter -- Type checks!

This provides a type-safe way to distinguish transformations that can propagate leftovers from those that cannot.

In practice, pipes-based parsing libraries just provide functions between Producers instead of Getters for simplicity:

getter :: Producer a m x -> Producer b m y

... but you can keep using lens-like syntax if you promote them to Getters using the to function from lens:

to :: (a -> b) -> Getter a b

producer ^. to getter

Lens Support

Some people may worry about the cost of using lens in conjunction with pipes-parse because lens is not beginner-friendly and has a large dependency graph, so I'd like to take a moment to advertise Russell O'Connor's lens-family-core library. lens-family-core is a beginner-friendly lens-alternative that is (mostly) lens-compatible. It provides much simpler, beginner-friendly types and has a really tiny dependency profile.

Note that pipes-parse does not depend on either lens library. This is one of the beautiful aspects about lenses: you can write a lens-compatible library using nothing more than stock components from the Prelude and the transformers library. You can therefore combine pipes-parse with either lens-family-core or lens without any conflicts. This provides a smooth transition path from the beginner-friendly lens-family-core library to the expert-friendly lens library.

Laws

The lens approach is nice because you get many laws for free. For example, you get several associativity laws, like the fact that (^.) associates with (.):

(producer ^. lens1) ^. lens2 = producer ^. (lens1 . lens2)

Similarly, (.) associates with zoom:

zoom lens1 (zoom lens2 parser) = zoom (lens1 . lens2) parser

... and the trio of evalStateT/(^.)/zoom all associate:

evalStateT parser (producer ^. lens)
    = evalStateT (zoom lens parser) producer

Also, lens composition associates, because it's just function composition:

(lens1 . lens2) . lens3 = lens1 . (lens2 . lens3)

You even get the following identity laws for free:

producer ^. id = producer

zoom id producer = producer

f . id = f

id . f = f

However, there is one caveat: many of the lenses in pipes-parse do not satisfy certain lens laws. Specifically, they do not satisfy these laws:

-- Law violation #1
view lens (set lens x a) /= x

-- Law violation #2
zoom lens $ do x <- m  /=  do x <- zoom lens m
               f x            zoom lens (f x)

Law violation #1 arises because I don't know of a lens-like abstraction that type-checks as a Getter and a Focusing, but not a Setter.

However, law #2 directly conflicts with a core pipes-parse feature, specifically lenses like splitAt that delimit parsers. Here's why:

zoom (splitAt n) $ do x <- m  /=  do x <- zoom (splitAt n) m
                      f x            zoom (splitAt n) (f x)

Limiting one parser to n elements is not the same as limiting its two sub-parsers to n elements each. So if you use pipes-parse lenses you cannot rely on zoom being a monad morphism when doing equational reasoning. This was a tough call for me to make, but I felt that delimiting parsers were more important than the monad morphism laws for zoom. Perhaps there is a more elegant solution that I missed that resolves this conflict, but I'm still pleased with the current solution.

Generality

Notice how all of these functions and laws are completely pipes-independent. Any streaming abstraction that has some Producer-like type can implement lens-based parsing, too, and also get all of these laws for free, including pure streams (such as strict or lazy Text) or even lazy IO. Other streaming libraries can therefore benefit from this exact same trick.

Batteries included

Downstream libraries have been upgraded to use the pipes-parse API, including pipes-bytestring, pipes-binary, and pipes-attoparsec.This means that right now you can do cool things like:

import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import Pipes.Binary
import qualified Pipes.ByteString as ByteString

parser :: Parser ByteString IO ()
parser = zoom (ByteString.splitAt 100 . decoded) $ do
    x <- draw  -- Draw a decoded Int
    lift $ print (x :: Maybe Int)
    unDraw 99  -- This undraws the encoded bytes

producer :: Monad m => Producer ByteString m ()
producer = for (each [(1::Int)..]) encode

The above parser composes two lenses so that it zooms in on a stream of decoded ints that consume no more than 100 bytes. This will transmute draw to now receive decoded elements and unDraw will magically re-encode elements when pushing back leftovers:

>>> producer' <- execStateT parser producer
Just 1
>>> evalStateT parser producer'
Just 99

Also, Michael Thompson has released a draft of pipes-text on Hackage. This means you can parse a byte stream through a UTF-8 lens and any undrawn input will be encoded back into the original byte stream as bytes. Here is an example program show-casing this neat feature:

-- decode.hs

import Data.ByteString (ByteString)
import Data.Text       (Text)
import Lens.Family.State.Strict (zoom)
import Pipes
import Pipes.Parse
import qualified Pipes.ByteString as ByteString
import qualified Pipes.Text       as Text

-- Retrieve all `Text` chunks up to 10 characters
parser :: Monad m => Parser ByteString m [Text]
parser = zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll

main = do
    (textChunks, leftovers) <- runStateT parser ByteString.stdin
    print textChunks

    -- Now print the remaining `ByteString` chunks
    byteChunks <- evalStateT drawAll leftovers
    print byteChunks

The unused bytes from the decoded stream get correctly undrawn to the original byte stream!

$ ./decode
Hello, 世界!!!<Enter>
["Hello, \19990\30028!"]
abcdefg<Enter>
<Ctrl-D>
["!!\n","abcdefg\n"]
$

The remainder of the first line is undrawn by the decoder and restored back as the original encoding bytes.

Note that the above example is line-buffered, which is why the program does not output the Text chunks immediately after the 10th input character. However, if you disable line buffering then all chunks have just a single character and the example wouldn't illustrate how leftovers worked.

The above example could have also been written as a single Parser:

parser :: Parser ByteString IO ()
parser = do
    texts <- zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll
    lift (print texts)
    bytes <- drawAll
    lift (print bytes)

main = evalStateT parser ByteString.stdin

... but I wanted to make the leftover passing explicit to emphasize that the leftover behavior holds correctly whether or not you exit and re-enter pipes.

Conclusion

The pipes-parse API lets you propagate leftovers upstream, encode leftover support in the type system, and equationally reason about code with several theoretical laws. Additionally, pipes-parse reuses existing functions and concepts from lens and StateT rather than introducing a new set of abstractions to learn.

Note that pipes-parse used to have a some FreeT-based operations as well. These have been moved to a separate pipes-group library (and upgraded to use lenses) since they are conceptually orthogonal to parsing and I will blog about this library in a separate post.

You can find pipes-parse on Hackage or Github, and it comes with an extended tutorial.

No comments:

Post a Comment