`pipes-parse` is finally out! `pipes` users know that `pipes` has lagged behind `conduit` and `io-streams` in the parsing arena, and this library provides the utilities necessary to close the gap. You can find the `pipes-parse` library here, and I recommend reading the tutorial. This post will mainly discuss the development of `pipes-parse` and compare it to parsing solutions from other streaming libraries.

#### End of Input

`pipes-parse` copies both `io-streams` and `conduit` for the end-of-input protocol: wrap values in `Just` and end with a stream of `Nothing`s. There are two ways you can modify an input stream to obey this protocol.
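As a concrete illustration of what a consumer of this protocol looks like, here is a minimal sketch in pipes-3.x style. The name `printAll` and its body are my own illustration, not taken from the library:

```haskell
-- Sketch (pipes-3.x style, names illustrative): consume wrapped
-- values until the stream signals end of input with 'Nothing'.
printAll :: (Proxy p) => () -> Consumer p (Maybe String) IO ()
printAll () = runIdentityP loop
  where
    loop = do
        ma <- request ()
        case ma of
            Nothing  -> return ()          -- end of input reached
            Just str -> do
                lift (putStrLn str)
                loop
```

Either of the two approaches below can then feed this consumer, because both present the input as a stream of `Maybe` values.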

The first approach is to use the `wrap` function, which enforces this protocol:

```haskell
wrap :: (Monad m, Proxy p) => p a' a b' b m r -> p a' a b' (Maybe b) m s
```

Then you can just write:

```haskell
wrap . producer >-> consumer
```

`wrap` proves its termination safety by having a polymorphic return value (because it ends with a never-ending stream of `Nothing`s).

The second approach is to rewrite your producer as an input stream in the style of `io-streams` (see this post for details):

```haskell
source' :: (Monad m, Proxy p) => () -> Session p m (Maybe a)
```

... and use `request` composition to connect the producer:

```haskell
source' \>\ consumer
```

This approach proves its termination safety by virtue of using `request` composition. The composition operator specializes to:

```haskell
(\>\) :: (Monad m, Proxy p)
      => (() -> Session  p           m (Maybe a))
      -> (() -> Consumer p (Maybe a) m b)
      -> (() -> Session  p           m b)
```

The composite pipe's return value only derives from the downstream pipe (i.e. `consumer` in this case). This is because `request` composition is automatically safe against termination from the upstream pipe. In the above example, `source'` just replaces every `request` within `consumer`, and if `source'` terminates, all that means is that the `request` completes.

What's nice is that both approaches are 100% compatible with each other. You, the pipe writer, do not need to anticipate which way users will supply input. You just write a pipe that consumes values of type `Maybe a`, and both of the above approaches will work with your pipe. Also, both of these approaches guarantee that you can return values directly from the downstream pipe without guarding the return value with a `Maybe`.

#### Pushback and Leftovers

`pipes` implements pushback using the `StateP` proxy transformer:

```haskell
-- Like @request ()@, except try to use the leftovers buffer first
draw :: (Monad m, Proxy p) => StateP [a] p () (Maybe a) y' y m (Maybe a)

-- Push an element back onto the leftovers buffer
unDraw :: (Monad m, Proxy p) => a -> StateP [a] p x' x y' y m ()
```

This is a great example of how the proxy transformer system makes it easy to extend `pipes` with new features without baking them into the core implementation. I can use the (newly-fixed) `StateP` proxy transformer to add a leftovers buffer that `draw` and `unDraw` both use.
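To see how these two primitives compose, here is a sketch of a `peek` built on top of them: it draws the next element and immediately pushes it back, so the stream is left unchanged. Treat this as an illustration of the idiom rather than the library's actual definition:

```haskell
-- Sketch: inspect the next element without consuming it, by drawing
-- it and immediately pushing it back onto the leftovers buffer.
peek :: (Monad m, Proxy p) => StateP [a] p () (Maybe a) y' y m (Maybe a)
peek = do
    ma <- draw
    case ma of
        Just a  -> unDraw a   -- restore the element we just drew
        Nothing -> return ()  -- end of input: nothing to push back
    return ma
```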

Pushback is where `pipes-parse` **significantly** improves on the competition. To motivate the `pipes-parse` solution, consider the type for `conduit`'s most general composition operator:

```haskell
(>+>) :: Monad m
      => Pipe l    a b r0 m r1
      -> Pipe Void b c r1 m r2
      -> Pipe l    a c r0 m r2
--            ^
--            |
--            +-- Leftovers
```

The downstream conduit cannot provide leftovers because they will be lost after composition. With `pipes-parse` you can save leftovers from both composed pipes very easily. To see how, imagine we have the following two pipe types:

```haskell
p1 :: (Monad m, Proxy p) => () -> Pipe (StateP [a] p) (Maybe a) (Maybe b) m r
p2 :: (Monad m, Proxy p) => () -> Pipe (StateP [b] p) (Maybe b) (Maybe c) m r
--                                             ^
--                                             |
--                                             +-- Leftovers
```

Each of these pipes stores a leftovers buffer equal to its input type, but we can't yet compose these pipes because their leftovers buffers don't match. However,

`pipes-parse` provides lens support in the form of the `zoom` function so that you can easily unify two leftovers buffers in order to compose them:

```haskell
zoom _fst . p1
    :: (Monad m, Proxy p) => () -> Pipe (StateP ([a], [b]) p) (Maybe a) (Maybe b) m r

zoom _snd . p2
    :: (Monad m, Proxy p) => () -> Pipe (StateP ([a], [b]) p) (Maybe b) (Maybe c) m r

zoom _fst . p1 >-> zoom _snd . p2
    :: (Monad m, Proxy p) => () -> Pipe (StateP ([a], [b]) p) (Maybe a) (Maybe c) m r
```

But you can do more than that! You can still access the leftovers buffers afterwards, too, again using `zoom`:

```haskell
example = do
    (zoom _fst . p1 >-> zoom _snd . p2) ()

    -- Draw, reusing the leftovers from @p1@
    ma <- zoom _fst draw

    -- Retrieve the leftovers from @p2@
    mb <- zoom _snd get
    ...
```

This kind of multiple-buffer management isn't possible using `conduit`.

`zoom` is a perfect example of the functor design pattern. We lift two existing proxies to agree on a common global state for compatibility purposes. Therefore, we expect that there should be functor laws at play:

```haskell
zoom id = id

zoom (f . g) = zoom f . zoom g
```

`pipes` also improves upon `io-streams` pushback. With `io-streams` all the pushback is done using `IORef`s, meaning that:

- It isn't pure
- You can't easily control which streams share leftovers and which ones do not
- None of the state is reflected in the types

With `pipes-parse` you get pure and precise control over leftovers. Moreover, you do not need to instrument streams to correctly forward values that you push back upstream, because `StateP` abstracts over that for you.

#### Nesting and delimiting parsers

Like other streaming libraries, `pipes-parse` makes it very easy to run a parser on a subset of the stream. This was probably the #1 requested feature, followed shortly by...
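One way to delimit a sub-parser is to insert a pipe that forwards only a bounded prefix of the wrapped elements and then yields `Nothing` forever, so the downstream parser sees an early end of input. The name `passUpTo` echoes the library's vocabulary, but treat this exact definition and signature as my own sketch:

```haskell
-- Hypothetical sketch: forward at most n wrapped elements, then
-- signal end of input to the downstream parser by yielding 'Nothing'
-- forever (pipes-3.x style).
passUpTo :: (Monad m, Proxy p) => Int -> () -> Pipe p (Maybe a) (Maybe a) m r
passUpTo n () = runIdentityP $ do
    replicateM_ n (request () >>= respond)  -- pass through n elements
    forever (respond Nothing)               -- then fake end of input
```

A composition like `wrap . producer >-> passUpTo 3 >-> subParser` would then let `subParser` consume at most three elements before it observes end of input.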

#### Resumable parsing

`pipes-parse` uses `StateP`, so if you want to interrupt parsing you can just use `runStateK` to return the current state of the leftovers for use in a later computation. Simple!
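A hedged sketch of the idea, assuming pipes-3.x style where `runStateK` takes an initial buffer and returns the session result paired with the final buffer; `producer`, `producer'`, `parser`, and `parser'` are hypothetical:

```haskell
-- Sketch: interrupt a parse, save the leftovers, and resume later.
main = do
    -- Start with an empty leftovers buffer; get back the result and
    -- whatever leftovers the parser pushed back.
    (r, leftovers) <- runProxy $ runStateK [] $
        wrap . producer >-> parser

    -- ... later, seed a new session with the saved leftovers:
    (r', _) <- runProxy $ runStateK leftovers $
        wrap . producer' >-> parser'

    print (r, r')
```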

#### Perfect streaming

One of the more advanced features to come out of the last wave of development was what I like to call "perfect streaming". This has a very specific meaning: grouping the input and interacting with each group as a stream without bringing more than one chunk into memory.

For example, consider the following conduit:

```haskell
lines :: Monad m => Conduit ByteString m ByteString
```

This will load each line into memory, which means that if your file is one long line then you will load the entire file into memory, defeating the purpose of streaming!

`io-streams` has the same problem but, unlike `conduit`, `io-streams` can easily fix its `lines` utility to stream perfectly, and I plan to show Greg how to do this so that `io-streams` users can benefit from the same trick.

`pipes-parse` does not teach how to use this trick, but it does lay the groundwork for it, and the upcoming `pipes-bytestring` library will provide examples of this idiom. If you want to see a concrete example of this trick in action, check out Oliver Charles's upcoming `pipes-tar` library on Github, where he streams individual files from a TAR archive without ever loading more than one chunk into memory. His very interesting use case was the inspiration for this trick, and I also preview the idiom in this Stack Overflow answer.

More generally, perfect streaming uses the `respond` category's composition operator, which has the following general type:

```haskell
(/>/) :: (Monad m, Proxy p)
      => (a -> p x' x b' b m a')
      -> (b -> p x' x c' c m b')
      -> (a -> p x' x c' c m a')
```

When you use `respond` composition, both pipes share the same upstream interface, meaning that you can group the input into subsections but still allow each subsection to access the original upstream interface. With appropriate information hiding you can set up pipes which behave like lenses onto specific subsections of the stream and allow the user to stream from each subsection independently.

#### Compatibility

`pipes-parse` takes great care to ensure that non-parsing pipes are completely interoperable with parsing pipes, thanks to the following compatibility functions:

```haskell
fmapPull :: (Monad m, Proxy p)
         => (x -> p x        a  x        b  m r)
         -> (x -> p x (Maybe a) x (Maybe b) m r)

returnPull :: (Monad m, Proxy p)
           => x -> p x a x (Maybe a) m r

bindPull :: (Monad m, Proxy p)
         => (x -> p x        a  x (Maybe b) m r)
         -> (x -> p x (Maybe a) x (Maybe b) m r)
```

These three functions define functors and monads in the category where the objects are the downstream components of each proxy interface and the morphisms are pull-based pipes.
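For instance, an ordinary pipe written without end-of-input in mind can be lifted over the `Maybe`-based protocol with `fmapPull`. A sketch, using `mapD` from the `pipes` prelude (the name `upperD` is my own):

```haskell
import Data.Char (toUpper)

-- Sketch: 'fmapPull' applies the inner pipe to the 'Just' values and
-- forwards every 'Nothing' along unchanged, so the lifted pipe still
-- propagates end of input correctly.
upperD :: (Monad m, Proxy p) => x -> p x (Maybe Char) x (Maybe Char) m r
upperD = fmapPull (mapD toUpper)
```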

As you might guess, `fmapPull` satisfies the functor laws:

```haskell
fmapPull f >-> fmapPull g = fmapPull (f >-> g)

fmapPull pull = pull
```

Similarly, `returnPull` and `bindPull` satisfy the monad laws:

```haskell
-- Using: f >>> g = f >-> bindPull g

returnPull >>> f = f

f >>> returnPull = f

(f >>> g) >>> h = f >>> (g >>> h)
```

... which are equivalent to:

```haskell
returnPull >-> bindPull f = f

bindPull returnPull = pull

bindPull (f >-> bindPull g) = bindPull f >-> bindPull g
```

... and we can derive the functor from the monad:

```haskell
fmapPull f = bindPull (f >-> returnPull)

-- i.e. fmap f = (>>= return . f)
```

These functions could be generalized over `Maybe` by parametrizing them with a type class like:

```haskell
class FunctorPull f where
    fmapPull
        :: (Monad m, Proxy p)
        => (x -> p x    a  x    b  m r)
        -> (x -> p x (f a) x (f b) m r)
```

... and there is a sensible instance for `Either`, too (in fact, that's how `rightD` from the `pipes` prelude works). However, I decided to keep them monomorphic for now for simplicity.

#### Conclusion

`pipes-parse`, like most `pipes` libraries, keeps the best spirit of Haskell programming by:

- composing features from smaller, simpler, and correct building blocks,
- using higher-order functions to lift existing functions for compatibility, and
- isolating features from each other to statically prevent accidental complexity.

`pipes-parse` is the last of the three core libraries, the other two being `pipes-safe` and `pipes-concurrency`. These libraries define the central idioms for the `pipes` ecosystem, and they were all designed to be instructive and convention-setting in areas where there isn't a perfectly elegant solution and some pragmatic trade-offs had to be made.

The completion of these libraries marks the point where I feel the core `pipes` API has proven itself to be sufficiently versatile and future-proof. The proxy transformer system makes the central API unusually stable because I don't need to bake in any new features that I want to add.

This means I will be upgrading `pipes` to version `4.0` soon to mark the transition to a stabler API in preparation for eventual inclusion in the Haskell Platform. Also, most development work will now shift to derived libraries.

That does not mean that the derived libraries are complete yet. For example, I am currently writing up `pipes-safe-2.0.0`, which will feature improved promptness guarantees and eliminate the need for unsafe finalization primitives. Similarly, I am about to release `pipes-concurrency-1.2.0` at the end of this week, which will add broadcasts and continuous behaviors. More generally, I will only consider the derived core libraries to be mature when more code is built on top of them on the scale of what `conduit` has right now.

The next library on the development docket is `pipes-bytestring`. Now that `pipes-parse` is complete, I feel much more comfortable about the stability of the `pipes-bytestring` API. Also, `pipes` now has an official mailing list where you can ask questions, follow `pipes` development, or offer feedback and contribute to upcoming `pipes` libraries.
