Wednesday, October 31, 2012

pipes-2.5: Faster and slimmer

Introduction


I optimized the entire pipes library very aggressively for version 2.5, and now the library runs faster than conduit on my micro-benchmarks. I'll begin with the purest benchmark, which gives the greatest difference in speed since it only measures the efficiency of each library's implementation, without any IO bottlenecks:
-- Pipes (one complete program)
import Control.Proxy

n = 1000000 :: Int

main = runProxy $ discard <-< enumFromToS 1 n

-- Conduit (a separate program)
import Data.Conduit
import Data.Conduit.List as L

n = 1000000 :: Int

main = L.enumFromTo 1 n $$ L.sinkNull
Note some differences from last time. This time I'm using conduit's built-in optimized discard equivalent: sinkNull. Also, I've multiplied n by 10 to more accurately measure throughput. I compile both implementations with -O2.

pipes now spends about 112 ns per round-trip:
real    0m0.112s
user    0m0.104s
sys     0m0.004s
... while conduit spends about 167 ns per round-trip:
real    0m0.167s
user    0m0.156s
sys     0m0.008s
I achieved this speed increase by reverting to the pipes-1.0 trick of making the base monad optional, at the expense of breaking the monad transformer laws. I spent a considerable amount of effort trying to get the correct version to work, but I was led inexorably to the same conclusion Michael had already reached: the original approach was best, and the gain in performance is worth bending the monad transformer laws.

Note that the above benchmark exaggerates the differences and is not indicative of real-world performance. For typical code you will not observe measurable differences between pipes and conduit when IO is the bottleneck.

There is also one area in which conduit may still give (very slightly) better performance, which is in speeding up user-defined pipes. One goal I did not complete for this release was copying Michael's trick of using rewrite RULES to inline the Monad instance for user-defined pipes. I plan to copy this same trick in a separate release because I want to take the time to ensure that I can get the rewrite RULES to always fire without interfering with other optimizations.


Light-weight


The big focus of this release was to make pipes a very light-weight dependency, both in terms of performance and transitive dependencies. In the rewrite I dropped the free dependency so now the package only has two non-base dependencies:
  • transformers >= 0.2.0.0
  • index-core
... and I plan on dropping index-core along with Frames once I complete my resource management solution, leaving just a transformers dependency, which is about as light-weight as it could possibly get.

I also stole a page from Michael's book by removing the -O2 flag from the pipes.cabal file. This flag no longer has any effect on performance after the rewrite, so you should see quicker compile times, making the pipes dependency even lighter.

There is still one other way I could make the pipes dependency even lighter, which is to remove the MFunctor class. The Control.MFunctor module requires Rank2Types, which might rule out pipes for projects that use non-GHC compilers, so if this is an issue for you, just let me know and I will try to migrate MFunctor to a separate library. Frames use a lot of extensions, but those will be on the way out, leaving behind just FlexibleContexts and KindSignatures, which are very mild extensions.


Resource management


I also wanted to use this update to point out that you can get deterministic resource management with pipes today if you use Michael's ResourceT in the base monad. So if you want to use pipes and all you care about is resource determinism then you can switch over already.

However, that alone will NOT give you prompt finalization, and if you want promptness you will have to wait until I complete my own resource management extension. The extension I have in mind will be released as a proxy transformer that you can layer into any proxy transformer stack, so any proxy code you currently write can be transparently upgraded to work with resource management once I release the extension.

Another thing I want to mention is that while I will release the tools to manage resources promptly and deterministically, I do not plan on using these tools in the proxy standard libraries that I will release. The main reasons for this are:
  • There is no one true solution to finalization and I don't want people to have to buy in to my finalization approach to use the standard libraries I provide.
  • Most people I've talked to who care about finalization usually take the initiative to write their own higher-level abstractions on top of whatever finalization primitives I provide them.
So my plan is that the standard libraries I write will focus purely on the streaming part of the problem and leave initialization/finalization to the end user, who can implement it using my resource management solution or whatever other approach they prefer (such as ResourceT, for example).

If you want to know what I personally use in my own projects at the moment, I just use the following pattern:
withResource $ \h ->
    runProxy $ ... <-< streamFromResource h
This gives "good enough" behavior for my purposes and out of all the finalization alternatives I've tried, it is by far the easiest one to understand and use.
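
As a concrete, pipes-free instance of the same bracket pattern: withFile acquires the resource, the streaming work runs inside the callback, and the handle is guaranteed to be closed afterwards, even if an exception is thrown. (countLinesIn is just an illustrative helper, not part of any library.)

```haskell
import System.IO

-- Count the lines in a file, acquiring and releasing the handle with
-- withFile.  The handle is closed when the callback returns or throws,
-- which is exactly the determinism the pattern above provides.
countLinesIn :: FilePath -> IO Int
countLinesIn path =
    withFile path ReadMode $ \h ->
        let go n = do
                eof <- hIsEOF h
                if eof
                    then return n
                    else do
                        _ <- hGetLine h
                        go $! n + 1
        in go (0 :: Int)
```

The streaming proxy in the pattern above plays the role of the callback body here: everything that touches the handle stays inside the bracket.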

The other reason I'm trying out this agnostic approach to finalization is due to discussion with Gregory Collins about his upcoming io-streams library, where he takes a very similar approach to the one I just described of leaving initialization/finalization to the end user to avoid cross-talk between abstractions and to emphasize handling the streaming aspect correctly.


Goals for the near future


I focused on improving the performance of pure code because I plan to release bytestring/text standard libraries and their corresponding parsing proxy transformers very soon, which demand exceptional pure performance. The goal of the upcoming proxy-based parsing libraries is not to beat attoparsec in speed (which I'm reasonably sure is impossible), but rather to:
  • Interleave parsing with effects
  • Provide a low-memory streaming parser by allowing the user to selectively control backtracking
  • Still be really fast
The first two features are sorely missing from attoparsec, which can't interleave effects and always backtracks, so the input isn't freed from memory until parsing completes. For my own projects I need the second feature the most because I get a lot of requests to parse huge files (e.g. 20 GB) that do not fit in my computer's memory. More generally, I want pipes to be the fallback parser of choice for all problems that attoparsec does not solve.

13 comments:

  1. > I spent a considerable amount of effort trying to get the correct version to work, but I was led inexorably to the same conclusion that Michael already reached, which was that the original approach was best and that the gain in performance is worth bending the monad transformer laws.

    Can you go into more detail about exactly what the difficulty is and what "law bending" had to take place to overcome it?

    Replies
    1. I first described the issue here, back when I was first considering using the free package for pipes:

      https://github.com/ekmett/free/issues/3

      It is not exactly the same, but it is still the same basic idea: The monad transformer laws basically say that you should not be able to distinguish how many calls in a row were made to the base monad. The faster implementation does keep track of how many times you invoke the base monad, which violates the laws.

      When you call runProxy, though, that information vanishes because runProxy fuses all calls to the base monad together. This is called observational equivalence: the laws hold when viewed through the lens of runProxy.

      The only way you will notice a law violation is if you apply some function that counts the number of calls to the base monad.

      If you are worried about code breaking, that will not happen if you don't assume the monad transformer laws are correct. I would have preferred to use a separate function to replace lift and make clear the laws do not hold, but I believe users would not like that.

      Also, even if you DO assume the monad transformer laws hold, it still won't break if you don't use any functions that distinguish how many times you invoke the base monad. However, I would prefer that users just don't assume the laws hold, to be safe.
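
      To make the law violation concrete, here is a stripped-down toy model of the fast approach (the type and function names are made up for illustration; these are not the actual pipes internals). Each call to the base monad becomes an explicit Step constructor, so an observer can count them, which is exactly what the transformer laws forbid; fusing the steps back together, the way runProxy does, erases the difference:

```haskell
import Data.Functor.Identity (Identity(..))

-- Toy model: every base-monad action is stored as a separate Step.
data Steps m a = Pure a | Step (m (Steps m a))

instance Functor m => Functor (Steps m) where
    fmap f (Pure a) = Pure (f a)
    fmap f (Step m) = Step (fmap (fmap f) m)

instance Functor m => Applicative (Steps m) where
    pure = Pure
    Pure f <*> x = fmap f x
    Step m <*> x = Step (fmap (<*> x) m)

instance Functor m => Monad (Steps m) where
    Pure a >>= f = f a
    Step m >>= f = Step (fmap (>>= f) m)

-- lift, written out for the toy type
liftS :: Functor m => m a -> Steps m a
liftS m = Step (fmap Pure m)

-- An observer that counts base-monad invocations.  Its mere existence
-- breaks the law  liftS (m >>= f) == liftS m >>= (liftS . f).
countSteps :: Steps Identity a -> Int
countSteps (Pure _)            = 0
countSteps (Step (Identity s)) = 1 + countSteps s

-- Fusing the steps back into the base monad, like runProxy does,
-- makes the two sides indistinguishable again.
run :: Monad m => Steps m a -> m a
run (Pure a) = return a
run (Step m) = m >>= run
```

      With this model, countSteps distinguishes liftS m >> liftS n (two steps) from liftS (m >> n) (one step), while run maps both to the same base-monad action.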

    2. Is this something you are certain there is no solution to, or is it possible there is a nice fix?

    3. The issue isn't so much that the binds are intrinsically expensive, but rather that they interfere with a LOT of compiler optimizations since the compiler doesn't know about the monad laws or the monad transformer laws. For example, in the correct version I would end up having to return a value in the base monad like:

      let p1' = return x

      ... and consequently rebind the result in the next step of composition, and the compiler doesn't know that binding a return's result is equivalent to just a pure let binding, so a lot of optimizations just disappear.
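
      The specific equation in play is the left-identity monad law, which GHC cannot assume when optimizing code that is polymorphic in the monad. A tiny checker states it directly (checkLeftIdentity is just an illustrative helper, not from pipes):

```haskell
import Data.Functor.Identity (Identity(..))

-- Left identity:  return x >>= f  ==  f x
-- GHC does not know this equation holds for an abstract monad, so a
-- materialized `return x` in the base monad blocks rewrites that a
-- human applying the law by hand would perform.  This helper merely
-- tests the equation at a concrete monad.
checkLeftIdentity :: (Monad m, Eq (m b)) => a -> (a -> m b) -> Bool
checkLeftIdentity x f = (return x >>= f) == f x
```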

      The monad transformer gets in the way in a lot of other respects, too. In the non-monad-transformer version I can very aggressively rewrite composition into a very tight and compact form that generates really excellent core. With the monad transformer version, even if I take advantage of as many laws as possible, I cannot achieve anywhere near as compact and simple an implementation by hand.

      I also tried lots of rewrite rule tricks to try to help the compiler along and try to do this kind of work automatically, but getting them to reliably fire is very difficult and even when they did work it still didn't help that much.

      The best solution is to offer both implementations and let the user choose whether they want monad transformer laws or speed. However, I can't do that until I can type-class the utilities and that requires some form of polymorphic constraints. I did a lot of work with Edward's constraints library and other tricks to try to type-class my utilities, but that was not working for various reasons that I am still trying to figure out and I felt I needed to move on to fleshing out the standard libraries for now.

    4. Thanks for the in-depth explanation!

  2. I'm also curious about the specifics of the monad transformer law bending. A link to read up on it would be great.

    It sounds like pipes and conduit are moving closer together in many ways. I'm curious what ways they are still different - the resource finalization being abstracted away is obviously one, and a good one - but I'm curious about where else they still differ.

    Replies
      The main advantages of pipes (I really should rename it to proxies at this point) are:

      * Proxies let you communicate bidirectionally
      * The API is small: only one way to compose them
      * Careful attention to laws to avoid bugs from corner cases.
      * An extension framework that doesn't change the user-facing API.

      That last one is actually quite significant and probably the most underappreciated feature of the library. For example, using the EitherP extension you can throw and catch errors locally within a pipe without disturbing the rest of the pipeline (see the `pipes-2.4` release post for an example), something that no other iteratee implementations can do.

      The advantage of this approach will be more apparent when I release the parsing extension that lets you interleave arbitrary effects with parsing, something that other iteratee implementations cannot do. What they do is simply convert an attoparsec parser into a consumer rather than let you mix parsing commands with other effects. However, I will probably also provide that feature, just for completeness.

      All of these extensions don't change the composition operation or the request/respond commands. For example, you'll notice that conduit provides the $$+/$$++/$$- operators for dealing with leftovers. With proxies, the `StateP` extension handles all leftover state for you correctly, guaranteeing that it is never dropped, without any additional operators (you still use <-< for composition). I will also release a push-back-specific proxy transformer that is just a glorified wrapper around StateP [a], which makes it easier for users to understand what is going on and how pushback works with proxies.

      A lot of features that conduit provides can be replicated in proxies, but I haven't had the time to explain how yet, mainly because I'm focusing on getting the basic standard libraries out. Once the standard libraries are out I will post a "proxy cookbook" that explains the one true way to do everything you are used to doing with conduit, like folding or pushback or parsing. Until then, I mainly field questions from users and explain to them how to do things, so if you have any specific questions for how to do a particular task using proxies just let me know.

      There is one feature that I also plan to release for pipes-3.0 that will vastly simplify the API, which is to improve the way I structure all the type synonyms. Once I do this, you will be able to use the Pipe type synonyms from Control.Proxy.Pipe transparently with proxy transformers, so that you can still get the nicer unidirectional API to play nice with everything else. However, changing the type synonyms to fix this is a sufficiently breaking change that I'm delaying it until I bump the major version number.

    2. Oh, I forgot to mention that I answered the question about bending the monad transformer laws in response to the comment before you, so just read that. Long story short, this won't affect you if you don't mind that the internal implementation keeps track of how many times you invoked the base monad.

    3. Cool, thanks. Good stuff, looking forward to more.

  3. I agree that benchmarks which perform no IO are useful for comparing the overheads of libraries, but I often wonder how well they translate to real-world (i.e. IO-performing) code. In particular, I suspect GHC's optimizer does a much better job of unrolling/inlining/etc. pure code, so I wouldn't be surprised if there are some major differences in the generated core of IO-performing code and non-IO code. I've seen this in the past, but haven't checked recently.

    Partly because of this, I think it's important to focus on the performance of IO-based code. This is especially true as one of the major benefits of these libraries, deterministic interleaving of effects, isn't relevant to non-IO code, so a user can often write traditional, lazy functions in those cases (or more likely, already has lazy functions that can be used directly).

    I'm also not entirely sure what you mean when you state that other iteratee implementations can't implement arbitrary effects interleaved with parsing. In iteratee at least, an iteratee *is* a parser, and can be interleaved freely with other arbitrary effects. This has been true since the beginning. I'm probably misunderstanding your argument here.

    Replies
    1. Yeah, the parsing thing was a complete oversight. I've been so focused on comparisons to conduit lately that I forget about the rest of the field. I'll fix it after I post this comment.

      You are mostly right about IO, which is a sufficiently large bottleneck that the pure differences don't matter. In fact, that was my original justification for switching to the transformers version, because I thought the differences were negligible for IO code. However, there were more reasons I've been concentrating on improving pure performance lately.

      The first important reason is that I would like this library to replace lazy bytestrings and lazy text, even the purely-generated kind. For example, sometimes you want to purely generate a bytestring but you don't want to bring the entire bytestring into memory immediately, so you use a lazy bytestring. You could in principle use the existing lazy bytestring type for this (it is safe, since no lazy IO is involved), but then you would have to use a separate data type for bytestrings generated by IO and bytestrings generated by pure computations. I would like pipes to be used everywhere one needs a generator of any sort, pure or not.

      The second reason is that I want people to feel as comfortable about performance as possible, so comfortable that they use proxies pervasively. I consider the underlying abstraction of proxies to be a very fundamental building block that should play a central role in almost all programs. This is why I don't contaminate the core implementation with cross-cutting, domain-specific concerns like state/error-handling/finalization/parsing, and instead decompose those features into extensions: I see the underlying abstraction as much more general in scope than what people envisioned for it, and I think its use should be completely ubiquitous in modern Haskell programming, even for completely pure computations.

      I encounter these pure applications of the pipe abstraction all the time, but just haven't blogged about them because all of my efforts right now are on making the library as feature complete as possible before I begin promoting its use really heavily.

  4. Will you release an http client that uses pipes underneath?

    Replies
    1. Yes, but probably not before the end of the year. The highest priority libraries for me at the moment are bytestring/text support and parsing.

      Jeremy Shaw is working on a pipes-http server, but he is waiting on the parsing library which would make his life a lot easier. The parsing extension is not rocket science, and I already demonstrated how it works in principle in the pipes-2.4 announcement post. It's just a matter of writing it up and polishing it.

      Also, I just released pipes-3.0, which will be the base library for all these expansion libraries, so please check that out. It makes everything a lot cleaner.
