tag:blogger.com,1999:blog-17779909838478118062014-04-15T10:19:00.640-07:00Haskell for allGabriel Gonzaleznoreply@blogger.comBlogger66125tag:blogger.com,1999:blog-1777990983847811806.post-73999086784962017692014-04-04T20:33:00.000-07:002014-04-04T20:33:04.797-07:00Scalable program architectures<p>Haskell design patterns differ from mainstream design patterns in one important way:</p><ul class="incremental"><li><p><strong>Conventional architecture:</strong> Combine several components of type <code>A</code> together to generate a "network" or "topology" of type <code>B</code></p></li><li><p><strong>Haskell architecture:</strong> Combine several components of type <code>A</code> together to generate a new component of the same type <code>A</code>, indistinguishable in character from its constituent parts</p></li></ul><p>This distinction affects how the two architectural styles evolve as code bases grow.</p><p>The conventional architecture requires layering abstraction on top of abstraction:</p><blockquote><p>Oh no, these <code>B</code>s are not connectable, so let's make a network of <code>B</code>s and call that a <code>C</code>.</p><p>Well, I want to assemble several <code>C</code>s, so let's make a network of <code>C</code>s and call that a <code>D</code>.</p><p>...</p></blockquote><p>Wash, rinse, and repeat until you have an unmanageable tower of abstractions.</p><p>With a Haskell-style architecture, you don't need to keep layering on abstractions to preserve combinability. When you combine things together the result is still itself combinable. You don't distinguish between components and networks of components.</p><p>In fact, this principle should be familiar to anybody who knows basic arithmetic. When you combine a bunch of numbers together you get back a number:</p><pre><code>3 + 4 + 9 = 16</code></pre><p>Zero or more numbers go in and exactly one number comes out. The resulting number is itself combinable. 
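The same check can be written down in Haskell (a trivial sketch of my own, not from the original post):

```haskell
-- Combining numbers with (+) yields a number that is still combinable
total :: Int
total = 3 + 4 + 9

-- `sum` falls back on 0 when given no numbers at all
nothingIn :: Int
nothingIn = sum []
```

Evaluating <code>total</code> gives <code>16</code> and <code>nothingIn</code> gives <code>0</code>: exactly one combinable number comes out, whether zero or more numbers went in.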
You don't have to learn about "web"s of numbers or "web"s of "web"s of numbers.</p><p>If elementary school children can master this principle, then perhaps we can, too. How can we make programming more like addition?</p><p>Well, addition is simple because we have <code>(+)</code> and <code>0</code>. <code>(+)</code> ensures that we can always convert <strong>more than</strong> one number into exactly one number:</p><pre><code>(+) :: Int -> Int -> Int</code></pre><p>... and <code>0</code> ensures that we can always convert <strong>less than</strong> one number into exactly one number by providing a suitable default:</p><pre><code>0 :: Int</code></pre><p>This will look familiar to Haskell programmers: these type signatures resemble the methods of the <code>Monoid</code> type class:</p><pre><code>class Monoid m where<br /> -- `mappend` is analogous to `(+)`<br /> mappend :: m -> m -> m<br /><br /> -- `mempty` is analogous to `0`<br /> mempty :: m</code></pre><p>In other words, the <code>Monoid</code> type class is the canonical example of this Haskell architectural style. We use <code>mappend</code> and <code>mempty</code> to combine 0 or more <code>m</code>s into exactly 1 <code>m</code>. The resulting <code>m</code> is still combinable.</p><p>Not every Haskell abstraction implements <code>Monoid</code>, nor does it have to, because category theory takes this basic <code>Monoid</code> idea and generalizes it to more powerful domains. Each generalization retains the same basic principle of preserving combinability.</p><p>For example, a <code>Category</code> is just a typed monoid, where not all combinations type-check:</p><pre><code>class Category cat where<br /> -- `(.)` is analogous to `(+)`<br /> (.) :: cat b c -> cat a b -> cat a c<br /><br /> -- `id` is analogous to `0`<br /> id :: cat a a</code></pre><p>... 
a <code>Monad</code> is like a monoid where we combine functors "vertically":</p><pre><code>-- Slightly modified from the original type class<br />class Functor m => Monad m where<br /> -- `join` is analogous to `(+)`<br /> join :: m (m a) -> m a<br /><br /> -- `return` is analogous to `0`<br /> return :: a -> m a</code></pre><p>... and an <code>Applicative</code> is like a monoid where we combine functors "horizontally":</p><pre><code>-- Greatly modified, but equivalent to, the original type class<br />class Functor f => Applicative f where<br /> -- `mult` is analogous to `(+)`<br /> mult :: f a -> f b -> f (a, b)<br /><br /> -- `unit` is analogous to `0`<br /> unit :: f ()</code></pre><p>Category theory is full of generalized patterns like these, all of which try to preserve that basic intuition we had for addition. We convert <strong>more than</strong> one <em>thing</em> into exactly one <em>thing</em> using something that resembles addition and we convert <strong>less than</strong> one <em>thing</em> into exactly one <em>thing</em> using something that resembles zero. Once you learn to think in terms of these patterns, programming becomes as simple as basic arithmetic: combinable components go in and exactly one combinable component comes out.</p><p>These abstractions scale limitlessly because they always preserve combinability, so we never need to layer further abstractions on top. This is one reason why you should learn Haskell: you learn how to build flat architectures.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com23tag:blogger.com,1999:blog-1777990983847811806.post-18492224426832356202014-04-01T04:49:00.000-07:002014-04-01T04:49:21.697-07:00Worst practices are viral for the wrong reasons<p>This short post describes my hypothesis for why poor software engineering practices are more viral than good ones. 
Simply stated:</p><blockquote><p>Worst practices spread more quickly because they require more attention</p></blockquote><p>The following two sections outline specific manifestations of this problem and how they promote the dissemination of worst practices.</p><h4 id="management">Management</h4><p><em>Corollary 1: Teams with the greatest technical debt mentor the most employees.</em></p><p><strong>Scenario 1:</strong> You manage a team that ships an important feature, but with poor execution. You can make a compelling case to management that your team needs more head count to continue supporting this feature. More employees reporting to you increases your chances of promotion, and you enjoy rewarding opportunities to mentor new software engineers in your specific brand of programming.</p><p><strong>Scenario 2:</strong> You manage a team that ships an important feature in the exact same amount of time, but with excellent execution. Your code requires little maintenance, so little, in fact, that your team is now obsolete. If you're lucky, you are retasked to work on a new project; if you're unlucky, you are fired because your skills are no longer relevant. Training others how to write software of excellent quality is out of the question either way.</p><p>In other words, software engineers who produce excellent code cannot easily pass on their wisdom to the next generation of programmers unless they go into academia.</p><h4 id="open-source-projects">Open source projects</h4><p><em>Corollary 2: Poorly implemented libraries or programming languages generate more buzz.</em></p><p><strong>Scenario 1:</strong> You write a useful library or programming language, but it is incredibly buggy and problematic. This generates several scathing blog posts criticizing your work, prompting others to clarify in response how to work around those issues. Don't forget to check Stack Overflow, which is full of people asking for help regarding how to use your project. 
Did you know that people measure popularity in terms of Stack Overflow questions?</p><p><strong>Scenario 2:</strong> You release a useful library or programming language at the same time, except that through some miracle you get it correct on the first try. Version 1.0 is your first and last release. That means you can't advertise your library through announcement posts for subsequent releases, and there are only so many ways that you can say "it works" before people accuse you of shameless advertising.</p><p>In other words, well-written open source projects can become victims of their own success, with fewer opportunities to market themselves.</p><h4 id="conclusion">Conclusion</h4><p>Obviously, the above examples are hyperbole, and reality is somewhere in the middle. There is no perfect software project or team, and there is a limit to how much crap a company or user will tolerate.</p><p>However, I write this because I value simple and understandable software and I want to start a discussion on how software engineering can culturally reward correctness and reliability. We need to think of ways to encourage programming excellence to spread.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com11tag:blogger.com,1999:blog-1777990983847811806.post-63979716092900649272014-03-25T00:18:00.000-07:002014-03-25T00:18:52.660-07:00Introductions to advanced Haskell topics<p>Many people bemoan the sharp divide between experts and beginning Haskell programmers. One thing I've noticed is that "advanced" Haskell topics all have one thing in common: there exists only one good tutorial on that topic and only the experts have found it. This post is a collection of links to what I consider to be the definitive tutorials on these topics.</p><h4 id="monads">Monads</h4><p>Monads are technically not an advanced Haskell topic, but I include this topic in the list because the vast majority of monad tutorials are terrible and/or misleading. 
In my opinion, the <a href="http://blog.sigfpe.com/2006/08/you-could-have-invented-monads-and.html">best monad tutorial (Titled: "You could have invented monads")</a> was written 8 years ago by Dan Piponi and nobody has ever improved on it since.</p><h4 id="monad-transformers">Monad Transformers</h4><p>One thing that perpetually boggles me is that the blogosphere has almost no posts explaining how to use monad transformers. In fact, the only good explanation I know of is a <a href="http://www.cs.virginia.edu/~wh5a/personal/Transformers.pdf">very simple paper by Martin Grabmuller (Titled: "Monad Transformers - Step by Step")</a>. Most beginners will skip over an academic paper like this when searching for tutorials since most people don't associate the word "academic" with beginner-friendly. However, don't let that fool you: this paper is very well written and easy to read.</p><h4 id="parsing">Parsing</h4><p>Outsiders comment that monads only exist to paper over Haskell's weaknesses until they discover monadic parsers. The truly mind-blowing moment comes when you read Hutton and Meijer's <a href="http://www.cs.uwyo.edu/~jlc/courses/3015/parser_pearl.pdf">functional pearl on monadic parsing (Titled: "Monadic Parsing in Haskell")</a> and learn how simple they are to implement. They walk through how to implement a backtracking parser, how to implement the <code>Monad</code> type class for this parser, and how to define several parsing primitives and chain them together into more sophisticated functionality. After you read this functional pearl you will never doubt the significance of monads ever again.</p><p>You also want to understand parser combinators for another reason: Haskell parser combinator libraries are light-years ahead of Haskell regular expression libraries. This confuses many people who come to Haskell from a language that uses regular expressions heavily (such as Python or Perl) and they wonder why regular expression support is so bad in Haskell. 
The reason is that nearly all Haskell programmers use parser combinator libraries such as <a href="http://hackage.haskell.org/package/parsec">parsec</a> or <a href="http://hackage.haskell.org/package/attoparsec">attoparsec</a>.</p><h4 id="free-monads">Free Monads</h4><p>This is the only case where I will plug a post of my own. I wrote a <a href="http://www.haskellforall.com/2012/06/you-could-have-invented-free-monads.html">tutorial on free monads (Titled: "Why free monads matter")</a> because back when I was trying to understand free monads I could not find any beginner-friendly explanation. Everything on this subject was written for a target audience of mathematicians and would motivate their use cases using mathematical examples.</p><p>I wanted to explain the motivation behind free monads in terms that a programmer would understand and my post explains the topic in terms of creating a syntax tree and interpreting that syntax tree. Understanding free monads will greatly improve your understanding of how to build monadic domain-specific languages within Haskell. Also, many people comment that understanding free monads greatly improves their mental model for Haskell <code>IO</code> as well.</p><h4 id="coroutines">Coroutines</h4><p>Haskell coroutine libraries (like <code>pipes</code>/<code>conduit</code>/<code>machines</code>) have what appear to be impenetrable implementations ... until you read Mario Blazevic's <a href="http://themonadreader.files.wordpress.com/2011/10/issue19.pdf">article on coroutines in issue 19 of The Monad Reader (Titled: "Coroutine Pipelines")</a>.</p><p>In that article he introduces the <code>Coroutine</code> type (which is now more commonly known as the free monad transformer) which unifies every coroutine implementation into a single pattern. 
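For reference, the type introduced there looks roughly like this (my own paraphrase; the article's naming may differ slightly):

```haskell
-- Each step runs in the base monad `m` and either suspends with a
-- functor `f` full of resumption points, or finishes with a result `a`
newtype Coroutine f m a = Coroutine
    { resume :: m (Either (f (Coroutine f m a)) a) }
```

Choosing different functors for `f` recovers the different coroutine flavors (yielding, awaiting, or both).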
Once you understand his article, you will understand the internals of coroutine libraries and you will be able to easily write a streaming library of your own.</p><h4 id="lenses">Lenses</h4><p>Lens tutorials are the new monad tutorials, and the one mistake that all lens tutorials make is that none of them clearly explain <strong>how lenses work</strong>. The only post that ever got this right was the original post by Twan van Laarhoven that first introduced the <a href="http://twanvl.nl/blog/haskell/cps-functional-references">modern formulation of lenses (Titled: "CPS Functional references")</a>. The problem is that post is incredibly hard to find even through a dedicated Google search for lens tutorials because the title is obscure and the post only has a single mention of the word "lens". I actually had trouble finding this post even by browsing his blog archive because I didn't recognize the title.</p><p>Honorable mention: The <a href="http://hackage.haskell.org/package/lens-family-core"><code>lens-family-core</code> library</a> is the only lens library that actually hews closely to Twan's original presentation, so I highly recommend that you study its source code to further improve your understanding.</p><h4 id="church-encoding">Church encoding</h4><p>I still can't find a good blog post on Church encoding, so I will not link to anything. However, once you learn how to Church encode a data type you will suddenly see this pattern <strong>everywhere</strong>.</p><p>Equally important, many high-performance Haskell libraries (like <code>attoparsec</code>) use Church encoding for efficiency reasons and understanding Church encoding will allow you to understand their internal implementations.</p><h4 id="conclusion">Conclusion</h4><p>I'm really disappointed that so many critical Haskell topics only have a single source of truth that is difficult to find. 
I hypothesize that the problem is that Haskell is rooted in computer science academia, which discourages duplicate publications on the same topic. If you want to prove me wrong, then please take the time to increase the number of high-quality Haskell tutorials on these and other key topics.</p><p>Another troubling trend is the tendency for people to bandwagon on particular tutorial topics that are sexy (first monads, now lenses). I wish people would spend more time diversifying coverage on more mundane topics, like how to apply well-established libraries to architect various types of applications.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com8tag:blogger.com,1999:blog-1777990983847811806.post-60782223158403310552014-03-03T22:58:00.002-08:002014-03-04T19:25:19.271-08:00How to model handles with pipesI receive repeated questions about how to implement a <tt>Handle</tt>-like API similar to <tt>io-streams</tt> using <tt>pipes</tt> even after I outlined how to do this <a href="http://www.haskellforall.com/2013/04/pipes-and-io-streams.html">about a year ago</a>. However, I don't blame these people because that tutorial uses an out-of-date (and uglier) <tt>pipes-3.2</tt> API and also takes an approach that I felt was a bit inconsistent and missed the mark, so this is a rewrite of the original post, both to upgrade the code to the latest <tt>pipes-4.1</tt> API and to make some key changes to the original idioms to improve conceptual consistency.<br /><br />I usually find that the best way to introduce this <tt>Handle</tt>-like intuition for <tt>pipes</tt> is to make an analogy to <tt>io-streams</tt>, so this post assumes that you are familiar with both <tt>pipes</tt> and <tt>io-streams</tt>. 
If not, then you can read the <a href="http://hackage.haskell.org/package/pipes-4.1.0/docs/Pipes-Tutorial.html">pipes tutorial</a> or the <a href="http://hackage.haskell.org/package/io-streams-1.1.4.2/docs/System-IO-Streams-Tutorial.html">io-streams tutorial</a>, both of which I wrote! I want to really emphasize that I mainly discuss <tt>io-streams</tt> as a stepping stone for understanding the equivalent <tt>pipes</tt> abstractions, but the purpose is not to reimplement <tt>io-streams</tt> in <tt>pipes</tt>. Rather I want to introduce a more general <tt>IO</tt>-free abstraction that just happens to overlap a bit with <tt>io-streams</tt>.<br /><br /><br /><h4>Types</h4><br />I'll begin with the correspondence between <tt>io-streams</tt> and <tt>pipes</tt> types: <pre><br />import Pipes<br /><br />type InputStream a = Effect IO (Maybe a)<br /><br />type OutputStream a = Maybe a -> Effect IO ()<br /><br />makeInputStream :: IO (Maybe a) -> InputStream a<br />makeInputStream = lift<br /><br />makeOutputStream :: (Maybe a -> IO ()) -> OutputStream a<br />makeOutputStream f = lift . f<br /></pre>For example, the <tt>pipes</tt> analogs of <tt>stdin</tt> and <tt>stdout</tt> from <tt>io-streams</tt> would be: <pre><br />import qualified Data.ByteString as B<br />import qualified System.IO as IO<br />import qualified Data.Foldable as F<br /><br />stdin :: InputStream B.ByteString<br />stdin = makeInputStream $ do<br /> bs <- B.hGetSome IO.stdin 4096<br /> return (if B.null bs then Nothing else Just bs)<br /><br />stdout :: OutputStream B.ByteString<br />stdout = makeOutputStream (F.mapM_ B.putStr)<br /></pre>Notice how this does <b>not</b> convert the read and write actions to repetitive streams. Instead, you just trivially wrap them using <tt>lift</tt>, promoting them to <tt>Effect</tt>. 
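For example (my own sketch, assuming the <tt>stdin</tt> defined above), one <tt>runEffect</tt> performs exactly one read:

```haskell
-- runEffect unwraps an Effect back into its base monad, so this
-- performs a single read of at most 4096 bytes
readOnce :: IO (Maybe B.ByteString)
readOnce = runEffect stdin
```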
However, for the rest of this post I will use <tt>Effect</tt> instead of <tt>InputStream</tt> and <tt>OutputStream</tt> to avoid confusion between the two libraries.<br /><br />Also, the <tt>pipes</tt> implementations of <tt>makeInputStream</tt> and <tt>makeOutputStream</tt> are pure, whereas their <tt>io-streams</tt> counterparts require <tt>IO</tt>. In fact, <tt>io-streams</tt> fudges a little bit and implements <tt>stdin</tt> and <tt>stdout</tt> in terms of <tt>unsafePerformIO</tt>, otherwise they would have these types: <pre><br />stdin :: IO (InputStream ByteString)<br />stdout :: IO (OutputStream ByteString)<br /></pre><tt>io-streams</tt> is tightly integrated with <tt>IO</tt> (thus the name), but I believe we can do better.<br /><br /><br /><h4>Reading/Writing</h4><br />The <tt>pipes</tt> analogs of the <tt>io-streams</tt> <tt>read</tt> and <tt>write</tt> commands are <tt>await</tt> and <tt>yield</tt>: <pre><br />import Prelude hiding (read)<br /><br />read :: Monad m => Consumer (Maybe a) m (Maybe a)<br />read = await<br /><br />write :: Monad m => Maybe a -> Producer (Maybe a) m ()<br />write = yield<br /></pre>Here are examples of how you might use them: <pre><br />import Control.Applicative<br />import Data.Monoid<br /><br />-- `Consumer (Maybe a) m (Maybe b)` is the analog of<br />-- `InputStream a -> IO (InputStream b)`<br />combine<br /> :: Monad m<br /> => Consumer (Maybe B.ByteString) m (Maybe B.ByteString)<br />combine = do<br /> mstr1 <- read<br /> mstr2 <- read<br /> return $ liftA2 (<>) mstr1 mstr2<br /><br />-- `Maybe a -> Producer (Maybe b) m ()` is the analog of<br />-- `OutputStream b -> IO (OutputStream a)`<br />duplicate :: Monad m => Maybe a -> Producer (Maybe a) m ()<br />duplicate ma = do<br /> write ma<br /> write ma<br /></pre>Compare to <tt>io-streams</tt>: <pre><br />combine :: InputStream B.ByteString -> IO (Maybe B.ByteString)<br />combine is = do<br /> mbs1 <- read is<br /> mbs2 <- read is<br /> return $ liftA2 (<>) mbs1 mbs2<br 
/><br />duplicate :: OutputStream a -> a -> IO ()<br />duplicate os a = do<br /> write a os<br /> write a os<br /></pre>Unlike <tt>io-streams</tt>, <tt>pipes</tt> does not need to specify the upstream source for <tt>read</tt> commands. This is because we can supply <tt>read</tt> with an input stream using <tt>(>~)</tt>: <pre><br />>>> runEffect $ stdin >~ read<br />Test<Enter><br />Just "Test\n"<br />>>> runEffect $ stdin >~ combine<br />Hello<Enter><br />world!<Enter><br />Just "Hello\nworld!\n"<br /></pre>In fact, the <tt>read</tt> command in the first example is totally optional, because <tt>read</tt> is just a synonym for <tt>await</tt>, and one of the <tt>pipes</tt> laws states that: <pre><br />f >~ await = f<br /></pre>Therefore, we could instead just write: <pre><br />>>> runEffect stdin<br />Test<Enter><br />Just "Test\n"<br /></pre>Dually, we don't need to specify the downstream sink for <tt>write</tt>. This is because you connect <tt>write</tt> with an output stream using <tt>(~>)</tt>: <pre><br />>>> :set -XOverloadedStrings<br />>>> runEffect $ (write ~> stdout) (Just "Test\n")<br />Test<br />>>> runEffect $ (duplicate ~> stdout) (Just "Test\n")<br />Test<br />Test<br /></pre>Again, the <tt>write</tt> in the first example is optional, because <tt>write</tt> is just a synonym for <tt>yield</tt> and one of the <tt>pipes</tt> laws states that: <pre><br />yield ~> f = f<br /></pre>Therefore, we can simplify it to: <pre><br />>>> runEffect $ stdout (Just "Test")<br />Test<br /></pre><br /><h4>Mixed reads and writes</h4><br />Like <tt>io-streams</tt>, we do not need to specify the entire streaming computation up front. We can step these input and output streams one value at a time using <tt>runEffect</tt> instead of consuming them all in one go. 
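Stepping looks like this (a sketch of my own, assuming the <tt>stdin</tt> and <tt>stdout</tt> defined earlier):

```haskell
-- Step the input stream once, then step the output stream once;
-- arbitrary IO can happen in between the two steps
echoOnce :: IO ()
echoOnce = do
    mbs <- runEffect stdin
    runEffect (stdout mbs)
```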
However, you lose no power and gain a lot of elegance by writing everything entirely within <tt>pipes</tt>.<br /><br />For example, just like <tt>io-streams</tt>, you can freely mix <tt>read</tt>s and <tt>write</tt>s from multiple diverse sources: <pre><br />inputStream1 :: Effect IO (Maybe A)<br />inputStream2 :: Effect IO (Maybe B)<br /><br />mixRead :: Effect IO (Maybe A, Maybe B)<br />mixRead = do<br /> ma <- inputStream1 >~ read<br /> mb <- inputStream2 >~ read<br /> return (ma, mb)<br /><br />outputStream1 :: Maybe a -> Effect IO ()<br />outputStream2 :: Maybe a -> Effect IO ()<br /><br />mixWrite :: Maybe a -> Effect IO ()<br />mixWrite ma = do<br /> (write ~> outputStream1) ma<br /> (write ~> outputStream2) ma<br /></pre>Oh, wait! Didn't we just get through saying that you can simplify these? <pre><br />-- Reminder:<br />-- f >~ read = f<br />-- write ~> f = f<br /><br />mixRead :: Effect IO (Maybe A, Maybe B)<br />mixRead = do<br /> ma <- inputStream1<br /> mb <- inputStream2<br /> return (ma, mb)<br /><br />mixWrite :: Maybe a -> Effect IO ()<br />mixWrite ma = do<br /> outputStream1 ma<br /> outputStream2 ma<br /></pre>In other words, if we stay within <tt>pipes</tt> we can read and write from input and output streams just by directly calling them. Unlike <tt>io-streams</tt>, we don't need to use <tt>read</tt> and <tt>write</tt> when we select a specific stream. We only require <tt>read</tt> and <tt>write</tt> if we wish to inject the input or output stream later on.<br /><br /><br /><h4>Transformations</h4><br />What's neat about the <tt>pipes</tt> idioms is that you use the exact same API to implement and connect transformations on streams. 
For example, here's how you implement the <tt>map</tt> function from <tt>io-streams</tt> using <tt>pipes</tt>: <pre><br />map :: Monad m => (a -> b) -> Consumer (Maybe a) m (Maybe b)<br />map f = do<br /> ma <- read<br /> return (fmap f ma)<br /></pre>Compare that to the <tt>io-streams</tt> implementation of <tt>map</tt>: <pre><br />map :: (a -> b) -> InputStream a -> IO (InputStream b)<br />map f is = makeInputStream $ do<br /> ma <- read is<br /> return (fmap f ma)<br /></pre>Some differences stand out: <ul><li> The <tt>pipes</tt> version does not need to explicitly thread the input stream <li> The <tt>pipes</tt> version does not require <tt>makeInputStream</tt><li> The <tt>pipes</tt> transformation is pure (meaning no <tt>IO</tt>) </ul>Now watch how we apply this transformation: <pre><br />>>> runEffect $ stdin >~ map (<> "!")<br />Test<Enter><br />Just "Test!\n"<br /></pre>The syntax for implementing and connecting <tt>map</tt> is completely indistinguishable from the syntax we used in the previous section to implement and connect <tt>combine</tt>. And why should there be a difference? Both of them read from a source and return a derived value. Yet, <tt>io-streams</tt> distinguishes between things that transform input streams and things that read from input streams, both in the types and implementation: <pre><br />iReadFromAnInputStream :: InputStream a -> IO b<br />iReadFromAnInputStream is = do<br /> ma <- read is<br /> ...<br /><br />iTransformAnInputStream :: InputStream a -> IO (InputStream b)<br />iTransformAnInputStream is = makeInputStream $ do<br /> ma <- read is<br /> ...<br /></pre>This means that <tt>io-streams</tt> code cannot reuse readers as input stream transformations or, vice versa, reuse input stream transformations as readers.<br /><br />As you might have guessed, the dual property holds for writers and output stream transformations. 
In <tt>pipes</tt>, we model both of these using a <tt>Producer</tt> Kleisli arrow: <pre><br />contramap<br /> :: Monad m => (a -> b) -> Maybe a -> Producer (Maybe b) m ()<br />contramap f ma = yield (fmap f ma)<br /></pre>... and we connect them the same way: <pre><br />>>> runEffect $ (contramap (<> "!\n") ~> stdout) (Just "Test")<br />Test!<br /></pre>... whereas <tt>io-streams</tt> distinguishes these two concepts: <pre><br />iWriteToAnOutputStream :: OutputStream b -> Maybe a -> IO ()<br />iWriteToAnOutputStream os ma = do<br /> write os ma<br /> ...<br /> <br />iTransformAnOutputStream :: OutputStream b -> IO (OutputStream a)<br />iTransformAnOutputStream os = makeOutputStream $ \ma -> do<br /> write os ma<br /> ...<br /></pre><br /><h4>End of input</h4><br />None of this machinery is specific to the <tt>Maybe</tt> type. I only used <tt>Maybe</tt> for analogy with <tt>io-streams</tt>, which uses <tt>Maybe</tt> to signal end of input. For the rest of this post I will just drop the <tt>Maybe</tt>s entirely to simplify the type signatures, meaning that I will model the relevant types as: <pre><br />inputStream :: Effect m a<br /><br />outputStream :: a -> Effect m ()<br /><br />inputStreamTransformation :: Consumer a m b<br /><br />outputStreamTransformation :: a -> Producer b m ()<br /></pre>This is also the approach that my upcoming <tt>pipes-streams</tt> library will take, because <tt>pipes</tt> does not use <tt>Maybe</tt> to signal end of input. Instead, if you have a finite input you represent that input as a <tt>Producer</tt>: <pre><br />finiteInput :: Producer a m ()<br /></pre>What's great is that you can still connect this to an output stream, using <tt>for</tt>: <pre><br />for finiteInput outputStream :: Effect m ()<br /></pre>Dually, you model a finite output as a <tt>Consumer</tt>: <pre><br />finiteOutput :: Consumer a m ()<br /></pre>... 
and you can connect that to an input stream using <tt>(>~)</tt>: <pre><br />inputStream >~ finiteOutput :: Effect m ()<br /></pre>These provide a convenient bridge between classic <tt>pipes</tt> abstractions and handle-like streams.<br /><br /><br /><h4>Polymorphism</h4><br />Interestingly, we can use the same <tt>(>~)</tt> operator to: <ul><li> connect input streams to input stream transformations, and: <li> connect input stream transformations to each other. </ul>This is because <tt>(>~)</tt> has a highly polymorphic type which you can specialize down to two separate types: <pre><br />-- Connect an input stream to an input stream transformation,<br />-- returning a new input stream<br />(>~) :: Monad m<br /> => Effect m a<br /> -> Consumer a m b<br /> -> Effect m b<br /><br />-- Connect two input stream transformations together,<br />-- returning a new input stream transformation<br />(>~) :: Monad m<br /> => Consumer a m b<br /> -> Consumer b m c<br /> -> Consumer a m c<br /></pre>Dually, we can use the same <tt>(~>)</tt> operator to: <ul><li> connect output stream transformations to output streams, and: <li> connect output stream transformations to each other. </ul>Again, this is because <tt>(~>)</tt> has a highly polymorphic type which you can specialize down to two separate types: <pre><br />-- Connect an output stream transformation to an output stream,<br />-- returning a new output stream<br />(~>) :: Monad m<br /> => (a -> Producer b m ())<br /> -> (b -> Effect m ())<br /> -> (a -> Effect m ())<br /> <br />-- Connect two output stream transformations together,<br />-- returning a new output stream transformation<br />(~>) :: Monad m<br /> => (a -> Producer b m ())<br /> -> (b -> Producer c m ())<br /> -> (a -> Producer c m ())<br /></pre>In fact, you will see these specialized type signatures in the documentation for <tt>(>~)</tt> and <tt>(~>)</tt>. 
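For instance, the second specialization of <tt>(>~)</tt> says that chaining two input stream transformations yields another input stream transformation (a sketch of my own, reusing the <tt>Maybe</tt>-based <tt>map</tt> defined earlier):

```haskell
-- Fuse two maps into a single input stream transformation
-- (assumes the Prelude map is hidden in favor of the map above)
mapTwice
    :: Monad m
    => (a -> b) -> (b -> c) -> Consumer (Maybe a) m (Maybe c)
mapTwice f g = map f >~ map g
```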
Hopefully, the analogy to input streams and output streams will help you see those types in a new light.<br /><br /><br /><h4>Categories</h4><br />People familiar with <tt>pipes</tt> know that <tt>await/read</tt> and <tt>(>~)</tt> form a category, so we get three equations for free from the three category laws: <pre><br />-- Reading something = calling it<br />f >~ read = f<br /><br />-- `read` is the identity input stream transformation<br />read >~ f = f<br /><br />-- It doesn't matter how you group sinks/transformations<br />(f >~ g) >~ h = f >~ (g >~ h)<br /></pre>Dually, <tt>yield/write</tt> and <tt>(~>)</tt> form a category, so we get another three equations for free: <pre><br />-- Writing something = calling it<br />write ~> f = f<br /><br />-- `write` is the identity output stream transformation<br />f ~> write = f<br /><br />-- It doesn't matter how you group sources/transformations<br />(f ~> g) ~> h = f ~> (g ~> h)<br /></pre>Those are really nice properties to use when equationally reasoning about handles, and as we've already seen, they allow you to simplify your code in unexpected ways.<br /><br /><br /><h4>Bidirectionality</h4><br />But wait, there's more! <tt>pipes</tt> actually has a fully bidirectional core in the <tt>Pipes.Core</tt> module. You can generalize everything above to use this bidirectional interface with surprising results. 
For example, in the bidirectional version of <tt>pipes</tt>-based streams, you can have streams that both accept input and produce output upon each step: <pre><br />inputOutputStream :: a -> Effect m b<br /></pre>You also get the following generalized versions of all commands and operators: <ul><li> <tt>request</tt> generalizes <tt>read</tt>, accepting an argument for upstream <li> <tt>respond</tt> generalizes <tt>write</tt>, returning a value from downstream <li> <tt>(\>\)</tt> generalizes <tt>(>~)</tt><li> <tt>(/>/)</tt> generalizes <tt>(~>)</tt></ul>I will leave that as a teaser for now and I will discuss this in greater detail once I finish writing a new tutorial for the underlying bidirectional implementation.<br /><br /><br /><h4>Conclusion</h4><br /><tt>pipes</tt> provides an elegant way to model <tt>Handle</tt>-like abstractions in a very compact way that requires no new extensions to the API. I will soon provide a <tt>pipes-streams</tt> library that implements utilities along these lines to further solidify these concepts.<br /><br />One thing I hope people take away from this is that a <tt>Handle</tt>-based API need not be deeply intertwined with <tt>IO</tt>. 
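To preview what that means in practice, here is a hypothetical sketch of an input stream whose base monad is the pure <tt>State</tt> monad rather than <tt>IO</tt>:

```haskell
import Pipes
import Control.Monad.Trans.State

-- Pop one element off the state per step, signalling end of
-- input with Nothing, just like the IO-based streams above
pureInput :: Effect (State [a]) (Maybe a)
pureInput = lift $ do
    xs <- get
    case xs of
        []   -> return Nothing
        y:ys -> put ys >> return (Just y)
```

Running <tt>evalState (runEffect pureInput) [1, 2, 3]</tt> returns <tt>Just 1</tt>, with no <tt>IO</tt> anywhere in sight.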
All the commands and connectives from <tt>pipes</tt>-based handles are completely independent of the base monad and you can freely generalize the <tt>pipes</tt> notion of handles to pure monads.Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-52369692331762324872014-02-22T12:15:00.000-08:002014-02-22T12:15:47.030-08:00Reasoning about stream programming<p>This post answers a question people sometimes ask me about <code>pipes</code>, which I will paraphrase here:</p><blockquote><p>If resource management is not a core focus of <code>pipes</code>, why should I use <code>pipes</code> instead of lazy IO?</p></blockquote><p>Many people who ask this question discovered stream programming through Oleg, who framed the lazy IO problem in terms of resource management. However, I never found this argument compelling in isolation; you can solve most resource management issues simply by separating resource acquisition from the lazy IO, like this:</p><pre><code>import System.IO<br /><br />main = withFile "example.txt" ReadMode $ \handle -> do<br /> str <- hGetContents handle<br /> doSomethingWith str</code></pre><p>You can now easily reason about resource management: the handle lifetime matches the duration of the callback provided to <code>withFile</code>. This solves 95% of resource management scenarios, and libraries like <code>resourcet</code>/<code>conduit</code>/<code>pipes-safe</code> handle the more exotic cases.</p><p>If lazy IO were only a problem for 5% of all use cases, I wouldn't object to its use. Rather, resource management issues are symptoms of a larger problem: lazy IO conflicts with equational reasoning about ordering of effects.</p><h4 id="the-issue-with-lazy-io">The issue with lazy IO</h4><p>Lazy IO is teaching an entire generation of Haskell programmers to "not think very hard" about ordering of side effects.
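To make this concrete, here is a minimal sketch (plain <code>base</code>; the file name is a throwaway) of how lazy IO decouples effect order from source order: the actual read is deferred until the string is forced, which happens only after <code>withFile</code> has already closed the handle.

```haskell
import System.IO

-- Lazy IO defers the real read until the string is demanded.  By the
-- time `length` forces the contents below, `withFile` has already
-- closed the handle, so the lazily-read string is silently truncated.
readLazily :: FilePath -> IO String
readLazily path = withFile path ReadMode hGetContents

main :: IO ()
main = do
    writeFile "example.txt" "hello"
    str <- readLazily "example.txt"
    print (length str)  -- prints 0, not 5: the read effect ran "too late"
```

Nothing in the source code hints that `length` is what triggers the file read, which is exactly the reasoning hazard at issue here.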
Some consider this a feature, but I believe it is a disadvantage because it conflicts with the original spirit of Haskell IO: preserving equational reasoning. Lazy IO ties the effect order to evaluation order, and equational reasoning does not usually preserve evaluation order.</p><p>In my eyes, Haskell's treatment of IO was one of the major revolutions in programming language theory, because for the first time ever one could formally reason about side effects within the language rather than building an external formal model. This empowered lay programmers to <a href="http://www.haskellforall.com/2013/12/equational-reasoning.html">reason about side effects using simple substitution</a> and greatly lowered the entry barrier to formal reasoning in general.</p><p>Consequently, I was very disappointed to see Haskell squander this immense progress by teaching new programmers to abandon reasoning about ordering of side effects through the use of lazy IO. We should be democratizing computer science by helping the average programmer be precise, rather than encouraging imprecision and instructing newcomers to defer precision to the experts.</p><h4 id="pipes">Pipes</h4><p>I built <code>pipes</code> to restore equational reasoning to stream programming. Besides eschewing lazy IO, <code>pipes</code> also promotes equational reasoning in other ways:</p><ul class="incremental"><li><code>pipes</code> obeys several laws inspired by category theory that simplify equational rearrangements</li><li>The core <code>pipes</code> implementation is minimal</li><li><code>pipes</code> extensions are <code>pipes</code>-independent, so you can reason about them orthogonally</li></ul><p><code>pipes</code> does not come with every bell and whistle pre-installed because <code>pipes</code> is a minimal substrate upon which other streaming abstractions can be built. 
This reflects the Haskell ethos of only using the minimum power necessary for the task at hand, which improves correctness and minimizes surprise.</p><p>Therefore, you can think of <code>pipes</code> as a gateway drug to equationally reasoning about stream processing. <code>pipes</code> empowers you to be correct and rigorous so that you can spend less time chasing bugs and more time releasing features.</p><p>Those who are interested in equationally reasoning about stream programming can read the <a href="http://hackage.haskell.org/package/pipes-4.1.0/docs/Pipes-Tutorial.html">pipes tutorial</a>, which illustrates how many common streaming abstractions we take for granted possess beautiful equational properties that we can harness for real applications.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-65628042336552004752014-02-08T22:09:00.000-08:002014-02-09T08:32:44.148-08:00pipes-http-1.0: Streaming HTTP/HTTPS clients<p>The <code>pipes-http</code> package now provides an HTTP client for <code>pipes</code>. This was made possible by Michael Snoyman, who released <code>http-client</code> and <code>http-client-tls</code>, which are a <code>conduit</code>-independent subset of <code>http-conduit</code>. I wanted to thank Michael for releasing those packages, which greatly simplified my job. I even get TLS support for free thanks to him, which is incredible.</p>
Hopefully this will encourage shared contributions to those libraries from both <code>pipes</code> and <code>conduit</code> users.</p><p>I'll show the library in action with an example of how you would query and format the titles of the top 10 posts on <a href="http://www.reddit.com/r/haskell">/r/haskell</a>:</p><pre><code>-- haskell-reddit.hs<br /><br />{-# LANGUAGE OverloadedStrings #-}<br /><br />import Control.Lens (_Right, (^..))<br />import Control.Lens.Aeson (key, values, _String)<br />import Control.Monad.Trans.State.Strict (evalStateT)<br />import Data.Aeson.Parser (json')<br />import qualified Data.Text as T<br />import qualified Data.Text.IO as T<br />import Pipes.Attoparsec (parse)<br />import Pipes.HTTP<br /><br />main = do<br /> req <- parseUrl "http://www.reddit.com/r/haskell.json"<br /> withManager defaultManagerSettings $ \m -><br /> withHTTP req m $ \resp -> do<br /> json <- evalStateT (parse json') (responseBody resp)<br /><br /> let titles :: [T.Text]<br /> titles = json ^.. _Right<br /> . key "data"<br /> . key "children"<br /> . values<br /> . key "data"<br /> . key "title"<br /> . _String<br /><br /> mapM_ (T.putStrLn . 
format) (take 10 titles)<br /><br />format :: T.Text -> T.Text<br />format txt =<br /> if T.length txt <= columns<br /> then T.concat [bullet, txt ]<br /> else T.concat [bullet, T.take columns txt, ellipsis]<br /> where<br /> bullet = "[*] "<br /> ellipsis = "..."<br /> columns = 60 - (T.length bullet + T.length ellipsis)</code></pre><p>This example uses <code>pipes-attoparsec</code> to parse the response body as JSON and <code>lens-aeson</code> to query the JSON value for titles:</p><pre><code>$ ./haskell-reddit<br />[*] My Haskell will<br />[*] What are some common "design patterns" when it comes ...<br />[*] The Resumption and Reactive Resumption Monads<br />[*] `cabal install` failing in Vagrant<br />[*] A Lazy Evaluation - Why I Find Learning Haskell Hard<br />[*] The Essence Of Reynolds<br />[*] A test driven haskell course<br />[*] Wheb -- A WAI framework.<br />[*] I wrote a simple &lt;200 line Forth interpreter. Does...<br />[*] GHC iOS 7.8 RC1<br />$</code></pre><p>You can find <code>pipes-http</code> on <a href="http://hackage.haskell.org/package/pipes-http-1.0.0/docs/Pipes-HTTP.html">Hackage</a> or on <a href="https://github.com/Gabriel439/Haskell-Pipes-HTTP-Library">Github</a>.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-91466098384263727042014-02-05T23:56:00.000-08:002014-02-07T17:23:45.356-08:00pipes-parse-3.0: Lens-based parsing<p><code>pipes-parse-3.0.0</code> introduces a new lens-based parsing mechanism. 
These lenses improve the library in two ways:</p><ul class="incremental"><li><p>They greatly simplify the API</p></li><li><p>They correctly propagate leftovers further upstream</p></li></ul><h4 id="overview">Overview</h4><p>The new parsing API consists of three main types, which roughly parallel the three <code>pipes</code> abstractions:</p><ul class="incremental"><li><p><code>Producer</code>s, which are unchanged from <code>pipes</code></p><pre><code>Producer a m x</code></pre></li><li><p><code>Parser</code>s, which are the parsing analog of <code>Consumer</code>s:</p><pre><code>type Parser a m r = forall x . StateT (Producer a m x) m r</code></pre></li><li><p><code>Lens'</code>es between <code>Producer</code>s, which are the parsing analog of <code>Pipe</code>s:</p><pre><code>Lens' (Producer a m x) (Producer b m y)</code></pre></li></ul><p>What's neat is that <code>pipes-parse</code> does not need to provide any operators to connect these three abstractions. All of the tools you need already exist in <code>transformers</code> and <code>lens</code> (or <code>lens-family-core</code>, if you prefer a simpler <code>lens</code> alternative).</p><p>For example, you connect a <code>Parser</code> to a <code>Producer</code> using either <code>runStateT</code>, <code>evalStateT</code>, or <code>execStateT</code>:</p><pre><code>                                            +- Result<br />                                            |<br />runStateT                                   v<br />    :: Parser a m r -> Producer a m x -> m (r, Producer a m x)<br />evalStateT<br />    :: Parser a m r -> Producer a m x -> m  r<br />execStateT<br />    :: Parser a m r -> Producer a m x -> m (    Producer a m x)<br />       ^^^^^^^^^^^^    ^^^^^^^^^^^^^^           ^^^^^^^^^^^^^^<br />       Parser          Input                    Leftovers</code></pre><p>These correspond to the three possible ways you might want to run a parser:</p><ul class="incremental"><li><code>runStateT</code>: Return the result and leftovers</li><li><code>evalStateT</code>: Return only the result</li><li><code>execStateT</code>: Return only the leftovers</li></ul><p>In fact, two of these functions
closely parallel <code>conduit</code> operators:</p><ul class="incremental"><li><code>runStateT</code> parallels <code>($$+)</code></li><li><code>evalStateT</code> parallels <code>($$)</code></li></ul><p>You can also connect <code>Producer</code>s to <code>Lens'</code>es using <code>(^.)</code> or <code>view</code>:</p><pre><code>(^.) :: Producer a m x<br /> -> Lens' (Producer a m x) (Producer b m y)<br /> -> Producer b m y</code></pre><p><code>(^.)</code> parallels <code>conduit</code>'s <code>($=)</code> operator.</p><p>If you want to connect <code>Lens'</code>es to <code>Parser</code>s, you use <code>zoom</code>:</p><pre><code>zoom :: Lens' (Producer a m x) (Producer b m y)<br /> -> Parser b m r<br /> -> Parser a m r</code></pre><p><code>zoom</code> parallels <code>conduit</code>'s <code>(=$)</code> operator.</p><p>Finally, you connect <code>Lens</code>'es to each other using <code>(.)</code> (i.e. function composition):</p><pre><code>(.) :: Lens' (Producer a m x) (Producer b m y)<br /> -> Lens' (Producer b m y) (Producer c m z)<br /> -> Lens' (Producer a m x) (Producer c m z)</code></pre><p><code>(.)</code> parallels <code>conduit</code>'s <code>(=$=)</code> operator.</p><p>Here's a worked example showing off the new API, including newly-added <code>foldl</code> integration:</p><pre><code>import qualified Control.Foldl as L<br />import Lens.Family ((^.))<br />import Lens.Family.State.Strict (zoom)<br />import Pipes<br />import Pipes.Parse<br />import Prelude hiding (span, splitAt)<br /><br />parser :: Parser Int IO ()<br />parser = do<br /> -- Attempt to draw a single element<br /> a <- draw<br /> lift (print a)<br /><br /> -- Sum the next 10 input elements<br /> b <- zoom (splitAt 10) (L.purely foldAll L.sum)<br /> lift (print b)<br /><br /> -- Collect all elements less than 15 into a list<br /> c <- zoom (span (< 15)) drawAll<br /> lift (print c)<br /><br /> -- You can nest `zoom`s<br /> zoom (span (< 20)) $ do<br /> d <- zoom (splitAt 3) (L.purely foldAll
L.product)<br /> lift (print d)<br /><br /> e <- peek<br /> lift (print e)<br /><br /> -- ... or compose lenses:<br /> f <- zoom (span (< 20) . splitAt 3) drawAll<br /> lift (print f)<br /><br /> -- Print the remaining elements<br /> g <- drawAll<br /> lift (print g)<br /><br />-- Lenses can modify `Producer`s, too<br />producer :: Monad m => Producer Int m (Producer Int m ())<br />producer = each [1..] ^. span (< 25)</code></pre><p>We get the following output if we connect our <code>parser</code> to our <code>producer</code>:</p><pre><code>>>> evalStateT parser producer<br />Just 1<br />65<br />[12,13,14]<br />4080<br />Just 18<br />[18,19]<br />[20,21,22,23,24]<br />>>></code></pre><h4 id="leftovers">Leftovers</h4><p>There is a subtle difference between <code>pipes-parse</code> and <code>conduit</code>: <code>pipes-parse</code> correctly propagates leftovers further upstream whereas <code>conduit</code> does not. To illustrate this, let's begin from the following implementation of <code>peek</code> for <code>conduit</code>:</p><pre><code>import Control.Monad.Trans.Class (lift)<br />import Data.Conduit<br />import Data.Conduit.List (isolate, sourceList)<br /><br />peek :: Monad m => Sink a m (Maybe a)<br />peek = do<br /> ma <- await<br /> case ma of<br /> Nothing -> return ()<br /> Just a -> leftover a<br /> return ma</code></pre><p><code>peek</code> attempts to draw a value using <code>await</code> and then undraws the value using <code>leftover</code> before returning the result. 
<code>peek</code> will work correctly when used like this:</p><pre><code>source :: Monad m => Source m Int<br />source = sourceList [1, 2]<br /><br />sink1 :: Show a => Sink a IO ()<br />sink1 = do<br /> ma1 <- peek<br /> ma2 <- peek<br /> lift $ print (ma1, ma2)</code></pre><p>If we feed <code>source</code> to <code>sink1</code>, both <code>peek</code>s will return <code>Just 1</code>, since the first <code>peek</code> undraws the 1 to make it available for the second <code>peek</code>:</p><pre><code>>>> source $$ sink1<br />(Just 1,Just 1)</code></pre><p>But what happens if we delimit the first <code>peek</code> using <code>isolate</code>, which only allows a fixed number of values to flow through? This is a common parsing request: run a parser within a subset of the input.</p><pre><code>sink2 :: Show a => Sink a IO ()<br />sink2 = do<br /> ma1 <- isolate 10 =$ peek<br /> ma2 <- peek<br /> lift $ print (ma1, ma2)</code></pre><p>However, when you compose two conduits, the downstream conduit discards all leftovers when done. Michael is up front about this in the documentation for <code>(=$)</code>:</p><blockquote><p>"Leftover data returned from the Sink will be discarded."</p></blockquote><p>There are similar warnings for <code>($=)</code> and <code>(=$=)</code>, all of which discard leftovers from the right component. 
We can run <code>sink2</code> to trigger this behavior:</p><pre><code>>>> source $$ sink2<br />(Just 1,Just 2)</code></pre><p>The undrawn <code>1</code> is irreversibly lost when <code>(=$)</code> completes, which is why the second <code>peek</code> reads a <code>2</code> instead of a <code>1</code>.</p><p>The analogous <code>pipes</code> code gets this correct:</p><pre><code>import Lens.Family.State.Strict (zoom)<br />import Pipes<br />import Pipes.Parse<br />import Prelude hiding (splitAt)<br /><br />parser :: Show a => Parser a IO ()<br />parser = do<br /> ma1 <- zoom (splitAt 10) peek<br /> ma2 <- peek<br /> lift $ print (ma1, ma2)<br /><br />producer :: Monad m => Producer Int m ()<br />producer = each [1, 2]</code></pre><p>The <code>pipes-parse</code> version correctly restores the undrawn <code>1</code> so that the second <code>peek</code> also draws a <code>1</code>:</p><pre><code>>>> evalStateT parser producer<br />(Just 1,Just 1)</code></pre><p>The magic is in the <code>splitAt</code> function, which is the <code>pipes-parse</code> analog of <code>conduit</code>'s <code>isolate</code>. Compare the source of <code>isolate</code> (slightly rewritten):</p><pre><code>isolate :: Monad m => Int -> Conduit a m a<br />isolate = loop<br /> where<br /> loop 0 = return ()<br /> loop n = do<br /> ma <- await<br /> case ma of<br /> Nothing -> return ()<br /> Just a -> do<br /> yield a<br /> loop (n - 1)</code></pre><p>... 
to the source of <code>splitAt</code> (also slightly changed to resemble <code>isolate</code>):</p><pre><code>splitAt<br /> :: Monad m<br /> => Int<br /> -> Lens' (Producer a m x) (Producer a m (Producer a m x))<br />splitAt n0 k p0 = fmap join (k (loop n0 p0))<br /> where<br /> loop 0 p = return p<br /> loop n p = do<br /> x <- lift (next p)<br /> case x of<br /> Left r -> return (return r)<br /> Right (a, p') -> do<br /> yield a<br /> loop (n - 1) p'</code></pre><p>The internal <code>loop</code> function in <code>splitAt</code> corresponds to the internal <code>loop</code> from <code>isolate</code>. The extra magic lies within the first line:</p><pre><code>splitAt n0 k p0 = fmap join (k (loop n0 p0))</code></pre><p>Lens aficionados will recognize this as the dependency-free version of:</p><pre><code>splitAt n0 = iso (loop n0) join</code></pre><p>This not only instructs the <code>Lens</code> in how to isolate out the first ten elements (using <code>loop</code>), but also how to reverse the process and merge the remaining elements back in (using <code>join</code>). This latter information is what makes leftover propagation work.</p><h4 id="getters">Getters</h4><p>Note that not all transformations are reversible and therefore cannot propagate leftovers upstream. Fortunately, the <code>lens</code> model handles this perfectly: just define a <code>Getter</code> instead of a <code>Lens'</code> for transformations that are not reversible:</p><pre><code>getter :: Getter (Producer a m x) (Producer b m y)</code></pre><p>A <code>Getter</code> will only type-check as an argument to <code>(^.)</code> and not <code>zoom</code>, so you cannot propagate <code>unDraw</code> through a <code>Getter</code> like you can with a <code>Lens'</code>:</p><pre><code>zoom getter unDraw -- Type error</code></pre><p>However, you can still use the <code>Getter</code> to transform <code>Producer</code>s:</p><pre><code>producer ^. 
getter -- Type checks!</code></pre><p>This provides a type-safe way to distinguish transformations that can propagate leftovers from those that cannot.</p><p>In practice, <code>pipes</code>-based parsing libraries just provide functions between <code>Producer</code>s instead of <code>Getter</code>s for simplicity:</p><pre><code>getter :: Producer a m x -> Producer b m y</code></pre><p>... but you can keep using lens-like syntax if you promote them to <code>Getter</code>s using the <code>to</code> function from <code>lens</code>:</p><pre><code>to :: (a -> b) -> Getter a b<br /><br />producer ^. to getter</code></pre><h4 id="lens-support">Lens Support</h4><p>Some people may worry about the cost of using <code>lens</code> in conjunction with <code>pipes-parse</code> because <code>lens</code> is not beginner-friendly and has a large dependency graph, so I'd like to take a moment to advertise Russell O'Connor's <code>lens-family-core</code> library. <code>lens-family-core</code> is a beginner-friendly <code>lens</code>-alternative that is (mostly) <code>lens</code>-compatible. It provides much simpler, beginner-friendly types and has a really tiny dependency profile.</p><p>Note that <code>pipes-parse</code> does not depend on either lens library. This is one of the beautiful aspects about lenses: you can write a lens-compatible library using nothing more than stock components from the Prelude and the <code>transformers</code> library. You can therefore combine <code>pipes-parse</code> with either <code>lens-family-core</code> or <code>lens</code> without any conflicts. This provides a smooth transition path from the beginner-friendly <code>lens-family-core</code> library to the expert-friendly <code>lens</code> library.</p><h4 id="laws">Laws</h4><p>The lens approach is nice because you get many laws for free. For example, you get several associativity laws, like the fact that <code>(^.)</code> associates with <code>(.)</code>:</p><pre><code>(producer ^. lens1) ^. 
lens2 = producer ^. (lens1 . lens2)</code></pre><p>Similarly, <code>(.)</code> associates with <code>zoom</code>:</p><pre><code>zoom lens1 (zoom lens2 parser) = zoom (lens1 . lens2) parser</code></pre><p>... and the trio of <code>evalStateT</code>/<code>(^.)</code>/<code>zoom</code> all associate:</p><pre><code>evalStateT parser (producer ^. lens)<br /> = evalStateT (zoom lens parser) producer</code></pre><p>Also, lens composition associates, because it's just function composition:</p><pre><code>(lens1 . lens2) . lens3 = lens1 . (lens2 . lens3)</code></pre><p>You even get the following identity laws for free:</p><pre><code>producer ^. id = producer<br /><br />zoom id producer = producer<br /><br />f . id = f<br /><br />id . f = f</code></pre><p>However, there is one caveat: many of the lenses in <code>pipes-parse</code> do not satisfy certain lens laws. Specifically, they do not satisfy these laws:</p><pre><code>-- Law violation #1<br />view lens (set lens x a) /= x<br /><br />-- Law violation #2<br />zoom lens $ do x <- m    /=    do x <- zoom lens m<br />               f x                zoom lens (f x)</code></pre><p>Law violation #1 arises because I don't know of a lens-like abstraction that type-checks as a <code>Getter</code> and a <code>Focusing</code>, but not a <code>Setter</code>.</p><p>However, law #2 directly conflicts with a core <code>pipes-parse</code> feature, specifically lenses like <code>splitAt</code> that delimit parsers. Here's why:</p><pre><code>zoom (splitAt n) $ do x <- m    /=    do x <- zoom (splitAt n) m<br />                      f x                zoom (splitAt n) (f x)</code></pre><p>Limiting one parser to <code>n</code> elements is not the same as limiting its two sub-parsers to <code>n</code> elements each. So if you use <code>pipes-parse</code> lenses you cannot rely on <code>zoom</code> being a monad morphism when doing equational reasoning. This was a tough call for me to make, but I felt that delimiting parsers were more important than the monad morphism laws for <code>zoom</code>.
Perhaps there is a more elegant solution that I missed that resolves this conflict, but I'm still pleased with the current solution.</p><h4 id="generality">Generality</h4><p>Notice how all of these functions and laws are completely <code>pipes</code>-independent. Any streaming abstraction that has some <code>Producer</code>-like type can implement lens-based parsing, too, and also get all of these laws for free, including pure streams (such as strict or lazy <code>Text</code>) or even lazy <code>IO</code>. Other streaming libraries can therefore benefit from this exact same trick.</p><h4 id="batteries-included">Batteries included</h4><p>Downstream libraries have been upgraded to use the <code>pipes-parse</code> API, including <code>pipes-bytestring</code>, <code>pipes-binary</code>, and <code>pipes-attoparsec</code>. This means that right now you can do cool things like:</p><pre><code>import Lens.Family.State.Strict (zoom)<br />import Pipes<br />import Pipes.Parse<br />import Pipes.Binary<br />import qualified Pipes.ByteString as ByteString<br /><br />parser :: Parser ByteString IO ()<br />parser = zoom (ByteString.splitAt 100 . decoded) $ do<br /> x <- draw -- Draw a decoded Int<br /> lift $ print (x :: Maybe Int)<br /> unDraw 99 -- This undraws the encoded bytes<br /><br />producer :: Monad m => Producer ByteString m ()<br />producer = for (each [(1::Int)..]) encode</code></pre><p>The above <code>parser</code> composes two lenses so that it zooms in on a stream of decoded ints that consume no more than 100 bytes. This will transmute <code>draw</code> to now receive decoded elements and <code>unDraw</code> will magically re-encode elements when pushing back leftovers:</p><pre><code>>>> producer' <- execStateT parser producer<br />Just 1<br />>>> evalStateT parser producer'<br />Just 99</code></pre><p>Also, Michael Thompson has released a draft of <a href="http://hackage.haskell.org/package/pipes-text"><code>pipes-text</code></a> on Hackage.
This means you can parse a byte stream through a UTF-8 lens and any undrawn input will be encoded back into the original byte stream as bytes. Here is an example program showcasing this neat feature:<pre><code>-- decode.hs<br /><br />import Data.ByteString (ByteString)<br />import Data.Text (Text)<br />import Lens.Family.State.Strict (zoom)<br />import Pipes<br />import Pipes.Parse<br />import qualified Pipes.ByteString as ByteString<br />import qualified Pipes.Text as Text<br /><br />-- Retrieve all `Text` chunks up to 10 characters<br />parser :: Monad m => Parser ByteString m [Text]<br />parser = zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll<br /><br />main = do<br /> (textChunks, leftovers) <- runStateT parser ByteString.stdin<br /> print textChunks<br /><br /> -- Now print the remaining `ByteString` chunks<br /> byteChunks <- evalStateT drawAll leftovers<br /> print byteChunks</code></pre><p>The unused bytes from the decoded stream get correctly undrawn to the original byte stream!</p><pre><code>$ ./decode<br />Hello, 世界!!!<Enter><br />["Hello, \19990\30028!"]<br />abcdefg<Enter><br /><Ctrl-D><br />["!!\n","abcdefg\n"]<br />$</code></pre><p>The remainder of the first line is undrawn by the decoder and restored back as the original encoded bytes.</p><p>Note that the above example is line-buffered, which is why the program does not output the <code>Text</code> chunks immediately after the 10th input character. However, if you disable line buffering then all chunks have just a single character and the example wouldn't illustrate how leftovers work.</p><p>The above example could have also been written as a single <code>Parser</code>: <pre><code>parser :: Parser ByteString IO ()<br />parser = do<br /> texts <- zoom (Text.decodeUtf8 . Text.splitAt 10) drawAll<br /> lift (print texts)<br /> bytes <- drawAll<br /> lift (print bytes)<br /><br />main = evalStateT parser ByteString.stdin<br /></code></pre><p>...
but I wanted to make the leftover passing explicit to emphasize that the leftover behavior holds correctly whether or not you exit and re-enter <code>pipes</code>.</p><h4 id="conclusion">Conclusion</h4><p>The <code>pipes-parse</code> API lets you propagate leftovers upstream, encode leftover support in the type system, and equationally reason about code with several theoretical laws. Additionally, <code>pipes-parse</code> reuses existing functions and concepts from <code>lens</code> and <code>StateT</code> rather than introducing a new set of abstractions to learn.</p><p>Note that <code>pipes-parse</code> used to have some <code>FreeT</code>-based operations as well. These have been moved to a separate <code>pipes-group</code> library (and upgraded to use lenses) since they are conceptually orthogonal to parsing; I will blog about this library in a separate post.</p><p>You can find <code>pipes-parse</code> on <a href="http://hackage.haskell.org/package/pipes-parse">Hackage</a> or <a href="https://github.com/Gabriel439/Haskell-Pipes-Parse-Library">Github</a>, and it comes with an <a href="http://hackage.haskell.org/package/pipes-parse-3.0.1/docs/Pipes-Parse-Tutorial.html">extended tutorial</a>.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-4262361330223403882014-02-02T05:17:00.000-08:002014-02-05T21:12:58.144-08:00Streaming logging<p>I see many beginners use <code>WriterT [a]</code> in their programs to log outputs like this:</p><pre><code>import Control.Monad.Trans.Class (lift)<br />import Control.Monad.Trans.Writer<br />import Prelude hiding (log)<br /><br />log :: Monad m => String -> WriterT [String] m ()<br />log string = tell [string]<br /><br />example :: WriterT [String] IO ()<br />example = do<br /> log "Printing 1 ..."<br /> lift $ print 1<br /> log "Printing 2 ..."<br /> lift $ print 2<br /><br />main = do<br /> strings <- execWriterT example<br /> mapM_ putStrLn
strings</code></pre><p>This is not the best approach, because you cannot retrieve any logged values until the computation is complete:</p><pre><code>>>> main<br />1<br />2<br />Printing 1 ...<br />Printing 2 ...</code></pre><p>This approach is unsuitable for long-running programs like servers, where we wish to inspect logged output while the program is still running. Worse, this approach will waste memory storing all logged values until the very end.</p><p>The simplest way to solve this is just to modify our computation to take the desired logging function as a parameter:</p><pre><code>parametrize :: (String -> IO ()) -> IO ()<br />parametrize log = do<br /> log "Printing 1 ..."<br /> print 1<br /> log "Printing 2 ..."<br /> print 2<br /><br />main = parametrize putStrLn</code></pre><p>Now we log output immediately without wasting any memory:</p><pre><code>>>> main<br />Printing 1 ...<br />1<br />Printing 2 ...<br />2</code></pre><p>However, this approach is still a little brittle. For example, suppose we wish to log these lines to a file. As a basic denial-of-service precaution we might wish to cap the number of logged lines (or put the log file on a separate partition, but humor me). Limiting the logged output would necessitate the use of an <code>IORef</code> to coordinate between logging callbacks:</p><pre><code>import Control.Monad (when)<br />import Data.IORef<br /><br />main = do<br /> count <- newIORef 0<br /><br /> let putStrLn' maxLines string = do<br /> n <- readIORef count<br /> when (n < maxLines) $ do<br /> putStrLn string<br /> writeIORef count (n + 1)<br /><br /> parametrize (putStrLn' 1)</code></pre><p>Now we have tightly integrated state into our log function and increased our dependence on <code>IO</code>.
I prefer to limit unnecessary <code>IO</code> and also avoid callback hell, so I will introduce a third solution:</p><pre><code>import Pipes<br />import Prelude hiding (log)<br /><br />log :: Monad m => String -> Producer String m ()<br />log = yield<br /><br />piped :: Producer String IO ()<br />piped = do<br /> log "Printing 1 ..."<br /> lift $ print 1<br /> log "Printing 2 ..."<br /> lift $ print 2<br /><br />main = runEffect $ for piped (lift . putStrLn)</code></pre><p>The <code>piped</code> code is syntactically identical to our original <code>example</code>, but this time we stream values immediately instead of deferring all results to a large list at the end:</p><pre><code>>>> main<br />Printing 1 ...<br />1<br />Printing 2 ...<br />2</code></pre><p>In fact, the <code>for</code> combinator from <code>Pipes</code> exactly recapitulates the behavior of our <code>parametrize</code> function. <code>(for p f)</code> replaces every <code>yield</code> in <code>p</code> with <code>f</code>, and <code>log</code> is just a synonym for <code>yield</code>, so we can freely substitute <code>log</code> commands using <code>for</code>. It's as if we had directly parametrized our <code>piped</code> function on the logging action:</p><pre><code>for piped (lift . putStrLn)<br /><br />-- Replace each `log`/`yield` with `(lift . putStrLn)`<br />= do (lift . putStrLn) "Printing 1 ..."<br /> lift $ print 1<br /> (lift . putStrLn) "Printing 2 ..."<br /> lift $ print 2<br /><br />-- Simplify a little bit<br />= do lift $ putStrLn $ "Printing 1 ..."<br /> lift $ print 1<br /> lift $ putStrLn $ "Printing 2 ..."<br /> lift $ print 2<br /><br />-- `lift` is a monad morphism, so we can factor it out<br />= lift $ do putStrLn $ "Printing 1 ..."<br /> print 1<br /> putStrLn $ "Printing 2 ..."<br /> print 2</code></pre><p>...
and all that <code>runEffect</code> does is remove the <code>lift</code>:</p><pre><code>runEffect (for piped (lift . putStrLn))<br /><br />= runEffect $ lift $ do<br /> putStrLn $ "Printing 1 ..."<br /> print 1<br /> putStrLn $ "Printing 2 ..."<br /> print 2<br /><br />-- runEffect (lift m) = m<br />= do putStrLn $ "Printing 1 ..."<br /> print 1<br /> putStrLn $ "Printing 2 ..."<br /> print 2</code></pre><p>However, unlike the <code>parametrize</code> example, <code>piped</code> is more flexible. We can manipulate <code>yield</code>s in many more ways than just the <code>for</code> combinator. For example, we can use the <code>take</code> <code>Pipe</code> from <code>Pipes.Prelude</code> to easily limit the number of logged outputs:</p><pre><code>import qualified Pipes.Prelude as Pipes<br /><br />limit :: Monad m => Int -> Pipe String String m r<br />limit n = do<br /> Pipes.take n -- Forward the first `n` outputs<br /> Pipes.drain -- Ignore the remaining log statements<br /><br />main = runEffect $ for (piped >-> limit 1) (lift . putStrLn)</code></pre><p>... or for people who prefer <code>(>->)</code> over <code>for</code>, you can write the entire thing as one long pipeline:</p><pre><code>main = runEffect $ piped >-> limit 1 >-> Pipes.stdoutLn</code></pre><p>This will now only output the first logged value:</p><pre><code>>>> main<br />Printing 1 ...<br />1<br />2</code></pre><p>We get all of this with a strict separation of concerns. All three stages in our pipeline are separable and reusable in the same spirit as Unix pipes.</p><p>So the next time you need to log a stream of values, consider using a <code>Producer</code> to stream values immediately instead of building up a large list in memory. <code>Producer</code>s preserve a great deal of flexibility with very few dependencies and low syntactic overhead.
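The substitution argument above is easy to replay on a toy model. The following base-only sketch (the type and the names `yield'`, `lift'`, and `runFor` are hypothetical simplifications, far less general than the real `pipes` internals) shows a `Producer`-like free monad whose interpreter literally substitutes the handler for every `yield'`, just like `runEffect`/`for`:

```haskell
import Control.Monad (ap, liftM)

-- A toy Producer free monad (hypothetical; the real pipes `Proxy`
-- type is bidirectional and far more general).
data Producer o m r
    = Yield o (Producer o m r)   -- emit an output, then continue
    | Lift (m (Producer o m r))  -- run an effect in the base monad
    | Return r                   -- finish with a result

instance Monad m => Functor (Producer o m) where
    fmap = liftM

instance Monad m => Applicative (Producer o m) where
    pure  = Return
    (<*>) = ap

instance Monad m => Monad (Producer o m) where
    Yield o p >>= k = Yield o (p >>= k)
    Lift m'   >>= k = Lift (fmap (>>= k) m')
    Return r  >>= k = k r

yield' :: o -> Producer o m ()
yield' o = Yield o (Return ())

lift' :: Monad m => m r -> Producer o m r
lift' m' = Lift (fmap Return m')

-- The analog of `runEffect (for p (lift . f))`: substitute the
-- handler `f` for every `yield'`, interleaving the base effects.
runFor :: Monad m => Producer o m r -> (o -> m ()) -> m r
runFor (Yield o p) f = f o >> runFor p f
runFor (Lift m')   f = m' >>= \p -> runFor p f
runFor (Return r)  _ = return r

-- Collect the outputs of an effect-free producer (handy for tests)
yields :: Producer o m r -> [o]
yields (Yield o p) = o : yields p
yields _           = []

piped' :: Producer String IO ()
piped' = do
    yield' "Printing 1 ..."
    lift' (print 1)
    yield' "Printing 2 ..."
    lift' (print 2)

main :: IO ()
main = runFor piped' putStrLn
```

Walking `runFor` over `piped'` reproduces the interleaving derived above, which is the whole point: the logging action is chosen at the call site rather than baked into the producer.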
You can learn more about <code>pipes</code> by reading <a href="http://hackage.haskell.org/package/pipes-4.1.0/docs/Pipes-Tutorial.html">the tutorial</a>.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com4tag:blogger.com,1999:blog-1777990983847811806.post-60012476763423974032014-01-19T02:32:00.000-08:002014-01-20T06:12:02.953-08:00Shortcut fusion for pipes<p>Rewrite rules are a powerful tool that you can use to optimize Haskell code without breaking backwards compatibility. This post will illustrate how I use rewrite rules to implement a form of shortcut fusion for <a href="http://hackage.haskell.org/package/pipes">my <code>pipes</code> stream programming library</a>. I compare <code>pipes</code> performance before and after adding shortcut fusion and I also compare the performance of <code>pipes-4.0.2</code> vs. <code>conduit-1.0.10</code> and <code>io-streams-1.1.4.0</code>.</p><p>This post also includes a small introduction to Haskell's rewrite rule system for those who are interested in optimizing their own libraries.</p><p>Edit: This post originally used the term "stream fusion", but Duncan Coutts informed me that the more appropriate term for this is probably "short-cut fusion".</p><h4 id="rule-syntax">Rule syntax</h4><p>The following rewrite rule from <code>pipes</code> demonstrates the syntax for defining optimization rules.</p><pre><code>{-# RULES "for p yield" forall p . for p yield = p #-}<br />-- ^^^^^^^^^^^^^ ^ ^^^^^^^^^^^^^^^<br />-- Label | Substitution rule<br />-- |<br />-- `p` can be anything -+</code></pre><p>All rewrite rules are substitution rules, meaning that they instruct the compiler to replace anything that matches the left-hand side of the rule with the right-hand side. 
The above rewrite rule says to always replace <code>for p yield</code> with <code>p</code>, no matter what <code>p</code> is, as long as everything type-checks before and after substitution.</p><p>Rewrite rules are typically used to substitute code with equivalent code of greater efficiency. In the above example, <code>for p yield</code> is a <code>for</code> loop that re-<code>yield</code>s every element of <code>p</code>. The re-<code>yield</code>ed stream is behaviorally indistinguishable from the original stream, <code>p</code>, because all that <code>for p yield</code> does is replace every <code>yield</code> in <code>p</code> with yet another <code>yield</code>. However, while both sides of the equation behave identically, their efficiency is not the same; the left-hand side is less efficient because <code>for</code> loops are not free.</p><h4 id="safety">Safety</h4><p>Rewrite rules are not checked for correctness. The only thing the compiler does is verify that the left-hand side and right-hand side of the equation both type-check. The programmer who creates the rewrite rule is responsible for proving that the substitution preserves the original behavior of the program.</p><p>In fact, rewrite rules can be used to rewrite terms from other Haskell libraries without limitation. For this reason, modules with rewrite rules are automatically marked unsafe by Safe Haskell and they must be explicitly marked <code>Trustworthy</code> to be used by code marked <code>Safe</code>.</p><p>You can verify a rewrite rule is safe using <a href="http://www.haskellforall.com/2013/12/equational-reasoning.html">equational reasoning</a>. 
If you can reach the right-hand side of the equation from the left-hand side using valid code substitutions, then you can prove (with some caveats) that the two sides of the equation are functionally identical.</p><p><code>pipes</code> includes a <a href="http://www.haskellforall.com/2013/10/manual-proofs-for-pipes-laws.html">complete set of proofs</a> for its rewrite rules for this purpose. For example, the above rewrite rule is proven <a href="https://github.com/Gabriel439/Haskell-Pipes-Library/blob/cafda206cc5409cbece991010c52db6bdce9fdcb/laws.md#pointful-4">here</a>, where <code>(//>)</code> is an infix synonym for <code>for</code> and <code>respond</code> is a synonym for <code>yield</code>:</p><pre><code>for = (//>)<br /><br />yield = respond</code></pre><p>This means that equational reasoning is useful for more than just proving program correctness. You can also use derived equations to optimize your program, assuming that you already know which side of the equation is more efficient.</p><h4 id="laws">Laws</h4><p>Rewrite rules have a significant limitation: we cannot possibly anticipate every possible expression that downstream users might build using our libraries. So how can we optimize as much code as possible without an excessive proliferation of rewrite rules?</p><p>I anecdotally observe that equations inspired by category theory prove to be highly versatile optimizations that fit within a small number of rules. These equations include:</p><ul class="incremental"><li><p>Category laws</p></li><li><p>Functor laws</p></li><li><p>Natural transformation laws (i.e. free theorems)</p></li></ul><p>The first half of the shortcut fusion optimization consists of three monad laws, which are a special case of category laws. 
For those new to Haskell, the three monad laws are:</p><pre><code>-- Left Identity<br />return x >>= f = f x<br /><br />-- Right Identity<br />m >>= return = m<br /><br />-- Associativity<br />(m >>= f) >>= g = m >>= (\x -> f x >>= g)</code></pre><p>If you take these three laws and replace <code>(>>=)</code>/<code>return</code> with <code>for</code>/<code>yield</code> (and rename <code>m</code> to <code>p</code>, for 'p'ipe), you get the following "for loop laws":</p><pre><code>-- Looping over a yield simplifies to function application<br />for (yield x) f = f x<br /><br />-- Re-yielding every element returns the original stream<br />for p yield = p<br /><br />-- You can transform two passes over a stream into a single pass<br />for (for p f) g = for p (\x -> for (f x) g)</code></pre><p>This analogy to the monad laws is precise because <code>for</code> and <code>yield</code> are actually <code>(>>=)</code> and <code>return</code> for <a href="http://hackage.haskell.org/package/pipes-4.0.2/docs/Pipes.html#g:5">the <code>ListT</code> monad</a> when you newtype them appropriately, and they really form a <code>Monad</code> in the Haskell sense of the word.</p><p>What's amazing is that these monad laws also double as shortcut fusion optimizations when we convert them to rewrite rules. We already encountered the second law as our first rewrite rule, but the other two laws are useful rewrite rules, too:</p><pre><code>{-# RULES<br /> "for (yield x) f" forall x f .<br /> for (yield x) f = f x<br /> ; "for (for p f) g" forall p f g .<br /> for (for p f) g = for p (\x -> for (f x) g)<br /> #-}</code></pre><p>Note that the <code>RULES</code> pragma lets you group multiple rules together, as long as you separate them by semicolons. Also, there is no requirement that the rule label must match the left-hand side of the equation, but I use this convention since I'm bad at naming rewrite rules. 
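<p>As an aside, these "for loop laws" are easy to sanity-check without the <code>pipes</code> library at all, using the ordinary list monad. The following sketch (my own, not from the library) lets <code>for'</code> play the role of <code>(>>=)</code> for lists (i.e. <code>flip concatMap</code>) and <code>yield'</code> play the role of <code>return</code> (a singleton list), mirroring the <code>ListT</code> connection:</p>

```haskell
-- Sanity-check the three "for loop laws" using the ordinary list monad.
-- `for'` is list (>>=) with its arguments flipped; `yield'` is list return.
-- This is a hypothetical illustration; it does not use the pipes library.
for' :: [a] -> (a -> [b]) -> [b]
for' = flip concatMap

yield' :: a -> [a]
yield' x = [x]

main :: IO ()
main = do
  let p = [1, 2, 3 :: Int]
      f x = [x, x * 10]
      g y = [y + 1, y + 2]
  -- for (yield x) f = f x
  print (for' (yield' 5) f == f 5)
  -- for p yield = p
  print (for' p yield' == p)
  -- for (for p f) g = for p (\x -> for (f x) g)
  print (for' (for' p f) g == for' p (\x -> for' (f x) g))
```

<p>All three checks print <code>True</code>, which is exactly what the corresponding rewrite rules promise for <code>Producer</code>s.</p>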
This labeling convention also helps when diagnosing which rules fired (see below) without having to consult the original rule definitions.</p><h4 id="free-theorems">Free theorems</h4><p>These three rewrite rules alone do not suffice to optimize most <code>pipes</code> code. The reason why is that most idiomatic <code>pipes</code> code is not written in terms of <code>for</code> loops. For example, consider the <code>map</code> function from <code>Pipes.Prelude</code>:</p><pre><code>map :: Monad m => (a -> b) -> Pipe a b m r<br />map f = for cat (\x -> yield (f x))</code></pre><p>The idiomatic way to transform a pipe's output is to compose the <code>map</code> pipe downstream:</p><pre><code>p >-> map f</code></pre><p>We can't optimize this using our shortcut fusion rewrite rules unless we rewrite the above code to the equivalent <code>for</code> loop:</p><pre><code>for p (\y -> yield (f y))</code></pre><p>In other words, we require the following theorem:</p><pre><code>p >-> map f = for p (\y -> yield (f y))</code></pre><p>This is actually a special case of the following "free theorem":</p><pre><code>-- Exercise: Derive the previous equation from this one<br /><br />p1 >-> for p2 (\y -> yield (f y))<br /> = for (p1 >-> p2) (\y -> yield (f y))</code></pre><p>A free theorem is an equation that you can prove solely from studying the types of all terms involved. I will omit the proof of this free theorem for now, but I will discuss how to derive free theorems in detail in a follow-up post. For now, just assume that the above equations are correct, as codified by the following rewrite rule:</p><pre><code>{-# RULES<br /> "p >-> map f" forall p f .<br /> p >-> map f = for p (\y -> yield (f y))<br /> #-}</code></pre><p>With this rewrite rule the compiler can begin to implement simple map fusion. To see why, we'll compose two <code>map</code> pipes and then pretend that we are the compiler, applying rewrite rules at every opportunity. 
Every time we apply a rewrite rule we will refer to the rule by its corresponding string label:</p><pre><code>map f >-> map g<br /><br />-- "p >-> map f" rule fired<br />= for (map f) (\y -> yield (g y))<br /><br />-- Definition of `map`<br />= for (for cat (\x -> yield (f x))) (\y -> yield (g y))<br /><br />-- "for (for p f) g" rule fired<br />= for cat (\x -> for (yield (f x)) (\y -> yield (g y)))<br /><br />-- "for (yield x) f" rule fired<br />= for cat (\x -> yield (g (f x)))</code></pre><p>This is identical to a single <code>map</code> pass, which we can prove by equational reasoning:</p><pre><code>for cat (\x -> yield (g (f x)))<br /><br />-- Definition of `(.)`, in reverse<br />= for cat (\x -> yield ((g . f) x))<br /><br />-- Definition of `map`, in reverse<br />= map (g . f)</code></pre><p>So those rewrite rules sufficed to fuse the two <code>map</code> passes into a single pass. You don't have to take my word for it, though. For example, let's say that we want to prove that these rewrite rules fire for the following sample program, which increments, doubles, and then discards every number from 1 to 100000000:</p><pre><code>-- map-fusion.hs<br /><br />import Pipes<br />import qualified Pipes.Prelude as P<br /><br />main = runEffect $<br /> for (each [1..10^8] >-> P.map (+1) >-> P.map (*2)) discard</code></pre><p>The <code>-ddump-rule-firings</code> flag will output every rewrite rule that fires during compilation, identifying each rule with the string label accompanying the rule:</p><pre><code>$ ghc -O2 -ddump-rule-firings map-fusion.hs<br />[1 of 1] Compiling Main ( map-fusion.hs, map-fusion.o )<br />...<br />Rule fired: p >-> map f<br />Rule fired: for (for p f) g<br />Rule fired: for (yield x) f<br />...</code></pre><p>I've highlighted the rule firings that correspond to map fusion, although there are many other rewrite rules that fire (including more shortcut fusion rule firings).</p><h4 id="shortcut-fusion">Shortcut fusion</h4><p>We don't have to limit ourselves 
to just fusing <code>map</code>s. Many pipes in <code>Pipes.Prelude</code> have an associated free theorem that rewrites pipe composition into an equivalent <code>for</code> loop. After these rewrites, the "for loop laws" go to town on the pipeline and fuse it into a single pass.</p><p>For example, the <code>filter</code> pipe has a rewrite rule similar to <code>map</code>:</p><pre><code>{-# RULES<br /> "p >-> filter pred" forall p pred .<br /> p >-> filter pred =<br /> for p (\y -> when (pred y) (yield y))<br /> #-}</code></pre><p>So if we combine <code>map</code> and <code>filter</code> in a pipeline, they will also fuse into a single pass:</p><pre><code>p >-> map f >-> filter pred<br /><br />-- "p >-> map f" rule fires<br />= for p (\x -> yield (f x)) >-> filter pred<br /><br />-- "p >-> filter pred" rule fires<br />= for (for p (\x -> yield (f x))) (\y -> when (pred y) (yield y))<br /><br />-- "for (for p f) g" rule fires<br />= for p (\x -> for (yield (f x)) (\y -> when (pred y) (yield y)))<br /><br />-- "for (yield x) f" rule fires<br />= for p (\x -> let y = f x in when (pred y) (yield y))</code></pre><p>This is the kind of single pass loop we might have written by hand if we were <code>pipes</code> experts, but thanks to rewrite rules we can write high-level, composable code and let the library automatically rewrite it into efficient and tight loops.</p><p>Note that not all pipes are fusible in this way. For example, the <code>take</code> pipe cannot be fused because it cannot be rewritten in terms of a <code>for</code> loop.</p><h4 id="benchmarks">Benchmarks</h4><p>These rewrite rules make fusible pipe stages essentially free. 
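<p>You can convince yourself that the fused single-pass loop computes the right thing with a quick sketch using plain lists instead of pipes (this is my own illustration, not code from the library): mapping and filtering in one <code>concatMap</code> pass agrees with the two-pass pipeline.</p>

```haskell
-- A list analogue of the fused map/filter loop derived above:
-- apply `f`, then keep the result only when it satisfies `p`,
-- all in a single pass over the input.
fusedMapFilter :: (a -> b) -> (b -> Bool) -> [a] -> [b]
fusedMapFilter f p =
  concatMap (\x -> let y = f x in if p y then [y] else [])

main :: IO ()
main = do
  let xs = [1 .. 10 :: Int]
  -- Single fused pass
  print (fusedMapFilter (+ 1) even xs)
  -- It agrees with the two-pass pipeline
  print (fusedMapFilter (+ 1) even xs == filter even (map (+ 1) xs))
```

<p>The single pass never materializes the intermediate mapped list, which is the same saving the rewrite rules buy us for <code>Producer</code>s.</p>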
To illustrate this I've set up a <code>criterion</code> benchmark testing running time as a function of the number of <code>map</code> stages in a pipeline:</p><pre><code>import Criterion.Main<br />import Data.Functor.Identity (runIdentity)<br />import Pipes<br />import qualified Pipes.Prelude as P<br /><br />n :: Int<br />n = 10^6<br /><br />main = defaultMain<br /> [ bench' "1 stage " $ \n -><br /> each [1..n]<br /> >-> P.map (+1)<br /> , bench' "2 stages" $ \n -><br /> each [1..n]<br /> >-> P.map (+1)<br /> >-> P.map (+1)<br /> , bench' "3 stages" $ \n -><br /> each [1..n]<br /> >-> P.map (+1)<br /> >-> P.map (+1)<br /> >-> P.map (+1)<br /> , bench' "4 stages" $ \n -><br /> each [1..n]<br /> >-> P.map (+1)<br /> >-> P.map (+1)<br /> >-> P.map (+1)<br /> >-> P.map (+1)<br /> ]<br /> where<br /> bench' label f = bench label $<br /> whnf (\n -> runIdentity $ runEffect $ for (f n) discard)<br /> (10^5 :: Int)</code></pre><p>Before shortcut fusion (i.e. <code>pipes-4.0.0</code>), the running time scales linearly with the number of <code>map</code> stages:</p><pre><code>warming up<br />estimating clock resolution...<br />mean is 24.53411 ns (20480001 iterations)<br />found 80923 outliers among 20479999 samples (0.4%)<br /> 32461 (0.2%) high severe<br />estimating cost of a clock call...<br />mean is 23.89897 ns (1 iterations)<br /><br />benchmarking 1 stage <br />mean: 4.480548 ms, lb 4.477734 ms, ub 4.485978 ms, ci 0.950<br />std dev: 19.42991 us, lb 12.11399 us, ub 35.90046 us, ci 0.950<br /><br />benchmarking 2 stages<br />mean: 6.304547 ms, lb 6.301067 ms, ub 6.310991 ms, ci 0.950<br />std dev: 23.60979 us, lb 14.01610 us, ub 37.63093 us, ci 0.950<br /><br />benchmarking 3 stages<br />mean: 10.60818 ms, lb 10.59948 ms, ub 10.62583 ms, ci 0.950<br />std dev: 61.05200 us, lb 34.79662 us, ub 102.5613 us, ci 0.950<br /><br />benchmarking 4 stages<br />mean: 13.74065 ms, lb 13.73252 ms, ub 13.76065 ms, ci 0.950<br />std dev: 61.13291 us, lb 29.60977 us, ub 123.3071 us, 
ci 0.950</code></pre><p>Shortcut fusion (added in <code>pipes-4.0.1</code>) makes additional <code>map</code> stages essentially free:</p><pre><code>warming up<br />estimating clock resolution...<br />mean is 24.99854 ns (20480001 iterations)<br />found 1864216 outliers among 20479999 samples (9.1%)<br /> 515889 (2.5%) high mild<br /> 1348320 (6.6%) high severe<br />estimating cost of a clock call...<br />mean is 23.54777 ns (1 iterations)<br /><br />benchmarking 1 stage <br />mean: 2.427082 ms, lb 2.425264 ms, ub 2.430500 ms, ci 0.950<br />std dev: 12.43505 us, lb 7.564554 us, ub 20.11641 us, ci 0.950<br /><br />benchmarking 2 stages<br />mean: 2.374217 ms, lb 2.373302 ms, ub 2.375435 ms, ci 0.950<br />std dev: 5.394149 us, lb 4.270983 us, ub 8.407879 us, ci 0.950<br /><br />benchmarking 3 stages<br />mean: 2.438948 ms, lb 2.436673 ms, ub 2.443006 ms, ci 0.950<br />std dev: 15.11984 us, lb 9.602960 us, ub 23.05668 us, ci 0.950<br /><br />benchmarking 4 stages<br />mean: 2.372556 ms, lb 2.371644 ms, ub 2.373949 ms, ci 0.950<br />std dev: 5.684231 us, lb 3.955916 us, ub 9.040744 us, ci 0.950</code></pre><p>In fact, once you have just two stages in your pipeline, <code>pipes</code> greatly outperforms <code>conduit</code> and breaks roughly even with <code>io-streams</code>. To show this I've written up a benchmark comparing <code>pipes</code> performance against these libraries for both pure loops and loops that are slightly IO-bound (by writing to <code>/dev/null</code>):</p><pre><code>import Criterion.Main<br />import Data.Functor.Identity (runIdentity)<br />import qualified System.IO as IO<br /><br />import Data.Conduit<br />import qualified Data.Conduit.List as C<br /><br />import Pipes<br />import qualified Pipes.Prelude as P<br /><br />import qualified System.IO.Streams as S<br /><br />criterion :: Int -> IO ()<br />criterion n = IO.withFile "/dev/null" IO.WriteMode $ \h -><br /> defaultMain<br /> [ bgroup "pure"<br /> [ bench "pipes" $ whnf (runIdentity . 
pipes) n<br /> , bench "conduit" $ whnf (runIdentity . conduit) n<br /> , bench "io-streams" $ nfIO (iostreams n)<br /> ]<br /> , bgroup "io"<br /> [ bench "pipes" $ nfIO (pipesIO h n)<br /> , bench "conduit" $ nfIO (conduitIO h n)<br /> , bench "iostreams" $ nfIO (iostreamsIO h n)<br /> ]<br /> ]<br /><br />pipes :: Monad m => Int -> m ()<br />pipes n = runEffect $<br /> for (each [1..n] >-> P.map (+1) >-> P.filter even) discard<br /><br />conduit :: Monad m => Int -> m ()<br />conduit n =<br /> C.enumFromTo 1 n $= C.map (+1) $= C.filter even $$ C.sinkNull<br /><br />iostreams :: Int -> IO ()<br />iostreams n = do<br /> is0 <- S.fromList [1..n]<br /> is1 <- S.map (+1) is0<br /> is2 <- S.filter even is1<br /> S.skipToEof is2<br /><br />pipesIO :: IO.Handle -> Int -> IO ()<br />pipesIO h n = runEffect $<br /> each [1..n]<br /> >-> P.map (+1)<br /> >-> P.filter even<br /> >-> P.map show<br /> >-> P.toHandle h<br /><br />conduitIO :: IO.Handle -> Int -> IO ()<br />conduitIO h n =<br /> C.enumFromTo 1 n<br /> $= C.map (+1)<br /> $= C.filter even<br /> $= C.map show<br /> $$ C.mapM_ (IO.hPutStrLn h)<br /><br />iostreamsIO :: IO.Handle -> Int -> IO ()<br />iostreamsIO h n = do<br /> is0 <- S.fromList [1..n]<br /> is1 <- S.map (+1) is0<br /> is2 <- S.filter even is1<br /> is3 <- S.map show is2<br /> os <- S.makeOutputStream $ \ma -> case ma of<br /> Just str -> IO.hPutStrLn h str<br /> _ -> return ()<br /> S.connect is3 os<br /><br />main = criterion (10^6)</code></pre><p>The benchmarks place <code>pipes</code> neck-and-neck with <code>io-streams</code> on pure loops and 10% slower on slightly IO-bound code. 
Both libraries perform faster than <code>conduit</code>:</p><pre><code>warming up<br />estimating clock resolution...<br />mean is 24.50726 ns (20480001 iterations)<br />found 117040 outliers among 20479999 samples (0.6%)<br /> 45158 (0.2%) high severe<br />estimating cost of a clock call...<br />mean is 23.89208 ns (1 iterations)<br /><br />benchmarking pure/pipes<br />mean: 24.04860 ms, lb 24.02136 ms, ub 24.10872 ms, ci 0.950<br />std dev: 197.3707 us, lb 91.05894 us, ub 335.2267 us, ci 0.950<br /><br />benchmarking pure/conduit<br />mean: 172.8454 ms, lb 172.6317 ms, ub 173.1824 ms, ci 0.950<br />std dev: 1.361239 ms, lb 952.1500 us, ub 1.976641 ms, ci 0.950<br /><br />benchmarking pure/io-streams<br />mean: 24.16426 ms, lb 24.12789 ms, ub 24.22919 ms, ci 0.950<br />std dev: 242.5173 us, lb 153.9087 us, ub 362.4092 us, ci 0.950<br /><br />benchmarking io/pipes<br />mean: 267.7021 ms, lb 267.1789 ms, ub 268.4542 ms, ci 0.950<br />std dev: 3.189998 ms, lb 2.370387 ms, ub 4.392541 ms, ci 0.950<br /><br />benchmarking io/conduit<br />mean: 310.3034 ms, lb 309.8225 ms, ub 310.9444 ms, ci 0.950<br />std dev: 2.827841 ms, lb 2.194127 ms, ub 3.655390 ms, ci 0.950<br /><br />benchmarking io/iostreams<br />mean: 239.6211 ms, lb 239.2072 ms, ub 240.2354 ms, ci 0.950<br />std dev: 2.564995 ms, lb 1.879984 ms, ub 3.442018 ms, ci 0.950</code></pre><p>I hypothesize that <code>pipes</code> performs slightly slower on <code>IO</code> compared to <code>io-streams</code> because of the cost of calling <code>lift</code>, whereas <code>iostreams</code> operates directly within the <code>IO</code> monad at all times.</p><p>These benchmarks should be taken with a grain of salt. All three libraries are most frequently used in strongly IO-bound scenarios, where the overhead of each library is pretty much negligible. 
However, this still illustrates how big of an impact shortcut fusion can have on pure code paths.</p><h4 id="conclusions">Conclusions</h4><p><code>pipes</code> is a stream programming library with a strong emphasis on theory and the library's contract with the user is a set of laws inspired by category theory. My original motivation behind proving these laws was to fulfill the contract, but I only later realized that the <code>for</code> loop laws doubled as fortuitous shortcut fusion optimizations. This is a recurring motif in Haskell: thinking mathematically pays large dividends.</p><p>For this reason I like to think of Haskell as applied category theory: I find that many topics I learn from category theory directly improve my Haskell code. This post shows one example of this phenomenon, where shortcut fusion naturally falls out of the monad laws for <code>ListT</code>.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-64889681216853814642013-12-25T12:41:00.000-08:002013-12-26T13:34:12.793-08:00Equational reasoning<p>You may have heard that Haskell is "great for equational reasoning", but perhaps you didn't know precisely what that meant. This post will walk through an intermediate example of equational reasoning to show how you can interpret Haskell code by hand by applying substitutions as if your code were a mathematical expression. This process of interpreting code by substituting equals-for-equals is known as equational reasoning.</p><br /><h4 id="equality">Equality</h4><p>In Haskell, when you see a single equals sign it means that the left-hand and right-hand side are substitutable for each other both ways. 
For example, if you see:</p><pre><code>x = 5</code></pre><p>That means that wherever you see an <code>x</code> in your program, you can substitute it with a <code>5</code>, and, vice versa, anywhere you see a <code>5</code> in your program you can substitute it with an <code>x</code>. Substitutions like these preserve the behavior of your program.</p><p>In fact, you can "run" your program just by repeatedly applying these substitutions until your program is just one giant <code>main</code>. To illustrate this, we will begin with the following program, which should print the number <code>1</code> three times:</p><pre><code>import Control.Monad<br /><br />main = replicateM_ 3 (print 1)</code></pre><p><code>replicateM_</code> is a function that repeats an action a specified number of times. Its type when specialized to <code>IO</code> is:</p><pre><code>replicateM_ :: Int -> IO () -> IO ()</code></pre><p>The first argument is an <code>Int</code> specifying how many times to repeat the action and the second argument is the action we wish to repeat. In the above example, we specify that we wish to repeat the <code>print 1</code> action three times.</p><p>But what if you don't believe me? What if you wanted to prove to yourself that it repeated the action three times? How would you prove that?</p><br /><h4 id="use-the-source">Use the source!</h4><p>You can locate the source to Haskell functions using one of three tricks:</p><ul class="incremental"><li><p>Use <a href="http://www.haskell.org/hoogle/">Hoogle</a>, which can also search for functions by type signature</p></li><li><p>Use <a href="http://holumbus.fh-wedel.de/hayoo/hayoo.html">Hayoo!</a>, which is like hoogle, but searches a larger package database and is more strict about matches</p></li><li><p>Use <a href="http://www.google.com/">Google</a> and search for "hackage <function>". 
This also works well for searching for packages.</p></li></ul><p>Using any of those three tricks we would locate <code>replicateM_</code> <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/Control-Monad.html#v:replicateM_">here</a> and then we can click the <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/src/Control-Monad.html#replicateM_">Source</a> link to the right of it to view its definition, which I reproduce here:</p><pre><code>replicateM_ n x = sequence_ (replicate n x)</code></pre><p>The equals sign means that any time we see something of the form <code>replicateM_ n x</code>, we can substitute it with <code>sequence_ (replicate n x)</code>, for any choice of <code>n</code> or <code>x</code>. For example, if we choose the following values for <code>n</code> and <code>x</code>:</p><pre><code>n = 3<br /><br />x = print 1</code></pre><p>... then we obtain the following more specific equation:</p><pre><code>replicateM_ 3 (print 1) = sequence_ (replicate 3 (print 1))</code></pre><p>We will use this equation to replace our program's <code>replicateM_</code> command with an equal program built from <code>sequence_</code> and <code>replicate</code>:</p><pre><code>main = sequence_ (replicate 3 (print 1))</code></pre><p>The equation tells us that this substitution is safe and preserves the original behavior of our program.</p><p>Now, in order to simplify this further we must consult the definition of both <code>replicate</code> and <code>sequence_</code>. 
When in doubt which one to pick, always pick the outermost function because Haskell is lazy and evaluates everything from outside to in.</p><p>In this case, our outermost function is <code>sequence_</code>, defined <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/src/Control-Monad.html#sequence_">here</a>:</p><pre><code>-- | Combine a list of actions into a single action<br />sequence_ :: [IO ()] -> IO () -- I've simplified the type<br />sequence_ ms = foldr (>>) (return ()) ms</code></pre><p>We will substitute this into our <code>main</code>, noting that <code>ms</code> is <code>replicate 3 (print 1)</code> for the purpose of this substitution:</p><pre><code>main = foldr (>>) (return ()) (replicate 3 (print 1))</code></pre><p>Now <code>foldr</code> is our outermost function, so we'll consult the <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/src/GHC-Base.html#foldr">definition of <code>foldr</code></a>:</p><pre><code>-- | 'foldr k z' replaces all `(:)`s with `k`<br />-- and replaces `[]` with `z`<br />foldr :: (a -> b -> b) -> b -> [a] -> b<br />foldr k z [] = z<br />foldr k z (x:xs) = k x (foldr k z xs)</code></pre><p>Here we see two equations. Both equations work both ways, but we don't know which equation to apply unless we know whether the third argument to <code>foldr</code> is an empty list or not. 
We must evaluate the call to <code>replicate</code> to see whether it will pass us an empty list or a non-empty list, so we consult the <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/src/GHC-List.html#replicate">definition of <code>replicate</code></a>:</p><pre><code>-- | Create a list containing `x` repeated `n` times<br />replicate :: Int -> a -> [a]<br />replicate n x = take n (repeat x)<br /><br />-- Apply the substitution, using these values:<br />-- n = 3<br />-- x = print 1<br /><br />main = foldr (>>) (return ()) (take 3 (repeat (print 1)))</code></pre><p>Boy, this rabbit hole keeps getting deeper and deeper! However, we must persevere. Let's now consult the <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/src/GHC-List.html#take">definition of <code>take</code></a>:</p><pre><code>-- | Take the first `n` elements from a list<br />take :: Int -> [a] -> [a]<br />take n _ | n <= 0 = []<br />take _ [] = []<br />take n (x:xs) = x : take (n-1) xs</code></pre><p>Here we see three equations. The first one has a predicate, saying that the equality is only valid if <code>n</code> is less than or equal to 0. In our case <code>n</code> is 3, so we skip that equation. However, we cannot distinguish which of the latter two equations to use unless we know whether <code>repeat (print 1)</code> produces an empty list or not, so we must consult the <a href="http://hackage.haskell.org/package/base-4.6.0.1/docs/src/GHC-List.html#repeat">definition of <code>repeat</code></a>:</p><pre><code>-- | `repeat x` creates an infinite list of `x`s<br />repeat :: a -> [a]<br />repeat x = x:repeat x -- Simplified from the real definition<br /><br />-- Apply the substitution, using these values:<br />-- x = print 1<br /><br />main = foldr (>>) (return ()) (take 3 (print 1:repeat (print 1)))</code></pre><p>The buck stops here! Although <code>repeat</code> is infinitely recursive, we don't have to fully evaluate it. 
We can just evaluate it once and lazily defer the recursive call, since all we need to know for now is that the list has at least one value. This now provides us with enough information to select the third equation for <code>take</code> that requires a non-empty list as its argument:</p><pre><code>take n (x:xs) = x : take (n-1) xs<br /><br />-- Apply the substitution, using these values:<br />-- n = 3<br />-- x = print 1<br />-- xs = repeat (print 1)<br /><br />main = foldr (>>) (return ()) (print 1:take 2 (repeat (print 1)))</code></pre><p>Now we know for sure that <code>foldr</code>'s third argument is a non-empty list, so we can select the second equation for <code>foldr</code>:</p><pre><code>foldr k z (x:xs) = k x (foldr k z xs)<br /><br />-- Apply the substitution, using these values:<br />-- k = (>>)<br />-- z = return ()<br />-- x = print 1<br />-- xs = take 2 (repeat (print 1))<br /><br />main =<br /> (>>) (print 1)<br /> (foldr (>>) (return ()) (take 2 (repeat (print 1))))<br /><br />-- Note: "(>>) x y" is the same thing as "x >> y"<br />main = print 1 >> foldr (>>)<br /> (return ())<br /> (take 2 (repeat (print 1)))<br /><br />-- Note: "x >> y" is the same thing as "do { x; y }"<br />main = do<br /> print 1<br /> foldr (>>) (return ()) (take 2 (repeat (print 1)))</code></pre><p>Now our Haskell runtime knows enough information to realize that it needs to <code>print</code> a single 1. 
The language is smart and will execute the first <code>print</code> statement before further evaluating the call to <code>foldr</code>.</p><p>We can repeat this process two more times, cycling through evaluating <code>repeat</code>, <code>take</code>, and <code>foldr</code>, which emit two additional <code>print</code> commands:</p><pre><code>-- Evaluate `repeat`<br />main = do<br /> print 1<br /> foldr (>>) (return ()) (take 2 (print 1:repeat (print 1)))<br /><br />-- Evaluate `take`<br />main = do<br /> print 1<br /> foldr (>>) (return ()) (print 1:take 1 (repeat (print 1)))<br /><br />-- Evaluate `foldr`<br />main = do<br /> print 1<br /> print 1<br /> foldr (>>) (return ()) (take 1 (repeat (print 1)))<br /><br />-- Evaluate `repeat`<br />main = do<br /> print 1<br /> print 1<br /> foldr (>>) (return ()) (take 1 (print 1:repeat (print 1)))<br /><br />-- Evaluate `take`<br />main = do<br /> print 1<br /> print 1<br /> foldr (>>) (return ()) (print 1:take 0 (repeat (print 1)))<br /><br />-- Evaluate `foldr`<br />main = do<br /> print 1<br /> print 1<br /> print 1<br /> foldr (>>) (return ()) (take 0 (repeat (print 1)))</code></pre><p>Now something different will happen. Let's revisit our definition of <code>take</code>:</p><pre><code>take n _ | n <= 0 = []<br />take _ [] = []<br />take n (x:xs) = x : take (n-1) xs</code></pre><p>This time the first equation matches, because <code>n</code> is now equal to <code>0</code>. However, we also know the third equation will match because <code>repeat</code> will emit a new element. 
Whenever more than one equation matches Haskell takes the first one by default, so we use the first equation to substitute the call to <code>take</code> with an empty list:</p><pre><code>main = do<br /> print 1<br /> print 1<br /> print 1<br /> foldr (>>) (return ()) []</code></pre><p>This then triggers the first equation for <code>foldr</code>:</p><pre><code>foldr k z [] = z<br /><br />-- Therefore:<br />main = do<br /> print 1<br /> print 1<br /> print 1<br /> return ()</code></pre><p>There it is! We've fully expanded out our use of <code>replicateM_</code> to prove that it <code>print</code>s the number 1 three times. For reasons I won't go into, we can also remove the final <code>return ()</code> and finish with:</p><pre><code>main = do<br /> print 1<br /> print 1<br /> print 1</code></pre><br /><h4 id="equational-reasoning---part-1">Equational reasoning - Part 1</h4><p>Notice how Haskell has a very straightforward way to interpret all code: substitution. If you can substitute equals-for-equals then you can interpret a Haskell program on pen and paper. Substitution is the engine of application state.</p><p>Equally cool: we never needed to understand the language runtime to know what our code did. The Haskell language has a very limited role: ensure that substitution is valid and otherwise get out of our way.</p><p>Unlike imperative languages, there are no extra language statements such as <code>for</code> or <code>break</code> that we need to understand in order to interpret our code. Our printing "loop" that repeated three times was just a bunch of ordinary Haskell code. This is a common theme in Haskell: "language features" are usually ordinary libraries.</p><p>Also, had we tried the same pen-and-paper approach to interpreting an imperative language we would have had to keep track of temporary values somewhere in the margins while evaluating our program. 
In Haskell, all the "state" of our program resides within the expression we are evaluating, and in our case the "state" was the integer argument to <code>take</code> that we threaded through our substitutions. Haskell code requires less context to understand than imperative code because Haskell expressions are self-contained.</p><p>We didn't need to be told to keep track of state. We just kept mindlessly applying substitution and perhaps realized after the fact that what we were doing was equivalent to a state machine. Indeed, state is implemented within the language just like everything else.</p><br /><h4 id="proof-reduction">Proof reduction</h4><p>Proving the behavior of our code was really tedious, and we're ultimately interested in proving more generally reusable properties rather than deducing the behavior of specific, disposable programs. However, we spent a lot of effort to prove our last equation, so we want to pick our battles wisely and only spend time proving equations that we can reuse heavily.</p><p>So I will propose that we should only really bother to prove the following four equations in order to cover most common uses of <code>replicateM_</code>:</p><pre><code>-- replicateM_ "distributes" over addition<br />replicateM_ 0 x = return ()<br />replicateM_ (m + n) x = replicateM_ m x >> replicateM_ n x<br /><br />-- replicateM_ "distributes" over multiplication<br />replicateM_ 1 = id<br />replicateM_ (m * n) = replicateM_ m . replicateM_ n</code></pre><p>The last two equations are written in a "point-free style" to emphasize that <code>replicateM_</code> distributes in a different way over multiplication. 
If you expand those two equations out to a "point-ful style" you get:</p><pre><code>replicateM_ (m * n) x = replicateM_ m (replicateM_ n x)<br />replicateM_ 1 x = x</code></pre><p>If we could prove these equations, then we could much more easily deduce how <code>replicateM_</code> behaves for our example program, because we could transform our code more rapidly like this:</p><pre><code>replicateM_ 3 (print 1)<br /><br />-- 3 = 1 + 1 + 1<br />= replicateM_ (1 + 1 + 1) (print 1)<br /><br />-- replicateM_ distributes over addition<br />= do replicateM_ 1 (print 1)<br /> replicateM_ 1 (print 1)<br /> replicateM_ 1 (print 1)<br /><br />-- replicateM_ 1 x = x<br />= do print 1<br /> print 1<br /> print 1</code></pre><p>These four master equations are still very tedious to prove all in one go, but we can break this complex task into smaller bite-sized tasks. As a bonus, this divide-and-conquer approach will also produce several other useful and highly reusable equations along the way.</p><p>Let's begin by revisiting the definition of <code>replicateM_</code>:</p><pre><code>replicateM_ n x = sequence_ (replicate n x)</code></pre><p>Like many Haskell functions, <code>replicateM_</code> is built from two smaller composable pieces: <code>sequence_</code> and <code>replicate</code>. 
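<p>Before launching into proofs, we can at least spot-check the four master equations at sample values. This sketch (my addition, not from the original post) runs both sides of each equation in a <code>Writer</code> monad, where <code>tell [1]</code> stands in for an arbitrary observable action:</p>

```haskell
import Control.Monad (replicateM_)
import Control.Monad.Trans.Writer (Writer, execWriter, tell)

-- An observable action: its log reveals how many times it ran
x :: Writer [Int] ()
x = tell [1]

run :: Writer [Int] () -> [Int]
run = execWriter

main :: IO ()
main = do
  -- replicateM_ 0 x = return ()
  print (run (replicateM_ 0 x) == run (return ()))
  -- replicateM_ (m + n) x = replicateM_ m x >> replicateM_ n x, at m = 2, n = 3
  print (run (replicateM_ (2 + 3) x) == run (replicateM_ 2 x >> replicateM_ 3 x))
  -- replicateM_ 1 x = x
  print (run (replicateM_ 1 x) == run x)
  -- replicateM_ (m * n) x = replicateM_ m (replicateM_ n x), at m = 2, n = 3
  print (run (replicateM_ (2 * 3) x) == run (replicateM_ 2 (replicateM_ 3 x)))
```

<p>A passing spot-check is not a proof, but a failing one would save us from wasting effort on a false equation.</p>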
So perhaps we can also build our proofs from smaller, composable proofs about the individual behaviors of <code>sequence_</code> and <code>replicate</code>.</p><p>Indeed, <code>replicate</code> possesses a set of properties that are remarkably similar to <code>replicateM_</code>:</p><pre><code>-- replicate distributes over addition<br />replicate 0 x = []<br />replicate (m + n) x = replicate m x ++ replicate n x<br /><br />-- replicate distributes over multiplication<br />replicate 1 = return<br />replicate (m * n) = replicate m <=< replicate n</code></pre><p>Again, the last two equations can be expanded to a "point-ful" form:</p><pre><code>-- `return x` is `[x]` for lists<br />replicate 1 x = [x]<br /><br />-- `(<=<)` is like a point-free `concatMap` for lists<br />replicate (m * n) x = concatMap (replicate m) (replicate n x)</code></pre><p>These four <code>replicate</code> equations are easier to prove than the corresponding <code>replicateM_</code> equations. If we can somehow prove these simpler equations, then all we have to do is prove that <code>sequence_</code> lifts all of the <code>replicate</code> proofs into the equivalent <code>replicateM_</code> proofs:</p><pre><code>-- sequence_ lifts list concatenation to sequencing<br />sequence_ [] = return ()<br />sequence_ (xs ++ ys) = sequence_ xs >> sequence_ ys<br /><br />-- (sequence_ .) lifts list functions to ordinary functions<br />sequence_ . return = id<br />sequence_ . (f <=< g) = (sequence_ . f) . (sequence_ . g)<br /><br />-- Point-ful version of last two equations:<br />sequence_ [x] = x<br />sequence_ (concatMap f (g x)) = sequence_ (f (sequence_ (g x)))</code></pre><p>It might not be obvious at first, but the above four equations for <code>sequence_</code> suffice to transform the <code>replicate</code> proofs into the analogous <code>replicateM_</code> proofs. 
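<p>These <code>sequence_</code> equations can themselves be spot-checked on concrete lists of <code>Writer</code> actions before we commit to proving them. The following is a sketch of mine, not from the original post:</p>

```haskell
import Control.Monad.Trans.Writer (Writer, execWriter, tell)

run :: Writer [Int] () -> [Int]
run = execWriter

main :: IO ()
main = do
  let xs = [tell [1], tell [2]]
      ys = [tell [3]]
  -- sequence_ [] = return ()
  print (run (sequence_ ([] :: [Writer [Int] ()])) == run (return ()))
  -- sequence_ (xs ++ ys) = sequence_ xs >> sequence_ ys
  print (run (sequence_ (xs ++ ys)) == run (sequence_ xs >> sequence_ ys))
  -- sequence_ [x] = x
  print (run (sequence_ [tell [1]]) == run (tell [1]))
```

<p>The type annotation on the empty list is only there to pin down the element type, which is otherwise ambiguous.</p>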
For example, this is how we would prove the first <code>replicateM_</code> equation in terms of the first <code>replicate</code> equation and the first <code>sequence_</code> equation:</p><pre><code>replicateM_ 0 x<br /><br />-- Definition of `replicateM_`<br />= sequence_ (replicate 0 x)<br /><br />-- replicate 0 x = []<br />= sequence_ []<br /><br />-- sequence_ [] = return ()<br />= return ()</code></pre><p>That was simple. It is only slightly more tricky to prove the second equation:</p><pre><code>replicateM_ (m + n) x<br /><br />-- Definition of `replicateM_`<br />= sequence_ (replicate (m + n) x)<br /><br />-- replicate (m + n) x = replicate m x ++ replicate n x<br />= sequence_ (replicate m x ++ replicate n x)<br /><br />-- sequence_ (xs ++ ys) = sequence_ xs >> sequence_ ys<br />= sequence_ (replicate m x) >> sequence_ (replicate n x)<br /><br />-- Definition of `replicateM_`, in reverse!<br />= replicateM_ m x >> replicateM_ n x</code></pre><p>Notice how the last step of the proof involves using the original <code>replicateM_</code> equation, but instead substituting from right-to-left!</p><pre><code>-- +-- We can substitute this way --+<br />-- | |<br />-- ^ v<br /><br />replicateM_ n x = sequence_ (replicate n x)<br /><br />-- ^ v<br />-- | |<br />-- +--------- Or this way! ---------+</code></pre><p>This is a nice example of the utility of bidirectional substitution. We can replace the body of a function with the equivalent function call.</p><p>The third <code>replicateM_</code> equation is also simple to prove in terms of the third <code>replicate</code> and <code>sequence_</code> equations. 
I will use the point-ful forms of all these equations for simplicity:</p><pre><code>replicateM_ 1 x<br /><br />-- Definition of `replicateM_`<br />= sequence_ (replicate 1 x)<br /><br />-- replicate 1 x = [x]<br />= sequence_ [x]<br /><br />-- sequence_ [x] = x<br />= x</code></pre><p>The fourth property is surprisingly short, too:</p><pre><code>replicateM_ (m * n) x<br /><br />-- Definition of `replicateM_`<br />= sequence_ (replicate (m * n) x)<br /><br />-- replicate (m * n) x = concatMap (replicate m) (replicate n x)<br />= sequence_ (concatMap (replicate m) (replicate n x))<br /><br />-- sequence_ (concatMap f (g x)) = sequence_ (f (sequence_ (g x)))<br />= sequence_ (replicate m (sequence_ (replicate n x)))<br /><br />-- Definition of `replicateM_`, in reverse<br />= replicateM_ m (replicateM_ n x)</code></pre><br /><h4 id="equational-reasoning---part-2">Equational reasoning - Part 2</h4><p>We reduced our proofs of the <code>replicateM_</code> properties to smaller proofs for <code>replicate</code> and <code>sequence_</code> properties. The overhead of this proof reduction is tiny and we can gain the benefit of reusing proofs for <code>replicate</code> and <code>sequence_</code>.</p><p>As programmers we try to reuse code when we program and the way we promote code reuse is to divide programs into smaller composable pieces that are more reusable. Likewise, we try to reuse proofs when we equationally reason about code and the way we encourage proof reuse is to divide larger proofs into smaller proofs using proof reduction. In the above example we reduced the four equations for <code>replicateM_</code> into four equations for <code>replicate</code> and four equations for <code>sequence_</code>. These smaller equations are equally useful in their own right and they can be reused by other people as sub-proofs for their own proofs.</p><p>However, proof reuse also faces the same challenges as code reuse. 
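<p>One cheap way to vet a candidate sub-proof before investing effort in it is to test the equation at sample values. For instance, the four <code>replicate</code> equations that the proofs above rely on can be spot-checked like this (a sketch of mine, at <code>m = 2</code>, <code>n = 3</code>):</p>

```haskell
-- Spot-check of the four replicate equations at m = 2, n = 3
main :: IO ()
main = do
  print (replicate 0 'x' == "")
  print (replicate (2 + 3) 'x' == replicate 2 'x' ++ replicate 3 'x')
  print (replicate 1 'x' == ['x'])
  print (replicate (2 * 3) 'x' == concatMap (replicate 2) (replicate 3 'x'))
```

<p>Had any of these printed <code>False</code>, we would know the equation was not worth proving.</p>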
When we break up code into smaller pieces sometimes we take things too far and create components we like to think are reusable but really aren't. Similarly, when we reduce proofs sometimes we pick sub-proofs that are worthless and only add more overhead to the entire proof process. How can we sift out the gold from the garbage?</p><p>I find that the most reusable proofs are category laws or functor laws of some sort. In fact, every single proof from the previous section was a functor law in disguise. To learn more about functor laws and how they arise everywhere you can read another post of mine about the <a href="http://www.haskellforall.com/2012/09/the-functor-design-pattern.html">functor design pattern</a>.</p><br /><h4 id="proof-techniques">Proof techniques</h4><p>This section will walk through the complete proofs for the <code>replicate</code> equations to provide several worked examples and to also illustrate several useful proof tricks.</p><p>I deliberately write these proofs to be reasonably detailed and to skip as few steps as possible. In practice, though, proofs become much easier the more you equationally reason about code because you get better at taking larger steps.</p><p>Let's revisit the equations we wish to prove for <code>replicate</code>, in point-ful form:</p><pre><code>replicate 0 x = []<br />replicate (m + n) x = replicate m x ++ replicate n x<br /><br />replicate 1 x = [x]<br />replicate (m * n) x = concatMap (replicate m) (replicate n x)</code></pre><p>... where <code>concatMap</code> is <a href="http://hackage.haskell.org/package/base-4.3.1.0/docs/src/GHC-List.html#concatMap">defined as</a>:</p><pre><code>concatMap :: (a -> [b]) -> [a] -> [b]<br />concatMap f = foldr ((++) . 
f) []</code></pre><p>Now we must use the same equational reasoning skills we developed in the first section to prove all four of these equations.</p><p>The first equation is simple:</p><pre><code>replicate 0 x<br /><br />-- Definition of `replicate`<br />= take 0 (repeat x)<br /><br />-- Definition of `take`<br />= []<br /><br />-- Proof complete</code></pre><p>The second equation is trickier:</p><pre><code>replicate (m + n) x<br /><br />-- Definition of `replicate`<br />= take (m + n) (repeat x)</code></pre><p>We can't proceed further unless we know whether or not <code>m + n</code> is greater than <code>0</code>. For simplicity we'll assume that <code>m</code> and <code>n</code> are non-negative.</p><p>We then do something analogous to "case analysis" on <code>m</code>, pretending it is like a Peano number. That means that <code>m</code> is either <code>0</code> or positive (i.e. greater than 0). We'll first prove our equation for the case where <code>m</code> equals <code>0</code>:</p><pre><code>-- Assume: m = 0<br />= take (0 + n) (repeat x)<br /><br />-- 0 + n = n<br />= take n (repeat x)<br /><br />-- Definition of `(++)`, in reverse<br />= [] ++ take n (repeat x)<br /><br />-- Definition of `take`, in reverse<br />= take 0 (repeat x) ++ take n (repeat x)<br /><br />-- m = 0<br />= take m (repeat x) ++ take n (repeat x)<br /><br />-- Definition of `replicate`, in reverse<br />= replicate m x ++ replicate n x<br /><br />-- Proof complete for m = 0</code></pre><p>Then there is the second case, where <code>m</code> is positive, meaning that we can represent <code>m</code> as <code>1</code> plus some other non-negative number <code>m'</code>:</p><pre><code>-- Assume: m = 1 + m'<br />= take (1 + m' + n) (repeat x)<br /><br />-- Definition of `repeat`<br />= take (1 + m' + n) (x:repeat x)<br /><br />-- Definition of `take`<br />= x:take (m' + n) (repeat x)<br /><br />-- Definition of `replicate`, in reverse<br />= x:replicate (m' + n) x</code></pre><p>Now we can use 
induction to reuse the original premise since <code>m'</code> is strictly smaller than <code>m</code>. Since we are assuming that <code>m</code> is non-negative this logical recursion is well-founded and guaranteed to eventually bottom out at the base case where <code>m</code> equals <code>0</code>:</p><pre><code>-- Induction: reuse the premise<br />= x:(replicate m' x ++ replicate n x)<br /><br />-- Definition of `(++)`, in reverse<br />= (x:replicate m' x) ++ replicate n x<br /><br />-- Definition of `replicate`<br />= (x:take m' (repeat x)) ++ replicate n x<br /><br />-- Definition of `take`, in reverse<br />= take (1 + m') (repeat x) ++ replicate n x<br /><br />-- Definition of `replicate`, in reverse<br />= replicate (1 + m') x ++ replicate n x<br /><br />-- m = 1 + m', in reverse<br />= replicate m x ++ replicate n x<br /><br />-- Proof complete for m = 1 + m'</code></pre><p>This completes the proof for both cases so the proof is "total", meaning that we covered all possibilities. Actually, that's a lie because really rigorous Haskell proofs must account for the possibility of non-termination (a.k.a. "bottom"). However, I usually consider proofs that don't account for non-termination to be good enough for most practical purposes.</p><p>The third <code>replicate</code> law is very straightforward to prove:</p><pre><code>replicate 1 x<br /><br />-- Definition of `replicate`<br />= take 1 (repeat x)<br /><br />-- Definition of `repeat`<br />= take 1 (x:repeat x)<br /><br />-- Definition of `take`<br />= x:take 0 (repeat x)<br /><br />-- Definition of `take`<br />= x:[]<br /><br />-- [x] is syntactic sugar for `x:[]`<br />= [x]</code></pre><p>The fourth equation for <code>replicate</code> also requires us to split our proof into two branches. Either <code>n</code> is zero or greater than zero. 
First we consider the case where <code>n</code> is zero:</p><pre><code>replicate (m * n) x<br /><br />-- Assume: n = 0<br />= replicate 0 x<br /><br />-- replicate 0 x = []<br />= []<br /><br />-- Definition of `foldr`, in reverse<br />= foldr ((++) . replicate m) [] []<br /><br />-- Definition of `concatMap`, in reverse<br />= concatMap (replicate m) []<br /><br />-- replicate 0 x = [], in reverse<br />= concatMap (replicate m) (replicate 0 x)<br /><br />-- n = 0, in reverse<br />= concatMap (replicate m) (replicate n x)<br /><br />-- Proof complete for n = 0</code></pre><p>Then we consider the case where <code>n</code> is greater than zero:</p><pre><code>replicate (m * n) x<br /><br />-- Assume: n = 1 + n'<br />= replicate (m * (1 + n')) x<br /><br />-- m * (1 + n') = m + m * n'<br />= replicate (m + m * n') x<br /><br />-- replicate distributes over addition<br />= replicate m x ++ replicate (m * n') x<br /><br />-- Induction: reuse the premise<br />= replicate m x ++ concatMap (replicate m) (replicate n' x)<br /><br />-- Definition of `concatMap`<br />= replicate m x ++ foldr ((++) . replicate m) [] (replicate n' x)<br /><br />-- Definition of `foldr`, in reverse<br />= foldr ((++) . 
replicate m) [] (x:replicate n' x)<br /><br />-- Definition of `concatMap`, in reverse<br />= concatMap (replicate m) (x:replicate n' x)<br /><br />-- Definition of `replicate`<br />= concatMap (replicate m) (x:take n' (repeat x))<br /><br />-- Definition of `take`, in reverse<br />= concatMap (replicate m) (take (1 + n') (x:repeat x))<br /><br />-- n = 1 + n', in reverse<br />= concatMap (replicate m) (take n (x:repeat x))<br /><br />-- Definition of `repeat`, in reverse<br />= concatMap (replicate m) (take n (repeat x))<br /><br />-- Definition of `replicate`, in reverse<br />= concatMap (replicate m) (replicate n x)<br /><br />-- Proof complete for n = 1 + n'</code></pre><p>Hopefully these proofs give an idea of the amount of effort involved to prove properties of moderate complexity. I omitted the final part of proving the <code>sequence_</code> equations in the interest of space, but they make for a great exercise.</p><br /><h4 id="equational-reasoning---part-3">Equational Reasoning - Part 3</h4><p>Reasoning about Haskell differs from reasoning about code in other languages. Traditionally, reasoning about code would require:</p><ul class="incremental"><li>building a formal model of a program's algorithm,</li><li>reasoning about the behavior of the formal model using its axioms, and</li><li>proving that the program matches the model.</li></ul><p>In a purely functional language like Haskell you formally reason about your code within the language itself. There is no need for a separate formal model because the code is already written as a bunch of formal equations. This is what people mean when they say that Haskell has strong ties to mathematics, because you can reason about Haskell code the same way you reason about mathematical equations.</p><p>This is why Haskell syntax for function definitions deviates from mainstream languages. 
All function definitions are just equalities, which is why Haskell is great for equational reasoning.</p><p>This post illustrated how equational reasoning in Haskell can scale to larger complexity through the use of proof reduction. A future post of mine will walk through a second tool for reducing proof complexity: type classes inspired by mathematics that come with associated type class laws.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com2tag:blogger.com,1999:blog-1777990983847811806.post-65599578919461998682013-12-19T19:43:00.000-08:002013-12-19T19:43:11.287-08:00Lift error handling with lens-like syntax<p>One of the main deficiencies of <code>transformers</code> is that the <code>MonadTrans</code> class does not let you lift functions like <code>catchError</code>. The <code>mtl</code> library provides one solution to this problem, which is to type-class <code>catchError</code> and <code>throwError</code> using the <code>MonadError</code> type class.</p><p>That's not to say that <code>transformers</code> has no solution for lifting <code>catchError</code> and <code>throwError</code>; it's just really verbose. Each module provides a <code>liftCatch</code> function that you use to lift a <code>catchError</code> function from the base monad to the transformed monad.</p><p>For example, <code>Control.Monad.Trans.State</code> provides the following <code>liftCatch</code> function:</p><pre><code>Control.Monad.Trans.State.liftCatch<br /> :: (m (a, s) -> (e -> m (a, s)) -> m (a, s))<br /> -> StateT s m a -> (e -> StateT s m a) -> StateT s m a</code></pre><p>If your monad transformer stack was:</p><pre><code>myStack :: (Monad m, Error e) => StateT s (ErrorT e m)</code></pre><p>... 
then you would apply <code>liftCatch</code> directly to <code>catchError</code> to get a catch function that worked at the <code>StateT</code> layer:</p><pre><code>Control.Monad.Trans.State.liftCatch catchError<br /> :: (Monad m, Error e)<br /> => StateT s (ErrorT e m) a<br /> -> (e -> StateT s (ErrorT e m) a)<br /> -> StateT s (ErrorT e m) a</code></pre><p>But what if you had a <code>WriterT</code> layer in between like this:</p><pre><code>myStack :: (Monad m) => StateT s (WriterT w (ErrorT e m))</code></pre><p>You'd have to use <code>liftCatch</code> from the <code>Control.Monad.Trans.Writer</code> module to further lift the <code>catchError</code> to work with this stack:</p><pre><code>import Control.Monad.Trans.State as S<br />import Control.Monad.Trans.Writer as W<br /><br />S.liftCatch (W.liftCatch catchError)<br /> :: (Monad m, Error e)<br /> => StateT s (WriterT w (ErrorT e m)) a<br /> -> (e -> StateT s (WriterT w (ErrorT e m)) a)<br /> -> StateT s (WriterT w (ErrorT e m)) a</code></pre><p>The advantage of this solution is that type inference works extraordinarily well and it is Haskell98. The disadvantage of this solution is that it is very verbose.</p><p>So I will propose a less verbose solution that has a syntax resembling the <code>lens</code> library, although there will be no true lenses involved (I think). Define the following <code>catching</code> function:</p><pre><code>catching<br /> :: (Monad m, Error e)<br /> => ((ErrorT e m a -> (e -> ErrorT e m a) -> ErrorT e m a) -> r)<br /> -> r<br />catching k = k catchError</code></pre><p>The <code>k</code> in the above function will be a series of composed <code>liftCatch</code> functions that we will apply to <code>catchError</code>. 
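<p>To make the lifting concrete, here is a small runnable sketch of mine. It uses <code>ExceptT</code> and <code>catchE</code> from modern <code>transformers</code> (which have since superseded <code>ErrorT</code> and <code>catchError</code>), but the shape of the lifting is the same; the names <code>action</code> and <code>recovered</code> are my own:</p>

```haskell
import Control.Monad.Trans.Class (lift)
import Control.Monad.Trans.Except (ExceptT, catchE, runExceptT, throwE)
import qualified Control.Monad.Trans.State as S

-- A StateT-over-ExceptT action that throws in the base monad
action :: S.StateT Int (ExceptT String IO) ()
action = do
    S.put 1
    lift (throwE "boom")

-- `S.liftCatch catchE` lifts the base-monad catch to the StateT layer.
-- Note that state changes made before the throw are rolled back: the
-- handler restarts from the state at entry.
recovered :: S.StateT Int (ExceptT String IO) ()
recovered = S.liftCatch catchE action (\_ -> S.put 99)

main :: IO ()
main = do
    r <- runExceptT (S.execStateT recovered 0)
    print r  -- Right 99
```

<p>Replacing <code>catchE</code> with a further-lifted catch, exactly as in the <code>WriterT</code> example above, extends this to deeper stacks.</p>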
However, in the spirit of the <code>lens</code> library we will rename these <code>liftCatch</code> functions to be less verbose and more hip and sexy:</p><pre><code>import qualified Control.Monad.Trans.Maybe as M<br />import qualified Control.Monad.Trans.Reader as R<br />import qualified Control.Monad.Trans.State as S<br />import qualified Control.Monad.Trans.Writer as W<br />import qualified Pipes.Lift as P<br /><br />state = S.liftCatch<br /><br />writer = W.liftCatch<br /><br />reader = R.liftCatch<br /><br />maybe = M.liftCatch<br /><br />pipe = P.liftCatchError</code></pre><p>Now let's say that we have a monad transformer stack of type:</p><pre><code>myStack<br /> :: (Monad m, Error e)<br /> => StateT s (MaybeT (ErrorT e m)) r</code></pre><p>If I want to catch something at the outer <code>StateT</code> level, I would just write:</p><pre><code>myStack = catching (state.maybe) action $ \e -> do<br /> -- The handler if `action` fails<br /> ...</code></pre><p>If I add more layers to my monad transformer stack, all I have to do is point deeper into the stack by composing more references:</p><pre><code>myStack<br /> :: (Monad m, Error e, Monoid w)<br /> => Pipe a b (StateT s (WriterT w (MaybeT (ErrorT e m)))) r<br />myStack = catching (pipe.state.writer.maybe) action $ \e -> ...</code></pre><p>Also, these references are first-class Haskell values, so you can bundle them and manipulate them with your favorite functional tools:</p><pre><code>myStack = do<br /> let total = pipe.state.writer.maybe<br /> catching total action1 $ \e -> ...<br /> catching total action2 $ \e -> ...</code></pre><p>This approach has a few advantages over the traditional <code>MonadError</code> approach:</p><ul class="incremental"><li>You get improved type inference</li><li>You get type errors earlier in development. With <code>MonadError</code> the compiler will not detect an error until you try to run your monad transformer stack.</li><li>You get better type errors. 
<code>MonadError</code> errors will arise at a distance where you call <code>runErrorT</code> even though the logical error is probably at the site of the <code>catchError</code> function.</li><li>Functional references are first class and type classes are not</li></ul><p>However, we don't want to lift just <code>catchError</code>. There are many other functions that <code>transformers</code> can lift such as <code>local</code>, <code>listen</code>, and <code>callCC</code>. An interesting question would be whether there is some elegant abstraction that packages all these lifting operations into a simple type in the same way that lenses package getters, setters, traversals, maps, and zooms into a single abstraction. If there were, then we could reuse the same references for catching, listening, and other operations that are otherwise difficult to lift:</p><pre><code>listening (state.error) m<br /><br />passing (maybe.writer) m<br /><br />catching (writer.maybe) m $ \e -> ...</code></pre><p>I have no idea if such an elegant abstraction exists, though, which is why I'm writing this post to solicit ideas.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-41246756009904470912013-11-01T23:22:00.000-07:002013-11-02T07:34:19.334-07:00Test stream programming using Haskell's `QuickCheck`<p><code>pipes</code> is a stream programming library built on top of a foundation of basic category theory. The core of the library consists of a set of five categories that all intersect in a single streaming data type and the library's contract is the set of associated category laws.</p><p>For example, one such category is the "respond category", which <code>pipes</code> uses to implement <code>for</code> loops and <code>ListT</code>. The two key operations are <code>(/>/)</code>, which is the category composition operator, and <code>respond</code>, which is the identity. 
These must satisfy the following category laws:</p><pre><code>respond />/ f = f -- Left Identity<br /><br />f />/ respond = f -- Right Identity<br /><br />(f />/ g) />/ h = f />/ (g />/ h) -- Associativity</code></pre><p>Previously, I described how I manually proved the category laws <a href="http://www.haskellforall.com/2013/10/manual-proofs-for-pipes-laws.html">using equational reasoning</a>, which elicited a strong response from readers that I should also verify the laws empirically using Haskell's <code>QuickCheck</code> library. After all, Haskell's <code>QuickCheck</code> library shines at automated property-based testing. Why not just fire up <code>QuickCheck</code> and test something like:</p><pre><code>> quickCheck $ \f -> f />/ respond == f</code></pre><p>However, this leads to several problems:</p><ul class="incremental"><li>You can't compare pipes for <code>Eq</code>uality</li><li><code>QuickCheck</code> can't <code>Show</code> pipes when it discovers a counter-example</li><li>You can't generate <code>Arbitrary</code> pipes</li></ul><p>The last is the most challenging problem of the three to solve: how do we generate random pipes to test? This has to be done in such a way that it exercises the system and efficiently discovers corner cases.</p><h4 id="randomizing-pipes">Randomizing pipes</h4><p>I decided to try encoding a random pipe as a random sequence of operations (specifically Kleisli arrows). This actually works out quite well, because we already have two natural operations built into the core bidirectional API: <code>request</code> and <code>respond</code>. 
Both of them allow information to pass through twice, as illustrated by the following ASCII diagrams:</p><pre><code>request :: Monad m => a' -> Proxy a' a y' y m a<br /><br /> a'<br /> |<br /> +----|----+<br /> | | |<br /> a' <=====/ |<br /> | |<br /> a ======\ |<br /> | | |<br /> +----|----+<br /> v<br /> a<br /><br /><br />respond :: Monad m => a -> Proxy x' x a' a m a'<br /><br /> a<br /> |<br /> +----|----+<br /> | | |<br /> | \ /==== a'<br /> | X |<br /> | / \===> a<br /> | | |<br /> +----|----+<br /> v <br /> a'</code></pre><p>If we generate a random sequence of them, we just connect them head to toe. For simplicity, I will just assume that all inputs and outputs are of type <code>Int</code>:</p><pre><code>pRandom1 = respond >=> request >=> respond<br /><br /> Int<br /> |<br /> +----|----+<br /> | | |<br /> | \ /==== Int<br /> | X |<br /> | / \===> Int<br /> | | |<br /> +----|----+<br /> v <br /> Int<br /> |<br /> +----|----+<br /> | | |<br />Int <=====/ |<br /> | |<br />Int ======\ |<br /> | | |<br /> +----|----+<br /> v<br /> Int<br /> |<br /> +----|----+<br /> | | |<br /> | \ /==== Int<br /> | X |<br /> | / \===> Int<br /> | | |<br /> +----|----+<br /> v <br /> Int</code></pre><p>When we compose proxies using something like <code>(>+>)</code> (the bidirectional generalization of <code>(>->)</code>), we conceptually place these random chains side by side and match inputs with outputs. For example, if we generate the following two random sequences of <code>request</code> and <code>respond</code>:</p><pre><code>pRandom1 = respond >=> request >=> respond<br />pRandom2 = request >=> request</code></pre><p>... 
then this generates the following flow of information, beginning from the top-right <code>Int</code>.</p><pre><code>pRandom1 >+> pRandom2:<br /><br /> Int<br /> |<br /> +----|----+<br /> | | |<br /> Int<======================/ |<br /> | | |<br /> +----|----+ /======\ |<br /> | | | / | | |<br /> | \ /==== Int <=\ / +----|----+<br /> | X | X |<br /> | / \===> Int ==/ \ v<br /> | | | | Int<br /> +----|----+ | |<br /> v \ +----|----+<br /> Int \ | | |<br /> | \======/ |<br /> +----|----+ | |<br /> | | | /======\ |<br />... <=====/ | / | | |<br /> | | / +----|----+<br />... ======\ | | v<br /> | | | | Int<br /> +----|----+ |<br /> v |<br /> Int |<br /> | |<br /> +----|----+ |<br /> | | | |<br /> | \ /==== Int /<br /> | X | /<br /> | / \===> Int ==/<br /> | | |<br /> +----|----+<br /> v <br /> Int</code></pre><p>This shows how you can easily generate complex flows of information to exercise the API with just a few simple permutations of <code>request</code> and <code>respond</code>.</p><p>However, notice how the <code>Int</code> flows through the pipeline without interruption until the pipeline terminates. This means that unless we do something else we will always return the same <code>Int</code> we began with.</p><p>So we add two additional actions to our random repertoire which will provide diagnostic information about which paths we took. 
The first one is <code>inc</code>, which increments the <code>Int</code> flowing through it:</p><pre><code>inc :: (Monad m) => Int -> m Int<br />inc n = return (n + 1)</code></pre><p>The second is <code>log</code>, which stores the current value of the <code>Int</code> using a <code>Writer []</code> base monad and reforwards the <code>Int</code> further along:</p><pre><code>log :: Int -> Proxy a' a b' b (Writer [Int]) Int<br />log n = do<br /> lift (tell [n])<br /> return n</code></pre><p>Now we can easily generate random <code>Proxy</code>s just by assembling random chains of these four actions:</p><pre><code>pRandom = log >=> inc >=> request >=> inc >=> inc >=> ...</code></pre><p>We can use this same trick to generate arbitrary <code>Server</code>s and <code>Client</code>s (i.e. one-sided <code>Proxy</code>s) by just selectively removing either <code>request</code> or <code>respond</code> from the list of permissible actions.</p><p>Once we have random <code>Proxy</code>s, <code>Server</code>s, and <code>Client</code>s, we can test for equality by testing the values that each side returns and logs when given an initial value of <code>0</code>:</p><pre><code>p1 === p2 = run p1 == run p2<br /> where<br /> run p = runWriter (runEffect (p 0))<br /><br />infix 4 ===</code></pre><p>This comparison is pure because the <code>Writer []</code> base monad is pure, so we can pass it as a suitable property that <code>QuickCheck</code> can test. Well, almost...</p><p>We also need to be able to <code>Show</code> the randomized values that we selected so that <code>QuickCheck</code> can print any counter-examples it discovers. The solution, though, is pretty simple. We can use an intermediate representation that is just an enumeration. 
This just stores placeholders for each action in our chain, and these placeholders are <code>Show</code>able:</p><pre><code>-- Place holder for a single `Proxy` action<br />data ProxyStep<br /> = ProxyRequest<br /> | ProxyRespond<br /> | ProxyLog<br /> | ProxyInc<br /> deriving (Enum, Bounded)<br /><br />instance Arbitrary ProxyStep where<br /> arbitrary = arbitraryBoundedEnum<br /> shrink _ = []<br /><br />instance Show ProxyStep where<br /> show x = case x of<br /> ProxyRequest -> "request"<br /> ProxyRespond -> "respond"<br /> ProxyLog -> "log"<br /> ProxyInc -> "inc"<br /><br />-- Place holder for a single `Client` action<br />data ClientStep<br /> = ClientRequest<br /> | ClientLog<br /> | ClientInc<br /> deriving (Enum, Bounded)<br /><br />instance Arbitrary ClientStep where<br /> arbitrary = arbitraryBoundedEnum<br /> shrink _ = []<br /><br />instance Show ClientStep where<br /> show x = case x of<br /> ClientRequest -> "request"<br /> ClientLog -> "log"<br /> ClientInc -> "inc"<br /><br />-- Place holder for a single `Server` action<br />data ServerStep<br /> = ServerRespond<br /> | ServerLog<br /> | ServerInc<br /> deriving (Enum, Bounded)<br /><br />instance Arbitrary ServerStep where<br /> arbitrary = arbitraryBoundedEnum<br /> shrink _ = []<br /><br />instance Show ServerStep where<br /> show x = case x of<br /> ServerRespond -> "respond"<br /> ServerLog -> "log"<br /> ServerInc -> "inc"</code></pre><p>Then we tell <code>QuickCheck</code> to generate random lists of these actions which will form our complete <code>Proxy</code>s, <code>Client</code>s, or <code>Server</code>s:</p><pre><code>correct :: String -> String<br />correct str = case str of<br /> [] -> "return"<br /> _ -> str<br /><br />-- Place holder for a `Proxy`<br />newtype AProxy = AProxy { unAProxy :: [ProxyStep] }<br /><br />instance Arbitrary AProxy where<br /> arbitrary = fmap AProxy arbitrary<br /> shrink = map AProxy . shrink . 
unAProxy<br /><br />instance Show AProxy where<br /> show = correct . intercalate " >=> " . map show . unAProxy<br /><br />-- Place holder for a `Client`<br />newtype AClient = AClient { unAClient :: [ClientStep] }<br /><br />instance Arbitrary AClient where<br /> arbitrary = fmap AClient arbitrary<br /> shrink = map AClient . shrink . unAClient<br /><br />instance Show AClient where<br /> show = correct . intercalate " >=> " . map show . unAClient<br /><br />-- Place holder for a `Server`<br />newtype AServer = AServer { unAServer :: [ServerStep] }<br /><br />instance Arbitrary AServer where<br /> arbitrary = fmap AServer arbitrary<br /> shrink = map AServer . shrink . unAServer<br /><br />instance Show AServer where<br /> show = correct . intercalate " >=> " . map show . unAServer</code></pre><p>Once <code>QuickCheck</code> generates these randomized lists of actions we can convert them to honest-to-god <code>Proxy</code>s/<code>Server</code>s/<code>Client</code>s for testing purposes using the following conversion functions:</p><pre><code>aProxy<br /> :: AProxy -> Int -> Proxy Int Int Int Int (Writer [Int]) Int<br />aProxy = foldr (>=>) return . map f . unAProxy<br /> where<br /> f x = case x of<br /> ProxyRequest -> request<br /> ProxyRespond -> respond<br /> ProxyLog -> log<br /> ProxyInc -> inc<br /><br />aClient :: AClient -> Int -> Client Int Int (Writer [Int]) Int<br />aClient = foldr (>=>) return . map f . unAClient<br /> where<br /> f x = case x of<br /> ClientRequest -> request<br /> ClientLog -> log<br /> ClientInc -> inc<br /><br />aServer :: AServer -> Int -> Server Int Int (Writer [Int]) Int<br />aServer = foldr (>=>) return . map f . 
unAServer<br /> where<br /> f x = case x of<br /> ServerRespond -> respond<br /> ServerLog -> log<br /> ServerInc -> inc</code></pre><h4 id="test-driving-pipes">Test-driving <code>pipes</code></h4><p>Combined with the previous <code>(===)</code> function for testing equality, I can now test high-level properties like associativity:</p><pre><code>main = quickCheck $ \p1' p2' p3' -><br /> let p1 = aServer p1'<br /> p2 = aProxy p2'<br /> p3 = aClient p3'<br /> in p1 >+> (p2 >+> p3) === (p1 >+> p2) >+> p3</code></pre><p><code>QuickCheck</code> then generates 100 random test cases and verifies that all of them obey the associativity law:</p><pre><code>>>> main<br />++ OK, passed 100 tests.</code></pre><p>However, this is still not enough. Perhaps my randomization scheme is simply not exercising corner cases sufficiently well. To really convince myself that I have a good randomization scheme I must try a few negative controls to see how effectively <code>QuickCheck</code> uncovers property violations.</p><p>For example, let's suppose that some enterprising young NSA employee were to try to commit a modification to the identity pipe <code>pull'</code> to try to <code>log</code> the second value flowing upstream. We could set up a test case to warn us if the modified <code>pull'</code> function failed to obey the identity law:</p><pre><code>main = quickCheck $ \p1' p2' -><br /> let p1 = aServer p1'<br /><br /> -- The maliciously modified `pull` function<br /> pull' = request >=> respond >=> log >=> pull'<br /><br /> p2 = aClient p2'<br /> in p1 >+> (pull' >+> p2) === p1 >+> p2</code></pre><p><code>QuickCheck</code> would then snoop this out in a hurry and catch the law violation.</p><pre><code>*** Failed! Falsifiable (after 12 tests and 6 shrinks): <br />respond<br />request >=> request</code></pre><p>Not only does <code>QuickCheck</code> detect violations, but it also goes out of its way to minimize the violation to the minimum reproducing test case. 
In this case, the way you read the <code>QuickCheck</code> output is that the minimal code necessary to trigger the violation is when:</p><pre><code>p1 = respond<br />p2 = request >=> request</code></pre><p>In other words, <code>QuickCheck</code> detects a spurious <code>log</code> on the left-hand side of the equation if <code>p2</code> <code>request</code>s two values and <code>p1</code> <code>respond</code>s with at least one value. Notice that if <code>p1</code> did not <code>respond</code> with at least one value then the left pipeline would terminate before <code>p2</code>'s second request and avoid triggering the <code>log</code> statement.</p><p><code>QuickCheck</code> can do this kind of minimization because of purity. Since our test case is pure, <code>QuickCheck</code> can safely run it repeatedly as it tries to <code>shrink</code> the counter-example, without having to worry that repeated runs will interfere with each other because of side effects or statefulness.</p><p>Here's another example, where we accidentally wrote <code>pull</code> wrong and inserted one <code>request</code> too many:</p><pre><code>main = quickCheck $ \p1' p3' -><br /> let p1 = aServer p1'<br /> pull' = request >=> respond >=> request >=> pull<br /> p3 = aClient p3'<br /> in p1 >+> (pull' >+> p3) === p1 >+> p3</code></pre><p><code>QuickCheck</code> immediately pulls out this clever counter-example:</p><pre><code>*** Failed! Falsifiable (after 15 tests and 8 shrinks): <br />respond >=> respond<br />request >=> request >=> inc</code></pre><p>This instructs <code>p1</code> and <code>p3</code> to exchange information twice. This triggers <code>pull'</code> to accidentally <code>request</code> one value too many after the second exchange and terminate early before <code>p3</code> can call <code>inc</code>.</p><p>Examples like these give me confidence that permutations on these four actions suffice to build most useful counter-examples.
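</p><p>The interpretation trick used above generalizes: <code>foldr (>=>) return</code> folds a list of placeholder actions into a single Kleisli arrow, and it works in any monad. Here is a minimal, self-contained sketch of the same pattern using only the <code>Writer</code> monad (the names <code>Step</code> and <code>interpret</code> are illustrative and not part of the actual test suite):</p><pre><code>import Control.Monad ((>=>))
import Control.Monad.Trans.Writer (Writer, runWriter, tell)

-- Placeholders for two of the permissible actions
data Step = StepLog | StepInc deriving (Show)

-- Fold the placeholders into one Kleisli arrow, the same
-- pattern that `aProxy`, `aClient`, and `aServer` use above
interpret :: [Step] -> Int -> Writer [Int] Int
interpret = foldr (>=>) return . map f
  where
    f StepLog n = tell [n] >> return n
    f StepInc n = return (n + 1)

main :: IO ()
main = print (runWriter (interpret [StepLog, StepInc, StepLog] 0))
-- prints (1,[0,1]): the final value alongside the logged values</code></pre><p>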
I could probably even narrow it down to three commands by eliminating <code>log</code>, but for now I will keep it.</p><h4 id="conclusion">Conclusion</h4><p>I believe this is the first example of a useful <code>Arbitrary</code> instance for randomizing a stream programming data type. This allows <code>pipes</code> to test more powerful properties than most stream programming libraries, which would normally settle for randomizing input to the system instead of randomizing the flow of control.</p><p>I want to thank <a href="https://github.com/archblob">Csernik Flaviu Andrei</a> who took the time to write up all the <code>pipes</code> laws into a <code>QuickCheck</code> suite integrated with <code>cabal test</code>. Thanks to him you can now have even greater confidence in the correctness of the <code>pipes</code> library.</p><p>The next step would be to machine-check the <code>pipes</code> laws using Agda, which would provide the greatest assurance of correctness.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com4tag:blogger.com,1999:blog-1777990983847811806.post-40514921928355561212013-10-12T15:51:00.000-07:002013-10-13T08:50:43.959-07:00An all-atom protein search engine powered by Haskell<p>This post discusses a Haskell-based project that is the central component of my thesis: a fast, atomic-level structural search engine named "Suns". I will discuss what problem this search engine solves, why I chose Haskell for this project, and what were the pros and cons of my language choice.</p><br /><h4 id="the-challenge">The Challenge</h4><p>Proteins are amazing molecules that control biology on the molecular level. 
If you are new to cellular biology, you can get an immediate sense for how diverse and sophisticated they are by watching <a href="http://www.xvivo.net/the-inner-life-of-the-cell/">The Inner Life of the Cell</a>, an artistic rendition of several diverse processes controlled by proteins, including:</p><ul class="incremental"><li>cellular adhesion (how cells recognize and stick to each other)</li><li>forming a skeleton for the cell</li><li>assembling and disassembling cellular highways</li><li>trafficking cargo along those highways</li></ul><p>In addition, proteins also:</p><ul class="incremental"><li>sense the molecular environment</li><li>perform chemistry</li><li>are cellular weapons</li><li>and much, much more!</li></ul><p>I am a graduate student in a research lab that designs new proteins for both medical treatments and scientific study. Unfortunately, proteins are significantly harder to design than software. A typical protein can be very chemically complex and tightly integrated:</p> <div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-V1joCreA3Q0/UlnLfO7hx6I/AAAAAAAAAGU/1EQFUhVV-KI/s1600/spaghetti.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-V1joCreA3Q0/UlnLfO7hx6I/AAAAAAAAAGU/1EQFUhVV-KI/s320/spaghetti.png" /></a><p class="caption">A protein structure in atomic detail</p></div> <p>Proteins are tangled ropes of amino acids, so they are the epitome of spaghetti code. However, it's not as bad as it sounds. Within these tangled balls we see several patterns emerge on small, medium, and large scales.</p><p>On the large scale you have "protein domains". I liken these to high-level scripts: self-contained blobs of functionality with little coupling or state.
They are often easy to connect together to generate new functionality:</p> <div class="separator" style="clear: both; text-align: center;"><a href="http://3.bp.blogspot.com/-JXhutleiGDo/UlnLkTcLqdI/AAAAAAAAAGw/f0awGohyp1Q/s1600/domains.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://3.bp.blogspot.com/-JXhutleiGDo/UlnLkTcLqdI/AAAAAAAAAGw/f0awGohyp1Q/s320/domains.png" /></a><p class="caption">Surface representation of two connected protein domains, colored green and cyan</p></div> <p>On the medium scale you have "secondary structure", consisting primarily of alpha helices and beta strands, which form locally repetitive structures. I liken these to C code: low-level, but still reasonably portable. Our lab has historically had success transplanting patterns in secondary structure to new contexts, generating new functionality.</p> <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-9qZ6Vn0JwLU/UlnLkXdbIMI/AAAAAAAAAG0/0_bYG-Bg8Gc/s1600/secondary.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-9qZ6Vn0JwLU/UlnLkXdbIMI/AAAAAAAAAG0/0_bYG-Bg8Gc/s320/secondary.png" /></a><p class="caption">Secondary structure representation of a protein composed primarily of beta strands</p></div> <p>On the small scale you have "motifs", small configurations of a few amino acids that frequently occur in natural proteins. 
I liken these to assembly code: they are very tightly integrated and pack a lot of utility into a small space, but they are not very portable because they have very precise chemical and geometric requirements and little room for customization:</p> <div class="separator" style="clear: both; text-align: center;"><a href="http://4.bp.blogspot.com/-lR-TWhBwTb8/UlnLkIJGGUI/AAAAAAAAAGk/acrA1ihT-kY/s1600/motif.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://4.bp.blogspot.com/-lR-TWhBwTb8/UlnLkIJGGUI/AAAAAAAAAGk/acrA1ihT-kY/s320/motif.png" /></a><p class="caption">An example motif, with a geometrically well-defined hydrogen-bonding pattern</p></div> <p>Our lab needs to design proteins on this small and compact scale, so they use my search engine to discover and incorporate these small motifs into their designed proteins. The search engine overcomes the "portability issues" of these low-level motifs by finding existing motifs that are exactly geometrically and chemically compatible with partial blueprints.</p><h4 id="overview">Overview</h4><p>The most popular interface to the search engine is through <code>PyMOL</code>, a molecular graphics system that lets you visualize protein structures three-dimensionally. You can find a <code>PyMOL</code> search plugin on the official <a href="http://www.degradolab.org/suns/">Suns web site</a>, which also includes an extended tutorial for how to use the search plugin.</p><p>The search plugin lets you point and click on multiple protein fragments and then click "Search". Typically within less than a second it will stream up to 100 matches to the fragment of interest, superimposed on your original search query. 
You can then expand on your original query by incorporating new fragments from the search results:</p> <div class="separator" style="clear: both; text-align: center;"><a href="http://2.bp.blogspot.com/-1GDxpWcRYA4/UlnLj-_ADhI/AAAAAAAAAGg/7hayJgEwQGo/s1600/Searching.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="http://2.bp.blogspot.com/-1GDxpWcRYA4/UlnLj-_ADhI/AAAAAAAAAGg/7hayJgEwQGo/s320/Searching.png" /></a><p class="caption">An example iterative design process that begins from a guanidinium group and grows to an N-terminal helix-capping motif. Black dashes highlight fragments incorporated from search results.</p></div> <p>The <code>PyMOL</code> plugin is a thin Python-based client that connects to a remote search backend written in Haskell.</p><br /><h4 id="why-haskell">Why Haskell?</h4><p>I initially did not begin this project in Haskell. I actually began it in C. The reason was that at the time I did not have a solid background in data structures and algorithms, so I would reflexively turn to C for high-performance projects to try to make up for it by improving the constant factors of my code. However, all-atom searching is the kind of problem where the algorithm matters and the bare-metal performance does not. My initial implementation in C attempted to brute force the solution to no avail, taking hours to perform a single search.</p><p>The second problem that I encountered was that when I got the first slow draft working I immediately got additional feature requests and I was not prepared to handle them. C is a very brittle language to program in because of manual memory management and no standard library of data structures. However, like many inexperienced programmers I had trouble saying "no" and bit off more than I could chew trying (and failing) to satisfy these feature requests.</p><p>At that time I had been dabbling in Haskell in my free time and took interest in the language.
Things that intrigued me about the language included:</p><ul class="incremental"><li>expressive and strong types</li><li>an elegant mix of imperative and functional style (using monads)</li><li>high-level idioms and design patterns</li></ul><p>Plus, Haskell is garbage collected and has a rich library of data structures, which solved my two main qualms with C, so I decided to rewrite my code base in Haskell.</p><p>This rewrite was not a simple translation of the original C code. One of the main reasons that I transitioned was that I wanted to familiarize myself with improved data structures and algorithms. Haskell made it very easy to prototype new code for several reasons:</p><ul class="incremental"><li>You have a rich set of high-performance purely functional data structures in the <code>containers</code>, <code>unordered-containers</code>, and <code>vector</code> libraries</li><li>Garbage collection, type inference, and type classes automate common menial tasks</li><li>Strong static types make it a breeze to refactor code without having to write tests</li></ul><p>Consequently, Haskell gave me much greater confidence to rewrite large swaths of code because the compiler caught more of my mistakes and automated more details. This encouraged me to try out new ideas and rapidly gravitate towards a good solution instead of getting complacent with my first working solution. These changes improved performance and now queries took milliseconds instead of hours. At this point, <code>PyMOL</code> became the bottleneck and could not load search results fast enough, so there was no point improving performance further.</p><p>Another major advantage of Haskell for my project is parsing. Haskell libraries excel at parsing by necessity: there is a huge pressure to convert untyped textual data to more strongly typed data structures in order to take advantage of the type system. 
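</p><p>As a toy illustration of this pressure toward typed data (using only <code>base</code>, not the parsing libraries this project actually relies on; <code>Point</code> and <code>parsePoint</code> are invented for this example):</p><pre><code>import Control.Applicative ((<$>), (<*>))
import Text.Read (readMaybe)

-- A typed target for untyped text
data Point = Point { x :: Double, y :: Double } deriving (Show)

-- Convert "1.5 2.5" into a `Point`, rejecting malformed input
parsePoint :: String -> Maybe Point
parsePoint s = case words s of
    [sx, sy] -> Point <$> readMaybe sx <*> readMaybe sy
    _        -> Nothing

main :: IO ()
main = do
    print (parsePoint "1.5 2.5")  -- Just (Point {x = 1.5, y = 2.5})
    print (parsePoint "1.5 oops") -- Nothing</code></pre><p>Once the data is typed, the compiler tracks it everywhere downstream, which is exactly the advantage that real parsing libraries magnify.</p><p>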
I want to give a special shoutout to <code>attoparsec</code> and <code>aeson</code>, which are fantastic for this purpose.</p><p>Another big benefit of programming in Haskell was that my code was remarkably stable and devoid of bugs. This meant that I spent less time maintaining the code and more time adding features.</p><br /><h4 id="problems-with-haskell">Problems with Haskell</h4><p>The first issue I encountered is that if you program in Haskell you need to be prepared to fill in the missing gaps in the library ecosystem. I began at a time when the Haskell ecosystem was beginning to mature, but there were still several important gaps. The foremost of these was a stream programming library, which I needed to write myself because I was unsatisfied with existing iteratee-based solutions, and this eventually evolved into my <code>pipes</code> stream programming library. If you are considering starting a large project in Haskell you need to be prepared to do some library work of your own.</p><p>Another issue was numerical code. This is limited almost exclusively to <code>hmatrix</code> (Haskell bindings to the GNU scientific library) and <code>repa</code> (a fast and native Haskell library, but without a standard library of numerical algorithms). <code>hmatrix</code> was problematic because it is GPLv3 licensed, and my university, UCSF, forbids releasing code with a GPLv3 license because of the patent clause. This meant that I was forced to write my own LAPACK bindings for performance-critical code. Haskell's foreign-function interface (FFI) is nice, but not having to use it at all is even nicer.</p><p>Next, space leaks cause problems. Space leaks are like the Haskell analog to C's segmentation faults: they require programmer discipline to prevent and they are not easy to debug. 
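</p><p>To make the space leak problem concrete, here is the classic textbook instance (not code from this project): a lazy left fold accumulates a chain of suspended thunks, while the strict variants avoid this:</p><pre><code>import Data.List (foldl')

-- Lazy accumulator: `foldl` builds one million suspended (+) thunks
-- before anything is forced, which can exhaust the heap on large inputs
leaky :: Int
leaky = foldl (+) 0 [1 .. 1000000]

-- Strict accumulator: `foldl'` forces the running sum at every step,
-- so it runs in constant space
fine :: Int
fine = foldl' (+) 0 [1 .. 1000000]

-- Strictness annotations on fields prevent thunks from hiding
-- inside data structures
data Accum = Accum !Int !Int

main :: IO ()
main = print (leaky, fine) -- both evaluate to 500000500000</code></pre><p>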
With experience I learned how to avoid them, mainly by:</p><ul class="incremental"><li>Defining data types with strict fields</li><li>Ensuring all folds are strict in the accumulator</li></ul><p>... but this is still a problem because it bites beginners to the language really early. Also, space leaks are the biggest source of program instability, which is a shame because it compromises what would be an otherwise rock-solid stability story for the language.</p><p>The next problem is the lack of standardized error-handling idioms. I ended up writing my own library for this, too, called the <code>errors</code> library, but this did not completely solve the problem either. I'm currently collaborating with John Wiegley and Michael Snoyman (of <a href="http://www.yesodweb.com/">Yesod</a> fame) to try to solve this problem.</p><p>Another problem is the pre-existing ecosystem of functionality dependent on lazy <code>IO</code>. I'm not going to name any names, but one library that I depend on uses lazy <code>IO</code> internally and it is still a source of bugs. Lazy <code>IO</code> causes this sort of problem because it does not guarantee ordering of effects and also results in synchronous exceptions being thrown asynchronously in pure code (which is a big no-no in the Haskell world).</p><br /><h4 id="conclusions">Conclusions</h4><p>The code for this project is open source and hosted on GitHub.
It consists of three separate projects:</p><ul class="incremental"><li><a href="https://github.com/Gabriel439/suns-search">The search engine</a></li><li><a href="https://github.com/godotgildor/Suns">The PyMOL search plugin</a></li><li><a href="https://github.com/Gabriel439/suns-cmd">The command line search tool</a></li></ul><p>You can also learn more about the project on the <a href="http://www.degradolab.org/suns/">official Suns website</a>.</p><p>People interested in learning Haskell may find the code for the search engine useful to see what a medium-sized Haskell application looks like. Haskell gets a bad reputation for being overly academic and I hope people can see past that to discover an excellent programming language for rapid development with high reliability.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com11tag:blogger.com,1999:blog-1777990983847811806.post-48247841373377889452013-10-09T16:56:00.001-07:002013-10-23T16:41:50.703-07:00How to reimplement the conduit parsing API in 50 lines of pipes code<p>Michael's recent blog posts highlighted several deficiencies of <code>pipes</code>-based parsing. Particularly, he emphasized that it was incompatible with idioms from the core <code>pipes</code> library, and I agree with that assessment. Programming with <code>pipes-parse</code> is a different beast from programming with vanilla <code>pipes</code>, and <code>pipes-parse</code> idioms more closely resemble <code>conduit</code> idioms.</p><p>Several comments in response to Michael's post asked if one could internally implement <code>conduit</code> on top of <code>pipes</code>, in order to simplify <code>conduit</code> internals.
This post answers half that question by showing how to implement conduit sans finalization on top of <code>pipes</code> using the tools from <code>pipes-parse</code>.</p><p>This code is short, but very dense, so I will walk through the implementation step-by-step, explaining the underlying <code>pipes-parse</code> idioms that I'm using to reconstruct conduit functionality. If you just want to skip to the complete code then go straight to the Appendix at the end of this post.</p><br /><h4>The Conduit Type</h4><p>The way you internally represent a conduit-like parser using <code>pipes</code> is the following data type:</p><pre><code>import Pipes<br />import Control.Monad.Trans.State.Strict<br /><br />newtype ConduitM i o m r = ConduitM<br /> { unConduitM ::<br /> forall x . Producer o (StateT (Producer i m x) m) r }</code></pre><p>To recap, a <code>ConduitM i o m r</code> has an input of type <code>i</code> and an output of type <code>o</code>, and the output is distinct from the return value, just like <code>pipes</code>.</p><p>I model this as a <code>Producer</code> of <code>o</code>s that reads from and writes to a <code>Producer</code> of <code>i</code>s. The <code>Producer</code> of <code>i</code>s is our conduit's upstream input end. <code>await</code>ing a value will pop an element off of this <code>Producer</code>, and adding back a <code>leftover</code> pushes an element back onto this <code>Producer</code>.</p><p>This representation differs from conduit's implementation in one important way: it makes no distinction between leftovers and future input. Both are stored together within the inner <code>Producer</code>. This is one neat trick of reifying future input as a <code>Producer</code>: you now have an obvious place to store leftovers.</p><br/><h4>Primitives</h4><p>The next step is to implement <code>await</code>, which is just a thin wrapper around <code>draw</code> from <code>pipes-parse</code>:</p><pre><code>type Consumer i m r = forall o . 
ConduitM i o m r<br /><br />await :: (Monad m) => Consumer i m (Maybe i)<br />await = ConduitM $ lift $ liftM f draw<br /> where<br /> f (Left _) = Nothing<br /> f (Right r) = Just r</code></pre><p>However, this doesn't explain what <code>draw</code> is actually doing, so let's refer to its implementation:</p><pre><code>draw = do<br /> p <- get -- Retrieve our source of input<br /> x <- lift (next p) -- Attempt to pop one element off<br /> case x of<br /> Right (a, p') -> do -- Success: bind the element<br /> put p' -- Save the remainder<br /> return (Right a)<br /> Left r -> do -- Failure: No more input<br /> put (return r)<br /> return (Left r)</code></pre><p>If you are more comfortable with <code>StateT</code> you might prefer the following less verbose form which inlines all the state passing:</p><pre><code>draw = StateT $ \p -> do<br /> x <- next p<br /> return $ case x of<br /> Right (a, p') -> (Right a, p' )<br /> Left r -> (Left r, return r)</code></pre><p>If you think of a <code>Producer a m ()</code> as isomorphic to <code>ListT m a</code>, then <code>next</code> is the equivalent of <code>uncons</code> for <code>ListT</code>.</p><p>Similarly, we can add elements back to the producer, using <code>leftover</code>, which is just a thin wrapper around <code>unDraw</code> from <code>pipes-parse</code>:</p><pre><code>leftover :: (Monad m) => i -> ConduitM i o m ()<br />leftover i = ConduitM $ lift $ unDraw i</code></pre><p><code>unDraw</code> has a simple implementation:</p><pre><code>unDraw a = modify (Pipes.yield a >>)</code></pre><p>It just prepends a <code>yield</code> statement onto the <code>Producer</code>. This is the equivalent of <code>cons</code> for <code>ListT</code>.</p><p>What about <code>yield</code>? 
Well, that's exactly the same:</p><pre><code>yield :: (Monad m) => o -> ConduitM i o m ()<br />yield o = ConduitM (Pipes.yield o)</code></pre><br /><h4 id="composition">Composition</h4><p>Now for the interesting part: conduit composition, which has the following type:</p><pre><code>(=$=) :: (Monad m)<br /> => ConduitM a b m ()<br /> -> ConduitM b c m r<br /> -> ConduitM a c m r</code></pre><p>If we were to replace these <code>ConduitM</code>s with the underlying pipe type, we would get:</p><pre><code>(=$=) :: (Monad m)<br /> => forall x . Producer b (StateT (Producer a m x) m) ()<br /> -> forall y . Producer c (StateT (Producer b m y) m) r<br /> -> forall z . Producer c (StateT (Producer a m z) m) r</code></pre><p>How do we even begin to approach that?</p><p>The key is the <code>runStateP</code> function from <code>Pipes.Lift</code>, which has the following (simplified) type:</p><pre><code>runStateP<br /> :: s -> Producer a (StateT s m) r -> Producer a m (r, s)</code></pre><p>Compare this with the type for <code>runStateT</code>:</p><pre><code>runStateT :: StateT s m r -> s -> m (r, s)</code></pre><p><code>runStateP</code> differs from <code>runStateT</code> in two ways:</p><ul class="incremental"><li><p><code>runStateP</code> unwraps a <code>StateT</code> buried inside of a pipe</p></li><li><p><code>runStateP</code> takes arguments in the opposite order from <code>runStateT</code></p></li></ul><p><code>runStateP</code> takes care to thread state as it wraps the internal <code>StateT</code> so it behaves just like <code>runStateT</code>. Once you familiarize yourself with how <code>runStateP</code> works, the solution is a matter of type-chasing. In fact, what you will discover is that if you restrict yourself to <code>runStateP</code>, there is only one solution that type-checks.</p><p>We begin with two arguments to our operator:</p><pre><code>ConduitM pL =$= ConduitM pR = ConduitM $ ...</code></pre><p><code>pL</code> has type:</p><pre><code>pL :: forall x . 
Producer b (StateT (Producer a m x) m) ()</code></pre><p>Let's look at what type we get when we unwrap <code>pL</code> using <code>runStateP</code>:</p><pre><code>parseL<br /> :: (Monad m)<br /> => Producer a m x -> Producer b m ((), Producer a m x)<br />parseL as = runStateP as pL</code></pre><p>This now looks just like a parser combinator. It takes an input stream of values of type <code>a</code> and generates an output stream of values of type <code>b</code>, returning unused input alongside the <code>()</code> return value. We're not interested in this <code>()</code> return value, so we'll use <code>execStateP</code> instead:</p><pre><code>parseL<br /> :: (Monad m)<br /> => Producer a m x -> Producer b m (Producer a m x)<br />parseL as = execStateP as pL</code></pre><p>Similarly, we'll convert <code>pR</code> to a parser:</p><pre><code>parseR<br /> :: (Monad m)<br /> => Producer b m y -> Producer c m (r, Producer b m y)<br />parseR bs = runStateP bs pR</code></pre><p>Now what's our goal? We're trying to build a <code>ConduitM a c m r</code>, which is equivalent to the following parser:</p><pre><code>parse<br /> :: (Monad m)<br /> => Producer a m z -> Producer c m (r, Producer a m z)</code></pre><p>This means that we need to introduce a stream of <code>as</code>:</p><pre><code>parse as = do<br /> -- as :: Producer a m x</code></pre><p>We can now feed that stream to <code>parseL</code></p><pre><code>parse as = do<br /> -- as :: Producer a m x<br /> -- parseL as :: Producer b m (Producer a m x)</code></pre><p>We can then feed <em>that</em> to <code>parseR</code>. This works because <code>parseR</code> is universally quantified in <code>y</code>, which type-checks as <code>Producer a m x</code>:</p><pre><code>parse as = do<br /> -- as :: Producer a m x<br /> -- parseL as<br /> -- :: Producer b m (Producer a m x)<br /> -- parseR (parseL as)<br /> -- :: Producer c m (r, Producer b m (Producer a m x))</code></pre><p>This is almost what we want. 
We just need to discard the unused stream of <code>b</code>s:</p><pre><code>parse as = do<br /> (r, pp) <- parseR (parseL as)<br /> p' <- lift $ drain pp<br /> return (r, p')<br /> where<br /> drain p = runEffect (for p discard)</code></pre><p>If we inline all of that logic, we get the following 5-line implementation of <code>conduit</code> composition:</p><pre><code>ConduitM pL =$= ConduitM pR = ConduitM $<br /> stateP $ \as -> do<br /> (r, pp) <- runStateP (execStateP as pL) pR<br /> p' <- lift $ drain pp<br /> return (r, p')</code></pre><p>This gives a bird's-eye view of how <code>conduit</code> composition works. When we compose two conduits, we:</p><ul class="incremental"><li>Feed the input stream of <code>as</code> to the upstream conduit</li><li>Feed that to the downstream conduit</li><li>Discard all leftovers from their intermediate interface</li></ul><p>Once we have this composition operator, the right and left fusion operators are just type specializations (the same as in <code>conduit</code>):</p><pre><code>type Conduit i m o = ConduitM i o m ()<br />type Source m o = ConduitM () o m ()<br />type Sink i m r = ConduitM i Void m r<br /><br />($=) :: (Monad m) => Source m a -> Conduit a m b -> Source m b<br />($=) = (=$=)<br /><br />(=$) :: (Monad m) => Conduit a m b -> Sink b m c -> Sink a m c<br />(=$) = (=$=)</code></pre><p>What about <code>($$)</code>? 
That is even simpler:</p><pre><code>empty :: (Monad m) => Producer () m r<br />empty = return () >~ cat -- equivalent to "forever $ yield ()"<br /><br />($$) :: (Monad m) => Source m a -> Sink a m b -> m b<br />ConduitM pL $$ ConduitM pR =<br /> evalStateT (runEffect pR) (evalStateP empty pL)</code></pre><p>This implementation says at a high level:</p><ul class="incremental"><li>Feed an unused leftover stream to <code>pL</code> (unused because it's a <code>Source</code>)</li><li>Feed that to <code>pR</code></li><li>There is no step 3</li></ul><br /><h4 id="identity">Identity</h4><p>If that is composition, what is the identity? Why, it's just <code>input</code> from <code>pipes-parse</code>:</p><pre><code>idP :: (Monad m) => ConduitM a a m ()<br />idP = ConduitM (void input)</code></pre><p>Neat how that works out. This is equivalent in behavior to:</p><pre><code>idP = do<br /> ma <- await<br /> case ma of<br /> Nothing -> return ()<br /> Just a -> do<br /> yield a<br /> idP</code></pre><br /><h4 id="connect-and-resume">Connect and Resume</h4><p>Last but not least we need connect and resume. 
Like I said before, this will ignore finalization concerns, so I will only implement a variation on <code>($$+)</code> that returns a new <code>Source</code>, rather than a <code>ResumableSource</code> (which is just a <code>Source</code> tagged with a finalizer).</p><pre><code>($$+)<br /> :: (Monad m)<br /> => Source m a -> Sink a m b -> m (b, Source m a)<br />ConduitM pL $$+ ConduitM pR = do<br /> (b, as) <- runEffect $ runStateP (execStateP empty pL) pR<br /> let as' = ConduitM $ stateP $ \p -> ((), p) <$ as<br /> return (b, as')</code></pre><p>This says:</p><ul class="incremental"><li>Feed an unused input stream to <code>pL</code> (it's a <code>Source</code>)</li><li>Feed that to <code>pR</code></li><li>Discard <code>pR</code>'s nonexistent output (it's a <code>Sink</code>)</li><li>Create a new <code>Source</code> that also ignores its input stream</li></ul><br /><h4 id="conclusion">Conclusion</h4><p>The purpose of this post is not to suggest that Michael necessarily should implement <code>conduit</code> in terms of <code>pipes</code>, especially since this does not contain finalization code, yet. 
Rather, I wanted to exhibit that <code>pipes</code> is a powerful tool that you can use to build other abstractions concisely and with less room for error.</p><h4 id="appendix">Appendix</h4><p>The minimal test implementation is 50 lines of code, which I've included here:</p><pre><code>{-# LANGUAGE RankNTypes #-}<br /><br />import Control.Applicative ((<$))<br />import Control.Monad (liftM, void)<br />import Pipes hiding (await, yield, Consumer)<br />import qualified Pipes<br />import Pipes.Lift<br />import Pipes.Parse<br /><br />newtype ConduitM i o m r = ConduitM<br /> { unConduitM :: forall x .<br /> Producer o (StateT (Producer i m x) m) r }<br /><br />instance (Monad m) => Monad (ConduitM i o m) where<br /> return r = ConduitM (return r)<br /> m >>= f = ConduitM $ unConduitM m >>= \r -> unConduitM (f r)<br /><br />instance MonadTrans (ConduitM i o) where<br /> lift m = ConduitM (lift (lift m))<br /><br />type Consumer i m r = forall o . ConduitM i o m r<br />type Source m o = ConduitM () o m ()<br />type Sink i m r = ConduitM i Void m r<br />type Conduit i m o = ConduitM i o m ()<br /><br />await :: (Monad m) => Consumer i m (Maybe i)<br />await = ConduitM $ lift $ liftM f draw<br /> where<br /> f (Left _) = Nothing<br /> f (Right r) = Just r<br /><br />yield :: (Monad m) => o -> ConduitM i o m ()<br />yield o = ConduitM (Pipes.yield o)<br /><br />leftover :: (Monad m) => i -> ConduitM i o m ()<br />leftover i = ConduitM $ lift $ unDraw i<br /><br />(=$=)<br /> :: (Monad m)<br /> => Conduit a m b<br /> -> ConduitM b c m r<br /> -> ConduitM a c m r<br />ConduitM pL =$= ConduitM pR = ConduitM $<br /> stateP $ \p -> do<br /> (r, pp) <- runStateP (execStateP p pL) pR<br /> p' <- lift $ drain pp<br /> return (r, p')<br /><br />drain :: (Monad m) => Producer a m r -> m r<br />drain p = runEffect (for p discard)<br /><br />($=) :: (Monad m) => Source m a -> Conduit a m b -> Source m b<br />($=) = (=$=)<br /><br />(=$) :: (Monad m) => Conduit a m b -> Sink b m c -> 
Sink a m c<br />(=$) = (=$=)<br /><br />empty :: (Monad m) => Producer () m r<br />empty = return () >~ cat<br /><br />($$) :: (Monad m) => Source m a -> Sink a m b -> m b<br />ConduitM pL $$ ConduitM pR =<br /> evalStateT (runEffect pR) (evalStateP empty pL)<br /><br />idP :: (Monad m) => ConduitM a a m ()<br />idP = ConduitM (void input)<br /><br />($$+)<br /> :: (Monad m)<br /> => Source m a -> Sink a m b -> m (b, Source m a)<br />ConduitM pL $$+ ConduitM pR = do<br /> (b, pa) <- runEffect $ runStateP (execStateP empty pL) pR<br /> let p' = ConduitM $ stateP $ \p -> ((), p) <$ pa<br /> return (b, p')</code></pre>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com2tag:blogger.com,1999:blog-1777990983847811806.post-47276955599604715482013-10-06T17:12:00.002-07:002013-10-06T17:12:40.451-07:00Manual proofs for the `pipes` lawsOut of all of Haskell's streaming libraries, <code>pipes</code> is the only one that does not have a test suite. This is because I prefer to use equational reasoning to prove the correctness of <code>pipes</code>. For those new to Haskell, equational reasoning is the use of equations to prove code is correct without ever running it. Haskell makes this possible by strictly enforcing purity.<br /><br />Equational reasoning works well for a library like <code>pipes</code> because most properties of interest can easily be specified as category theory laws.
If you are unfamiliar with category theory, you can read <a href="http://www.haskellforall.com/2012/08/the-category-design-pattern.html">this short primer</a> to learn how it relates to programming.<br /><br />Every primitive in the <code>pipes</code> library is related to a category in some way: <ul><li> <code>yield</code> and <code>(~>)</code> are the identity and composition operator of a category <li> <code>await</code> and <code>(>~)</code> are the identity and composition operator of a category <li> <code>cat</code> and <code>(>->)</code> are the identity and composition operator of a category </ul>This means that we can state the expected behavior for all six primitives just by stating their category laws: <pre><br />-- Left Identity<br />yield ~> f = f<br /><br />-- Right Identity<br />f ~> yield = f<br /><br />-- Associativity<br />(f ~> g) ~> h = f ~> (g ~> h)<br /><br /><br />-- Left Identity<br />await >~ f = f<br /><br />-- Right Identity<br />f >~ await = f<br /><br />-- Associativity<br />(f >~ g) >~ h = f >~ (g >~ h)<br /><br /><br />-- Left Identity<br />cat >-> f = f<br /><br />-- Right Identity<br />f >-> cat = f<br /><br />-- Associativity<br />(f >-> g) >-> h = f >-> (g >-> h)<br /></pre>Category theory laws are like tests: they are properties that we expect our code to obey if we wrote it correctly. These laws have the nice property that they uncannily summarize many useful properties we would like to know in a remarkably small set of equations.
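<br /><br />Since both sides of each law are runnable programs, we can also spot-check a law on a concrete input before proving it in general. The following sketch is not from the original post: it assumes the <code>pipes-4.0</code> API and picks arbitrary example pipes to exercise the associativity law for <code>(>->)</code>:

```haskell
-- Sanity check (not a proof): evaluate both sides of
--     (f >-> g) >-> h = f >-> (g >-> h)
-- on the same input stream and compare the results.
import Data.Functor.Identity (runIdentity)
import Pipes
import qualified Pipes.Prelude as P

main :: IO ()
main = do
    let f = P.map (+ 1)   -- example pipes, chosen arbitrarily
        g = P.take 3
        h = P.map show
        -- Drain a pipe applied to an infinite input stream
        run p = runIdentity $ P.toListM (each [(1 :: Int) ..] >-> p)
    print (run ((f >-> g) >-> h))                            -- ["2","3","4"]
    print (run ((f >-> g) >-> h) == run (f >-> (g >-> h)))   -- True
```

Both sides produce the same list, exactly as the associativity law demands, even though the input stream is infinite.<br /><br />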
For example, <code>(~>)</code> is defined in terms of <code>for</code> like this: <pre><br />(f ~> g) x = for (f x) g<br /></pre>When you restate the category laws for <code>yield</code> and <code>(~>)</code> in terms of <code>for</code>, you get the "for loop laws": <pre><br />-- Looping over a single yield simplifies to function<br />-- application<br /><br />for (yield x) $ \e -> do<br /> f e<br /><br />= f x<br /><br /><br />-- Re-yielding every element of a generator returns the<br />-- original generator<br /><br />for gen $ \e -> do<br /> yield e<br /><br />= gen<br /><br /><br />-- Nested for loops can become sequential for loops if the<br />-- inner loop body ignores the outer loop variable<br /><br />for gen0 $ \i -> do<br /> for (f i) $ \j -> do<br /> g j<br /><br />= for gen1 $ \j -> do<br /> g j<br /> where<br /> gen1 = for gen0 $ \i -> do<br /> f i<br /></pre>These are common-sense laws that we would expect any language with generators and for loops to obey. Amazingly, the for loop laws emerge naturally from category laws, almost as if they were handed down to us on stone tablets.<br /><br />For a while I've been writing these proofs out by hand informally in notebooks, but now that I'm writing up my thesis I needed to organize my notes and fill in all the little steps that I would previously skip over. You can find the full set of my notes as an organized markdown file that I've committed to the <code>pipes</code> repository that contains all my manual <a href="https://github.com/Gabriel439/Haskell-Pipes-Library/blob/master/laws.md">proofs of the pipe laws</a>.<br /><br /><br /><h4>Caveats</h4><br />There are several caveats that I need to state about these proofs:<br /><br />First, there is the obvious caveat that these proofs are still not machine-checked, but that's the next step. <a href="http://paolocapriotti.com/">Paolo Capriotti</a> has been patiently teaching me how to encode my reasoning into Agda.
However, I believe this should be possible and that the proofs are very close to sound, taking into account the caveats below.<br /><br />Second, these proofs do not account for bottom (i.e. <code>_|_</code>). My intuition is that the <code>pipes</code> laws do hold even in the presence of bottom, but I am still new to reasoning about bottom, so I omitted that for now. If anybody can point me to a good reference on this I would be grateful.<br /><br />Third, these proofs use non-standard coinductive reasoning. To explain why, it's important to note that typical coinductive reasoning in Haskell requires the result to be productive and to guard coinduction behind the production of a constructor. However, not all pipelines are productive, as the following examples illustrate: <pre><br />infinite = forever $ yield ()<br /><br />infinite >-> forever await = _|_<br /><br />for infinite discard = _|_<br /></pre>So instead of using productivity to impose a total ordering on coinductive reasoning, I use "consuming" to order my coinductive proofs, meaning that all uses of coinduction are guarded behind the consumption of a constructor. In other words, my proofs do not guarantee that pipe composition makes progress in production, but they do guarantee that pipe composition makes progress in consumption.<br /><br />Fourth, another departure from typical coinductive proofs is notation. Coinductive reasoning uses bisimilarity to prove "equality". I also use bisimilarity, but notationally I just write it down as a single chain of equations that is bisected by a reference to coinduction. The two halves of the equation chain, above and below the bisection, constitute the two bisimilar halves.<br /><br />Fifth, proofs notationally discharge other proofs using headers as the namespace.
If you see a reference like <code>[Kleisli Category - Left Identity Law - Pointful]</code>, that means you should look under top-level header <code>Kleisli Category</code> for sub-header <code>Left Identity Law</code> and sub-sub-header <code>Pointful</code>.<br /><br />Sixth, I've only proven the laws for the bidirectional API in the <code>Pipes.Core</code> module. However, the category laws for the unidirectional API are very easy to derive from the laws for the bidirectional API, so I leave those as an exercise until I have time to add them myself.<br /><br /><br /><h4>Conclusions</h4><br />My goal is to eventually machine-check these proofs so that <code>pipes</code> will be a practical and instructive example of a library with a statically verified kernel. Moreover, I hope that these proofs will allow other people to equationally reason about even more sophisticated systems built on top of <code>pipes</code> and prove higher-level properties by discharging the proofs that I have assembled for them.Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com8tag:blogger.com,1999:blog-1777990983847811806.post-8151023294993569062013-09-20T07:00:00.000-07:002013-09-20T07:36:42.145-07:00Perfect streaming using `pipes-bytestring`<p><code>pipes-bytestring-1.0.0</code> is complete, providing <code>pipes</code> utilities for reading and writing effectful streams of <code>ByteString</code>s.
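</p><p>As a small taste of the library, the <code>stdin</code> and <code>stdout</code> pipes it exports are already enough to write a <code>cat</code>-style pass-through program. This sketch is mine rather than from the post, and it assumes the <code>pipes</code> and <code>pipes-bytestring</code> packages are installed:</p>

```haskell
-- cat.hs: stream standard input to standard output, one chunk at a
-- time.  A minimal sketch, not from the original post.
import Pipes (runEffect, (>->))
import qualified Pipes.ByteString as PB

main :: IO ()
main = runEffect $ PB.stdin >-> PB.stdout
```

<p>Because each chunk is written out as soon as it is read in, this runs in constant space even on unbounded input such as <code>/dev/zero</code>.</p><p>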
Most people who have been waiting on this library were mainly interested in the following four pipes:</p><pre><code>stdin :: MonadIO m => Producer' ByteString m ()<br /><br />stdout :: MonadIO m => Consumer' ByteString m ()<br /><br />fromHandle :: MonadIO m => Handle -> Producer' ByteString m ()<br /><br />toHandle :: MonadIO m => Handle -> Consumer' ByteString m r</code></pre><p>However, I delayed releasing <code>pipes-bytestring</code> for so long because there was a separate problem I wanted to solve first, which I like to call the <em>Perfect Streaming Problem</em>.</p><h4 id="the-problem">The Problem</h4><p>Here is the statement of the <em>Perfect Streaming Problem</em>:</p><blockquote><p>How do you partition an effectful stream into logical units while strictly enforcing an upper bound on memory use?</p></blockquote><p>This problem will resonate with anybody who has ever built a streaming web service: how do you group your stream of bytes into logical units like files or messages? If you naively load each file or message into memory then a malicious input that is one extremely long file or message will take down your server by exhausting all available memory. Most web services work around this by imposing artificial limits on file or message length, or by writing brittle and non-compositional code or protocols that are difficult to understand, debug, and maintain.</p><p>However, we don't have to invoke web services to illustrate this problem. 
Just consider this simpler problem: how do you implement the Unix <code>head</code> utility in Haskell so that it runs in constant space on any input, including infinite inputs without any line boundaries like <code>/dev/zero</code>:</p><pre><code>$ head < /dev/zero > /dev/null # This must run in constant space</code></pre><p>Let's keep it simple and assume that our <code>head</code> utility only needs to forward exactly 10 lines from <code>stdin</code> to <code>stdout</code>.</p><p>Before <code>pipes-bytestring</code>, the only way to do this was using lazy <code>IO</code>, like this:</p><pre><code>takeLines :: Int -> String -> String<br />takeLines n = unlines . take n . lines<br /><br />main = do<br /> str <- getContents<br /> putStr (takeLines 10 str)</code></pre><p>However, <code>pipes</code> and other streaming libraries were built to replace lazy <code>IO</code> so that Haskell programmers could easily reason about the order of effects and decouple them from Haskell's evaluation order. Yet, this simple <code>head</code> challenge remains a perennial problem for streaming libraries. To illustrate why, let's consider how <code>conduit</code> does this, using <code>Data.Conduit.Text.lines</code>:</p><pre><code>lines :: Monad m => Conduit Text m Text</code></pre><p>This conduit receives a stream of <code>Text</code> chunks of arbitrary size and outputs <code>Text</code> chunks one line long. This sounds reasonable until you feed your program <code>/dev/zero</code>. The <code>lines</code> conduit would then attempt to load an infinitely large chunk of zeroes as the first output and exhaust all memory.</p><p>Michael Snoyman already realized this problem and added the <code>linesBounded</code> conduit as a temporary workaround:</p><pre><code>linesBounded :: MonadThrow m => Int -> Conduit Text m Text</code></pre><p><code>linesBounded</code> throws an exception when given a line that exceeds the specified length.
This is the same workaround that web services use: artificially limit your input stream. I wasn't satisfied with this solution, though. How can I claim that <code>pipes</code> is a mature replacement for lazy <code>IO</code> if lazy <code>IO</code> soundly beats <code>pipes</code> on something as simple as <code>head</code>?</p><h4 id="the-solution">The Solution</h4><p>I will introduce the solution by beginning from the final code:</p><blockquote>Note: There was a last minute bug which I introduced in the <code>lines</code> function before release. Use <code>pipes-bytestring-1.0.1</code>, which fixes this bug, to run this example.</blockquote><pre><code>-- head.hs<br /><br />import Pipes<br />import Pipes.ByteString<br />import Pipes.Parse (takeFree)<br />import Prelude hiding (lines, unlines)<br /><br />takeLines<br /> :: (Monad m)<br /> => Int<br /> -> Producer ByteString m () -> Producer ByteString m ()<br />takeLines n = unlines . takeFree n . lines<br /><br />main = runEffect $ takeLines 10 stdin >-> stdout</code></pre><p>Compile and run this to verify that it takes the first ten lines of input for any file:</p><pre><code>$ ./head < head.hs<br />-- head.hs<br /><br />import Pipes<br />import Pipes.ByteString<br />import Pipes.Parse (takeFree)<br />import Prelude hiding (lines, unlines)<br /><br />takeLines<br /> :: (Monad m)<br /> => Int</code></pre><p>... 
while still handling infinitely long lines in constant space:</p><pre><code>$ ./head < /dev/zero >/dev/null # Runs forever in constant space</code></pre><p>To see why this works, first take a look at the type of <code>Pipes.ByteString.lines</code>:</p><pre><code>lines<br /> :: (Monad m)<br /> => Producer ByteString m ()<br /> -> FreeT (Producer ByteString m) m ()</code></pre><p>Now compare to that the type of <code>Data.ByteString.Lazy.Char8.lines</code>:</p><pre><code>lines :: ByteString -> [ByteString]</code></pre><p><code>pipes</code> treats a <code>Producer</code> of <code>ByteString</code>s as the effectful analog of a lazy <code>ByteString</code>:</p><pre><code>-- '~' means "is analogous to"<br />Producer ByteString m () ~ ByteString</code></pre><p>Similarly, <code>pipes</code> also treats a <code>FreeT</code> of <code>Producer</code>s as the effectful analog of a list of lazy <code>ByteString</code>s:</p><pre><code>FreeT (Producer ByteString m) m () ~ [ByteString]</code></pre><p>You can think of <code>FreeT</code> as a "linked list" of zero or more <code>Producer</code>s, where each <code>Producer</code>'s return value contains either the next <code>Producer</code> or the final return value (<code>()</code> in this case). So if a <code>Producer</code> is analogous to a lazy <code>ByteString</code> then a <code>FreeT</code>-based "linked list" of <code>Producer</code>s is analogous to a true linked list of lazy <code>ByteString</code>s.</p><p>Each layer of our <code>FreeT</code> is a <code>Producer</code> that streams one line's worth of chunks. This differs from a single chunk one line long because it's still in <code>Producer</code> form so we haven't actually read anything from the file yet. Also, <code>FreeT</code> is smart and <em>statically enforces</em> that we cannot read lines from the next <code>Producer</code> (i.e. 
the next line) until we finish the first line.</p><p><code>FreeT</code> has a very important property which other solutions do not have: we can use <code>FreeT</code> to sub-divide the <code>Producer</code> into logical units while still keeping the original chunking scheme. Once we have these nice logical boundaries we can manipulate the <code>FreeT</code> using high-level list-like functions such as <code>Pipes.Parse.takeFree</code>:</p><pre><code>takeFree<br /> :: (Functor f, Monad m)<br /> => Int -> FreeT f m () -> FreeT f m ()</code></pre><p><code>takeFree</code> is the <code>FreeT</code> analog of <code>Prelude.take</code>. We keep the first <code>n</code> <code>f</code> layers of the <code>FreeT</code> and discard the rest. In this case our <code>f</code> is <code>(Producer ByteString m)</code> so if each <code>Producer</code> represents one line then we can take a fixed number of lines.</p><p>This works because <code>FreeT (Producer ByteString m) m ()</code> is just an ordinary Haskell data type. This data type doesn't contain any actual chunks. Instead, it is just a description of how we might traverse our input stream. When we call <code>takeFree</code> on it we are just saying: "Now that I think about it, I don't intend to traverse as much of the input as I had originally planned".</p><p><code>unlines</code> completes the analogy, collapsing our logical units back down into an unannotated stream of bytes:</p><pre><code>unlines<br /> :: (Monad m)<br /> => FreeT (Producer ByteString m) m r<br /> -> Producer ByteString m r</code></pre><p>We can then combine <code>lines</code>, <code>takeFree</code> and <code>unlines</code> to mirror the lazy <code>IO</code> approach:</p><pre><code>-- Lazy IO:<br />takeLines n = unlines . take n . lines<br /><br />-- pipes:<br />takeLines n = unlines . takeFree n . 
lines</code></pre><p>The main difference is that the intermediate types are larger because we're moving more information <strong>into</strong> the type system and, more importantly, moving that same information <strong>out</strong> of implicit magic.</p><h4 id="conclusions">Conclusions</h4><p>The next <code>pipes</code> library in development is <code>pipes-text</code>. The main development bottleneck (besides my own graduation) is that the <code>text</code> library does not export functions to partially decode incomplete <code>ByteString</code> fragments as much as possible without throwing errors. If somebody were to write that up it would speed up the release of <code>pipes-text</code> significantly.</p><p>As always, you can follow or contribute to <code>pipes</code> development or just ask questions by joining the <a href="https://groups.google.com/forum/?fromgroups#!forum/haskell-pipes">haskell-pipes mailing list</a>.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com10tag:blogger.com,1999:blog-1777990983847811806.post-84786642090237829272013-09-07T20:46:00.000-07:002013-09-07T20:47:58.430-07:00pipes-4.0: Simpler types and API<p>I'm announcing <code>pipes-4.0</code>, which greatly simplifies the types and API of the <code>pipes</code> ecosystem. For people new to <code>pipes</code>, <code>pipes</code> is a compositional streaming library that decouples streaming programs into simpler, reusable components.</p><p>The purpose behind <code>pipes</code> is to simplify effectful stream programming for both library authors and application writers. Library authors can package streaming components into a highly reusable interface, and application writers can easily connect together streaming components to build correct, efficient, and low-memory streaming programs.
For example, the following program connects three reusable components to stream lines from standard input to standard output until the user types <code>"quit"</code>:</p><pre><br />import Pipes<br />import qualified Pipes.Prelude as P<br /><br />main = runEffect $<br /> P.stdinLn >-> P.takeWhile (/= "quit") >-> P.stdoutLn<br /></pre><p><code>pipes</code> distinguishes itself from other stream programming libraries in three main ways:</p><ul class="incremental"><li>Insistence on elegance, symmetry, and theoretical purity</li><li>Careful attention to correctness, documentation, and detail</li><li>Emphasis on a high power-to-weight ratio</li></ul><br /><h4>Important links</h4><ul><li> The <a href="http://hackage.haskell.org/package/pipes">Hackage page</a><li> The <a href="http://hackage.haskell.org/packages/archive/pipes/4.0.0/doc/html/Pipes-Tutorial.html">official tutorial</a><li> The <a href="https://github.com/Gabriel439/Haskell-Pipes-Library">Github repository</a><li> The <a href="http://www.haskell.org/haskellwiki/Pipes">Haskell wiki page</a><li> The <a href="https://groups.google.com/forum/?fromgroups#!forum/haskell-pipes">haskell-pipes mailing list</a></ul><br /><h4>Credits</h4><p>This release was made possible due to the suggestions and contributions of many people and I want to give special mention to several people:</p><ul class="incremental"><li><p><a href="https://github.com/k0001">Renzo Carbonara</a>, who is the largest contributor of downstream libraries, building and maintaining <code>pipes-network</code>, <code>pipes-network-tls</code>, <code>pipes-zlib</code>, <code>pipes-binary</code>, <code>pipes-attoparsec</code> and <code>pipes-aeson</code>. 
He also provided lots of useful feedback on design proposals because of his experience maintaining these libraries.</p></li><li><p><a href="https://github.com/bgamari">Ben Gamari</a>, who contributed <code>pipes-vector</code> and <code>pipes-interleave</code>.</p></li><li><p><a href="http://ocharles.org.uk/">Oliver Charles</a>, who contributed to the design of the new <code>pipes-parse</code> library in the process of developing the <code>pipes-tar</code> library.</p></li><li><p><a href="https://github.com/archblob">Csernik Flaviu Andrei</a>, who contributed the complete benchmark suite for <code>pipes</code>.</p></li><li><p><a href="https://github.com/merijn">Merijn Verstraaten</a>, who contributed the new <code>mtl</code> instances for <code>pipes</code>.</p></li><li><p><a href="https://github.com/errge">Gergely Risko</a>, who fixed a concurrency bug in <code>pipes-concurrency</code>.</p></li><li><p><a href="https://github.com/klao">Mihaly Barasz</a>, who contributed the complete test suite for <code>pipes-concurrency</code>.</p></li><li><p><a href="https://github.com/tonyday567">Tony Day</a>, who helped automate the <code>pipes</code> ecosystem and contributed lots of useful feedback on documentation.</p></li><li><p><a href="https://github.com/Shimuuar">Aleksey Khudyakov</a> first proposed the idea to remove the old proxy transformer system and outsource the same functionality to monad transformers in the base monad. This change alone accounted for an almost 60% reduction in the library size and the greatest simplification of the types.</p></li><li><p><a href="http://blog.johantibell.com/">Johan Tibell</a> proposed the initial idea to provide a simpler unidirectional subset of the API by default.
This removed the warts that bidirectionality introduced.</p></li><li><p><a href="https://github.com/fhaust">Florian Hofmann</a>, whose work on <code>pipes-eep</code> led to the discovery of an <code>Arrow</code> instance for push-based pipes.</p></li><li><p><a href="https://github.com/aristidb">Aristid Breitkreuz</a> and <a href="https://github.com/PierreR">Pierre Radermecker</a>, who both caught important bugs in <code>pipes-parse</code> and <code>pipes-safe</code>.</p></li><li><p>Oliver Batchelor, whose work on integrating <code>pipes</code> with Cloud Haskell improved the design of <code>pipes-safe</code>.</p></li></ul><p>I would also like to thank everybody who provided feedback on the library and its documentation and also contributed code.</p><br /><h4>Change Log</h4><p>People familiar with <code>pipes</code> will notice that the biggest change to the library is the elimination of the proxy transformer system. This was made possible by an insight of Aleksey Khudyakov that the proxy transformers were isomorphic to monad transformers in the base monad if you ignored their ability to be unwrapped before the <code>Proxy</code> layer.
I later discovered how to unwrap these base monad transformers while preserving the <code>Proxy</code> layer, which made possible the complete elimination of the proxy transformer system.</p><p>This had the largest impact on simplifying the API:</p><ul class="incremental"><li><p>The number of exported functions dropped to approximately 1/3 of the original size (from about 300+ to 100+)</p></li><li><p>The number of modules dropped to 1/3 of the original size (from 18 to 6)</p></li><li><p>The <code>p</code> type parameter in type signatures disappeared, along with the <code>Proxy</code> type class, which became the concrete <code>Proxy</code> type (which was the old <code>ProxyFast</code> implementation).</p></li><li><p>No need for <code>runIdentityP</code> any longer</p></li></ul><p>The next most important change was a simplification of the API to a unidirectional subset which is the new default. This fixed several warts of the previous API:</p><ul class="incremental"><li><p>No more gratuitous <code>()</code> parameters</p></li><li><p>The pipe monad and category now overlap</p></li><li><p>Polymorphic type synonyms can now be used to simplify the types</p></li></ul><p>The original bidirectional functionality still remains intact within the <code>Pipes.Core</code> module. The only difference is that it is not exported by default.</p><p>The next important change was the realization that bind in the <code>respond</code> Category (i.e. 
<code>(//>)</code>) was exactly equivalent to a <code>for</code> loop, so the unidirectional API now uses <code>for</code> as a synonym for <code>(//>)</code> and produces really beautiful <code>for</code> loops.</p><p>Other important syntactic changes:</p><ul class="incremental"><li><p>The unidirectional API uses <code>yield</code> instead of <code>respond</code> like it was back in <code>pipes-1.0</code></p></li><li><p>The unidirectional API uses <code>await</code> instead of <code>request</code> like it was back in <code>pipes-1.0</code></p></li><li><p><code>runProxy</code> is now <code>runEffect</code></p></li><li><p><code>(>->)</code> is the unidirectional pull-based composition, instead of bidirectional composition</p></li></ul><p><code>Pipes.Prelude</code> has also changed to remove the suffix from all utilities, but is no longer re-exported from the main <code>Pipes</code> module.</p><p>The downstream libraries have been updated as well to use the <code>pipes-4.0</code> API and several of these now have much simpler APIs, too, particularly <code>pipes-safe</code>. I will discuss these libraries in separate library-specific posts later on.</p><br /><h4>Future Goals</h4><p>This release is intended to be the last major version bump. 
The next development priorities are:</p><ul class="incremental"><li><p>Stabilize the core <code>pipes</code> library</p></li><li><p>Improve distribution by packaging up <code>pipes</code> for several package managers</p></li><li><p>Continue to build out the <code>pipes</code> ecosystem, particularly dedicated <code>Text</code> and <code>ByteString</code> libraries</p></li></ul><p>The long-term goal is to get <code>pipes</code> into the Haskell platform once the API has proven itself stable and the ecosystem matures.</p><p>People interested in learning more about <code>pipes</code> or contributing to development can join the official <a href="https://groups.google.com/forum/#!forum/haskell-pipes">mailing list</a>.</p>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com8tag:blogger.com,1999:blog-1777990983847811806.post-45532281579871252042013-08-10T18:22:00.000-07:002013-08-13T11:27:29.918-07:00foldl-1.0.0: Composable, streaming, and efficient left foldsI'm releasing <a href="http://hackage.haskell.org/package/foldl">the <tt>foldl</tt> library</a>, which builds upon <a href="http://www.haskellforall.com/2013/08/composable-streaming-folds.html">this previous post</a> of mine. This library lets you build and combine multiple folds that traverse streams in constant memory.<br /><br />I will begin with the same example from the previous post: computing an average. The naive version would be written like this: <pre><br />average :: [Int] -> Int<br />average xs = sum xs `div` length xs<br /></pre>This does not work well because it: <ul><li> loads the entire list into memory, and <li> cannot fuse away the intermediate list.
</ul>We can improve on this using <tt>foldl</tt> where we will compute the <tt>average</tt> by combining a <tt>sum</tt> fold with a <tt>length</tt> fold, using <tt>Applicative</tt> style: <pre><br />import Control.Applicative<br />import qualified Control.Foldl as L<br /><br />average :: L.Fold Int Int<br />average = div <$> L.sum <*> L.length<br /></pre>All folds built this way have two nice properties which <tt>foldl</tt> guarantees: <ul><li> They only traverse the list once <li> They will not leak space </ul>We can use <tt>fold</tt> to apply this combined fold to the list: <pre><br />main = print $ L.fold average [0..1000000000]<br /></pre>This gives good efficiency at about 1 nanosecond per list element because it fuses away the intermediate list: <pre><br />500000000<br /><br />real 0m0.956s<br />user 0m0.952s<br />sys 0m0.000s<br /></pre>We can see this if we inspect the core, too, which produces this reasonably tight loop (which I've cleaned up): <pre><br />$wgo =<br /> \ (x :: Int#)<br /> (y :: Int#) -- y is keeping tally of the sum<br /> (z :: Int#) -> -- z is keeping tally of the length<br /> case x of x' {<br /> __DEFAULT -><br /> $wgo<br /> (x' +# 1 )<br /> (y +# x')<br /> (z +# 1 );<br /> 1000000000 -><br /> (# I# (y +# 1000000000),<br /> I# (z +# 1) #)<br /> }<br /></pre>This works because <tt>fold</tt> is implemented in terms of <tt>foldr</tt> to trigger <tt>build/foldr</tt> fusion. However, not all lists will fuse this way. I find, for example, that lists of <tt>Double</tt>s refuse to fuse in this way: <pre><br />import Control.Applicative<br />import qualified Control.Foldl as L<br /><br />-- This gives worse performance: ~40 ns per step<br />average :: L.Fold Double Double<br />average = (/) <$> L.sum <*> L.genericLength<br /><br />main = print $ L.fold average [0..1000000000]<br /></pre>I haven't yet figured out how to trick GHC into fusing away these lists.
If anybody knows how to do this, please contribute a patch.<br /><br />Fortunately, these folds preserve the original accumulator, step function, and extraction function, so you can always unpack the <tt>Fold</tt> type and then implement the list fusion yourself: <pre><br />main = case average of<br /> L.Fold step start done -> do<br /> let go n x =<br /> if (n > 1000000000)<br /> then x<br /> else go (n + 1) $! step x n<br /> print $ done (go 0 start)<br /></pre>So while reliable list fusion is still an unsolved problem, you can still use <tt>foldl</tt> to get reasonable performance and guarantee no space leaks. Also, there is always the option of using <tt>foldl</tt> to do the work of assembling derived strict and streaming folds and then implementing list fusion yourself if you want to squeeze out every last drop of performance.<br /><br />I wrote up this library to dispel the myth that only experts can fold things efficiently in Haskell. <tt>foldl</tt> lets you begin from simple primitive folds that you can fit in your head and then chain them together into complex folds with little effort.
or you can add them up: <pre><br />import Data.List (foldl')<br />import Prelude hiding (sum)<br /><br />-- I'm deviating from the Prelude's sum, which leaks space<br />sum :: (Num a) => [a] -> a<br />sum = foldl' (+) 0<br /></pre>Individually, these two folds run in constant memory when given a lazy list as an argument, never bringing more than one element into memory at a time: <pre><br />>>> genericLength [1..100000000]<br />100000000<br />>>> sum [1..100000000]<br />5000000050000000<br /></pre>However, we get an immediate space leak if we try to combine these two folds to compute an average: <pre><br />>>> let average xs = sum xs / genericLength xs<br />>>> average [1..100000000]<br /><Huge space leak><br /></pre>The original isolated folds streamed in constant memory because Haskell is lazy and does not compute each element of the list until the fold actually requests the element. After the fold traverses each element the garbage collector detects that the element will no longer be used and collects it immediately, preventing any build-up of elements.<br /><br />However, when we combine these two folds naively like we did with <tt>average</tt> then our program leaks space while we compute <tt>sum</tt> and before we get a chance to compute <tt>genericLength</tt>. As <tt>sum</tt> traverses the list, the garbage collector cannot collect any of the elements because we have to hold on to the entire list for the subsequent <tt>genericLength</tt> fold.<br /><br />Unfortunately, the <a href="http://stackoverflow.com/a/1618864/1026598">conventional solution</a> to this is not pretty: <pre><br />mean :: [Double] -> Double<br />mean = go 0 0<br /> where<br /> go s l [] = s / fromIntegral l<br /> go s l (x:xs) = s `seq` l `seq`<br /> go (s+x) (l+1) xs<br /></pre>Here we've sliced open the guts of each fold and combined their individual step functions into a new step function so we can pass over the list just once. We also had to pay a lot of attention to detail regarding strictness. 
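For contrast, here is the same single-pass fold without the <tt>seq</tt> calls (a sketch of my own, not from the original post). The lazy accumulators build up chains of unevaluated thunks proportional to the length of the list, so this version leaks space even though it traverses the list only once: <pre><br />-- Leaks: s and l grow as thunks like ((0 + x1) + x2) + ...<br />-- that are only forced at the very end of the traversal<br />meanLeaky :: [Double] -> Double<br />meanLeaky = go 0 0<br /> where<br /> go s l [] = s / fromIntegral l<br /> go s l (x:xs) = go (s + x) (l + 1) xs<br /></pre>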
This is what newcomers to Haskell complain about when they say you need to be an expert at Haskell to produce highly efficient code.<br /><br /><br /><h4>The <tt>Fold</tt> type </h4><br />Let's fix this by reformulating our original folds to preserve more information so that we can transparently combine multiple folds into a single pass over the list: <pre><br />{-# LANGUAGE ExistentialQuantification #-}<br /><br />import Data.List (foldl')<br />import Data.Monoid<br /><br />data Fold a b = forall w. (Monoid w) => Fold<br /> { tally :: a -> w<br /> , summarize :: w -> b<br /> }<br /><br />fold :: Fold a b -> [a] -> b<br />fold (Fold t c) xs =<br /> c (foldl' mappend mempty (map t xs))<br /></pre>Here I've taken a fold and split it into two parts: <ul><li> <tt>tally</tt>: The step function that we use to accumulate each element of the list <li> <tt>summarize</tt>: The final function we call at the end of the fold to convert our accumulator into the desired result </ul>The <tt>w</tt> type variable represents the internal accumulator that our <tt>Fold</tt> will use as it traverses the list. The <tt>Fold</tt> can use any accumulator of its choice as long as the accumulator is a <tt>Monoid</tt> of some sort. We specify that in the types by existentially quantifying the accumulator using the <tt>ExistentialQuantification</tt> extension.<br /><br />The end user doesn't care what the internal accumulator is, either, because the user only interacts with <tt>Fold</tt>s using the <tt>fold</tt> function. The type system enforces that <tt>fold</tt> (or any other function) cannot use any specific details about a <tt>Fold</tt>'s accumulator other than the fact that the accumulator is a <tt>Monoid</tt>.<br /><br />We'll test out this type by rewriting our original folds using the new <tt>Fold</tt> type: <pre><br />genericLength :: (Num i) => Fold a i<br />genericLength =<br /> Fold (\_ -> Sum 1) (fromIntegral . 
getSum)<br /><br />sum :: (Num a) => Fold a a<br />sum = Fold Sum getSum<br /></pre>Notice how the <tt>Monoid</tt> we choose implicitly encodes how to accumulate the result. <tt>genericLength</tt> counts the number of elements simply by mapping them all to <tt>Sum 1</tt>, and then the <tt>Monoid</tt> instance for <tt>Sum</tt> just adds up all these ones to get the list length. <tt>sum</tt> is even simpler: just wrap each element in <tt>Sum</tt> and the <tt>Monoid</tt> instance for <tt>Sum</tt> adds up every element of the list. When we're done, we unwrap the final result using <tt>getSum</tt>.<br /><br />We can now apply these folds to any list using the <tt>fold</tt> function, which handles all the details of accumulating each element of the list and summarizing the result: <pre><br />>>> fold genericLength [(1::Int)..100000000]<br />100000000<br />>>> fold sum [(1::Int)..100000000]<br />5000000050000000<br /></pre>So far, so good, but how do we combine them into an <tt>average</tt>?<br /><br /><br /><h4>Combining <tt>Fold</tt>s</h4><br /><tt>Fold</tt> has the nice property that it is an <tt>Applicative</tt>, given by the following definition: <pre><br />import Control.Applicative<br />import Data.Strict.Tuple<br /><br />instance (Monoid a, Monoid b) => Monoid (Pair a b) where<br /> mempty = (mempty :!: mempty)<br /> mappend (aL :!: aR) (bL :!: bR) =<br /> (mappend aL bL :!: mappend aR bR)<br /><br />instance Functor (Fold a) where<br /> fmap f (Fold t k) = Fold t (f . k)<br /><br />instance Applicative (Fold a) where<br /> pure a = Fold (\_ -> ()) (\_ -> a)<br /> (Fold tL cL) <*> (Fold tR cR) =<br /> let t x = (tL x :!: tR x)<br /> c (wL :!: wR) = (cL wL) (cR wR)<br /> in Fold t c<br /></pre>Note that this uses strict <tt>Pair</tt>s from <tt>Data.Strict.Tuple</tt> to ensure that the combined <tt>Fold</tt> still automatically runs in constant space. 
You only need to remember that <tt>(x :!: y)</tt> is the strict analog of <tt>(x, y)</tt>.<br /><br />With this <tt>Applicative</tt> instance in hand, we can very easily combine our <tt>sum</tt> and <tt>genericLength</tt> folds into an <tt>average</tt> fold: <pre><br />average :: (Fractional a) => Fold a a<br />average = (/) <$> sum <*> genericLength<br /></pre>This combines the two folds transparently into a single fold that traverses the list just once in constant memory, computing the average of all elements within the list: <pre><br />>>> fold average [1..1000000]<br />500000.5<br /></pre>Now we're programming at a high altitude instead of hand-writing our own accumulators and left folds and praying to the strictness gods.<br /><br />What if we wanted to compute the standard deviation of a list? All we need is one extra primitive fold that computes the sum of squares: <pre><br />sumSq :: (Num a) => Fold a a<br />sumSq = Fold (\x -> Sum (x ^ 2)) getSum<br /></pre>Now we can write a derived fold using <tt>Applicative</tt> style: <pre><br />std :: (Floating a) => Fold a a<br />std = (\ss s len -> sqrt (ss / len - (s / len)^2))<br /> <$> sumSq<br /> <*> sum<br /> <*> genericLength<br /></pre>... which still traverses the list just once: <pre><br />>>> fold std [1..10000000]<br />2886751.345954732<br /></pre>In fact, this is the exact same principle that the <a href="http://en.wikipedia.org/wiki/BIRCH_%28data_clustering%29">BIRCH data clustering algorithm</a> uses for clustering features. You keep a tally of the length, sum, and sum of squares, and you can compute most useful statistics in O(1) time from those three tallies.<br /><br />Similarly, what if we wanted to compute both the <tt>sum</tt> and <tt>product</tt> of a list in a single pass? 
<pre><br />product :: (Num a) => Fold a a<br />product = Fold Product getProduct<br /></pre>Once again, we can just use <tt>Applicative</tt> style: <pre><br />>>> fold ((,) <$> sum <*> product) [1..100]<br />(5050,9332621544394415268169923885626670049071596826438162146859<br />2963895217599993229915608941463976156518286253697920827223758251<br />185210916864000000000000000000000000)<br /></pre><br /><h4>Conclusion</h4><br />Contrary to conventional wisdom, you can program in Haskell at a high level without leaking space. Haskell gives you the tools to abstract away efficient idioms behind a convenient and composable interface, so use them!<br /><br /><br /><h4>Appendix</h4><br />I've included the full code so that people can play with this themselves: <pre><br />{-# LANGUAGE ExistentialQuantification #-}<br /><br />import Control.Applicative<br />import Data.List (foldl')<br />import Data.Monoid<br />import Data.Strict.Tuple<br />import Prelude hiding (sum, length, product)<br /><br />data Fold a b = forall w. (Monoid w) => Fold<br /> { tally :: a -> w<br /> , summarize :: w -> b<br /> }<br /><br />fold :: Fold a b -> [a] -> b<br />fold (Fold t c) xs =<br /> c (foldl' mappend mempty (map t xs))<br /><br />instance (Monoid a, Monoid b) => Monoid (Pair a b) where<br /> mempty = (mempty :!: mempty)<br /> mappend (aL :!: aR) (bL :!: bR) =<br /> (mappend aL bL :!: mappend aR bR)<br /><br />instance Functor (Fold a) where<br /> fmap f (Fold t k) = Fold t (f . k)<br /><br />instance Applicative (Fold a) where<br /> pure a = Fold (\_ -> ()) (\_ -> a)<br /> (Fold tL cL) <*> (Fold tR cR) =<br /> let t x = (tL x :!: tR x)<br /> c (wL :!: wR) = (cL wL) (cR wR)<br /> in Fold t c<br /><br />genericLength :: (Num b) => Fold a b<br />genericLength =<br /> Fold (\_ -> Sum (1::Int)) (fromIntegral . 
getSum)<br /><br />sum :: (Num a) => Fold a a<br />sum = Fold Sum getSum<br /><br />sumSq :: (Num a) => Fold a a<br />sumSq = Fold (\x -> Sum (x ^ 2)) getSum<br /><br />average :: (Fractional a) => Fold a a<br />average = (\s c -> s / c) <$> sum <*> genericLength<br /><br />product :: (Num a) => Fold a a<br />product = Fold Product getProduct<br /><br />std :: (Floating a) => Fold a a<br />std = (\ss s len -> sqrt (ss / len - (s / len)^2))<br /> <$> sumSq<br /> <*> sum<br /> <*> genericLength<br /></pre>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com16tag:blogger.com,1999:blog-1777990983847811806.post-31417561500569148722013-08-02T07:33:00.000-07:002013-08-02T09:19:56.715-07:00Sometimes less is more in language designHaskell programmers commonly say that Haskell code is easy to reason about, but rarely explain why. This stems from one simple guiding principle: Haskell is <i>simple by default</i>.<br /><br />Wait, what? Are we talking about the same language? The one with monads and <a href="http://www.haskell.org/haskellwiki/Zygohistomorphic_prepromorphisms">zygohistomorphic prepromorphisms</a>? Yes, I mean that Haskell.<br /><br />For example, what does this type signature tell us: <pre><br />x :: Bool<br /></pre>This type signature says that <tt>x</tt> is a boolean value ... and that's it! Type signatures are stronger in Haskell than in other languages because they also tell us what values are <b>not</b>: <ul><li> <tt>x</tt> is <b>not</b> a time-varying value <li> <tt>x</tt> is <b>not</b> nullable <li> <tt>x</tt> is <b>not</b> a <tt>String</tt> being coerced to a truthy value <li> <tt>x</tt> does <b>not</b> have any side effects </ul>In other words, complexity is <i>opt-in</i> when you program in Haskell.<br /><br />Imperative languages, object-oriented languages, and most other functional languages begin from a more complex baseline than Haskell. 
They all compete for which language provides the most built-in bells and whistles, because they all begin from the premise that more built-in features is better.<br /><br />However, the problem is that you can't opt out of these features and you can't assume that any library function you call doesn't use all of them. This means that you must either rely on careful documentation like: <ul><li> "This function need not be reentrant. A function that is not required to be reentrant is not required to be thread-safe." <li> "<b>Throws</b>: will not throw" <li> "The array is changed every time the block is called, not after the iteration is over" <li> "If <tt>x</tt> is not a Python <tt>int</tt> object, it has to define an <tt>__index__()</tt> method that returns an integer." <li> "Great care must be exercised if mutable objects are map keys" </ul>... or you must carefully inspect the original source code.<br /><br />Haskell makes a different design tradeoff: you begin from a simpler baseline and explicitly declare what you add. If you want statefulness, you have to declare it in the type: <pre><br />-- Read and write an Int state to compute a String<br />stateful :: State Int String<br /></pre>If you want side effects, you have to declare it in the type: <pre><br />takeMedicine :: IO ()<br />-- Ask your doctor if this medicine is right for you<br /></pre>If you want a value to be nullable, you have to declare it in the type: <pre><br />toBeOrNotToBe :: Maybe Be<br />-- That is the question<br /></pre>Notice that there are some things that Haskell does not reflect in the types, like laziness and unchecked exceptions. Unsurprisingly, these are also two built-in features that people regularly complain about when using Haskell.<br /><br />Technically, all these type signatures are optional because Haskell has type inference. 
If you hate pomp and circumstance and just want to churn out code, then by all means leave them out; the compiler will handle all the types behind the scenes for you.<br /><br />However, you should add explicit type signatures when you share code with other people. These type signatures mentally prepare people who read your code because they place tight bounds on how much context is necessary to understand each function. The less context your code requires, the more easily others can reason about your code in isolation.Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com4tag:blogger.com,1999:blog-1777990983847811806.post-64497889401618947772013-07-13T09:30:00.001-07:002013-07-14T13:04:25.161-07:00Statements vs ExpressionsMany programming languages have separate concepts for statements and expressions. I like to think of the distinction between them as follows: <blockquote><b>Statement:</b> What code <i>does</i></blockquote><blockquote><b>Expression:</b> What code <i>is</i></blockquote>I want to clarify that in this post I will deliberately use the term "statement" very broadly to refer to anything that is not an expression or type declaration. If you disagree with this terminology then I welcome suggestions for an alternative name.<br /><br />The distinction between statements and expressions closely parallels the difference between imperative languages and functional languages: <blockquote><b>Imperative</b>: A language that emphasizes <i>statements</i></blockquote><blockquote><b>Functional</b>: A language that emphasizes <i>expressions</i></blockquote> C lies at one end of the spectrum, relying heavily on statements to accomplish everything. 
The classic example is iterating over an array: <pre><br />#include <stdio.h><br /><br />int main(int argc, char *argv[]) {<br /> int elems[5] = {1, 2, 3, 4, 5}; // Statement<br /><br /> int total = 0;<br /> int i;<br /><br /> // +- Statement<br /> // | +- Statement<br /> // | |<br /> // v v<br /> for (i = 0; i < 5; i++) {<br /> total += elems[i]; // Statement<br /> }<br /> printf("%d\n", total); // Statement<br /><br /> return 0;<br />}<br /></pre>Haskell lies at the exact opposite extreme, using expressions heavily: <pre><br />main = print (sum [1..5]) -- Expression<br /></pre>In fact, Haskell takes this principle to the extreme: everything in Haskell is an expression, and even statements are expressions.<br /><br />For example, the following code might appear to be a traditional imperative-style sequence of statements: <pre><br />main = do<br /> putStrLn "Enter a number:" -- Statement?<br /> str <- getLine -- Statement?<br /> putStrLn ("You entered: " ++ str) -- Statement?<br /></pre>... but <tt>do</tt> notation is merely syntactic sugar for nested applications of <tt>(>>=)</tt>, which is itself nothing more than an infix higher-order function: <pre><br />main =<br /> putStrLn "Enter a number:" >>= (\_ -> -- Expression<br /> getLine >>= (\str -> -- Sub-expression<br /> putStrLn ("You entered: " ++ str) )) -- Sub-expression<br /></pre>In Haskell, "statements" are actually nested expressions, and sequencing statements just builds larger and larger expressions.<br /><br />This statement-as-expression paradigm promotes consistency and prevents arbitrary language limitations, such as Python's restriction of <tt>lambda</tt>s to single statements. In Haskell, you cannot limit the number of statements a term uses any more than you can limit the number of sub-expressions.<br /><br /><br /><h4>Monads</h4><br /><tt>do</tt> notation works for more than just <tt>IO</tt>. 
Any type that implements the <tt>Monad</tt> class can be "sequenced" in statement form, as long as it supports the following two operations: <pre><br />class Monad m where<br /> (>>=) :: m a -> (a -> m b) -> m b<br /><br /> return :: a -> m a<br /></pre>This provides a uniform interface for translating imperative statement-like syntax into expressions under the hood.<br /><br />For example, the <tt>Maybe</tt> type (Haskell's version of nullable) implements the <tt>Monad</tt> class: <pre><br />data Maybe a = Nothing | Just a<br /><br />instance Monad Maybe where<br /> m >>= f = case m of<br /> Nothing -> Nothing<br /> Just a -> f a<br /><br /> return = Just<br /></pre>This lets you assemble <tt>Maybe</tt>-based computations using <tt>do</tt> notation, like so: <pre><br />example :: Maybe Int<br />example = do<br /> x <- Just 1<br /> y <- Nothing<br /> return (x + y)<br /></pre>The above code desugars to nested calls to <tt>(>>=)</tt>: <pre><br />example =<br /> Just 1 >>= (\x -><br /> Nothing >>= (\y -><br /> return (x + y) ))<br /></pre>The compiler then substitutes in our definition of <tt>(>>=)</tt> and <tt>return</tt>, which produces the following expression: <pre><br />example = case (Just 1) of<br /> Nothing -> Nothing<br /> Just x -> case Nothing of<br /> Nothing -> Nothing<br /> Just y -> Just (x + y)<br /></pre>We can then hand-evaluate this expression to prove that it short-circuits when it encounters <tt>Nothing</tt>: <pre><br />-- Evaluate the outer `case`<br />example = case Nothing of<br /> Nothing -> Nothing<br /> Just y -> Just (1 + y)<br /><br />-- Evaluate the remaining `case`<br />example = Nothing<br /></pre><br /><h4>Semantics</h4><br />Notice that we can evaluate these <tt>Maybe</tt> "statements" without invoking any sort of abstract machine. 
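The same substitutions work when every step succeeds (a variant of my own, not from the original post): if we replace the <tt>Nothing</tt> with <tt>Just 2</tt>, the expression evaluates to <tt>Just 3</tt>: <pre><br />example2 :: Maybe Int<br />example2 = do<br /> x <- Just 1<br /> y <- Just 2<br /> return (x + y)<br /><br />-- Desugars to:<br />-- Just 1 >>= (\x -> Just 2 >>= (\y -> return (x + y)))<br />--<br />-- ... and evaluates by substitution:<br />-- = Just 2 >>= (\y -> return (1 + y))<br />-- = return (1 + 2)<br />-- = Just 3<br /></pre>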
When everything is an expression, everything is simple to evaluate and does not require understanding or invoking an execution model.<br /><br />In fact, the distinction between statements and expressions also closely parallels another important divide: the difference between operational semantics and denotational semantics. <blockquote><b>Operational semantics:</b> Translates code to abstract machine <i>statements</i></blockquote><blockquote><b>Denotational semantics:</b> Translates code to mathematical <i>expressions</i></blockquote>Haskell teaches you to think denotationally in terms of expressions and their meanings instead of statements and an abstract machine. This is why Haskell makes you a better programmer: you separate your mental model from the underlying execution model, so you can more easily identify common patterns between diverse programming languages and problem domains.Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com15tag:blogger.com,1999:blog-1777990983847811806.post-52519473598292958632013-06-28T23:13:00.000-07:002013-06-29T09:20:56.110-07:00The Resource MonadEdit: tomejaguar points out on <a href="http://www.reddit.com/r/haskell">/r/haskell</a> that there is a <tt>Monad</tt> instance for this type. The original version of this post said that <tt>Resource</tt> was only an <tt>Applicative</tt>. See the discussion <a href="http://www.reddit.com/r/haskell/comments/1haums/the_resource_applicative/casj6pl">here</a>.<br /><br />I'm writing this post to briefly share a neat trick to manage acquisition and release of multiple resources using <tt>Monad</tt>s. I haven't seen this trick in the wild, so I thought it was worth mentioning.<br /><br /><br /><h4>Resources</h4><br />A <tt>Resource</tt> is like a handle with built-in allocation and deallocation logic. 
The type of a <tt>Resource</tt> is simple: <pre><br />newtype Resource a = Resource { acquire :: IO (a, IO ()) }<br /></pre>A <tt>Resource</tt> is an <tt>IO</tt> action which acquires some resource of type <tt>a</tt> and also returns a finalizer of type <tt>IO ()</tt> that releases the resource. You can think of the <tt>a</tt> as a <tt>Handle</tt>, but it can really be anything which can be acquired or released, like a <tt>Socket</tt> or AMQP <tt>Connection</tt>.<br /><br />We can also provide an exception-safe way to access a <tt>Resource</tt> using <tt>bracket</tt>: <pre><br />runResource :: Resource a -> (a -> IO ()) -> IO ()<br />runResource resource k = bracket (acquire resource)<br /> (\(_, release) -> release)<br /> (\(a, _) -> k a)<br /></pre>This ensures every acquisition is paired with a release.<br /><br /><br /><h4>Theory</h4><br /><tt>Resource</tt> is a <tt>Functor</tt>, <tt>Applicative</tt>, and <tt>Monad</tt>, using the following instances: <pre><br />instance Functor Resource where<br /> fmap f resource = Resource $ do<br /> (a, release) <- acquire resource<br /> return (f a, release)<br /><br />instance Applicative Resource where<br /> pure a = Resource (pure (a, pure ()))<br /> resource1 <*> resource2 = Resource $ do<br /> (f, release1) <- acquire resource1<br /> (x, release2) <- acquire resource2 `onException` release1<br /> return (f x, release2 >> release1)<br /><br />instance Monad Resource where<br /> return a = Resource (return (a, return ()))<br /> m >>= f = Resource $ do<br /> (m', release1) <- acquire m<br /> (x , release2) <- acquire (f m') `onException` release1<br /> return (x, release2 >> release1)<br /></pre>These instances satisfy the <tt>Functor</tt>, <tt>Applicative</tt>, and <tt>Monad</tt> laws, assuming only that <tt>IO</tt> satisfies the <tt>Monad</tt> laws. 
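To see what the <tt>Monad</tt> instance adds over <tt>Applicative</tt>, here is a small sketch of my own (not from the original post): <tt>(>>=)</tt> lets a later acquisition depend on a value produced by an earlier one. The <tt>managed</tt> helper and the file names below are hypothetical: <pre><br />import System.IO<br /><br />-- Hypothetical helper: pair an acquire action with its release action<br />managed :: IO a -> (a -> IO ()) -> Resource a<br />managed acq rel = Resource $ do<br /> a <- acq<br /> return (a, rel a)<br /><br />-- The path of the second file is read from the first file, so the<br />-- second acquisition depends on the first one's value<br />indirect :: Resource Handle<br />indirect = do<br /> h <- managed (openFile "pointer.txt" ReadMode) hClose<br /> path <- Resource $ fmap (\p -> (p, return ())) (hGetLine h)<br /> managed (openFile path ReadMode) hClose<br /></pre>The <tt>Applicative</tt> instance alone could not express this, because both acquisitions would have to be known up front.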
<br /><br /><h4>Examples</h4><br />The classic example of a managed resource is a file: <pre><br />import Resource -- The above code<br />import System.IO<br /><br />file :: IOMode -> FilePath -> Resource Handle<br />file mode path = Resource $ do<br /> handle <- openFile path mode<br /> return (handle, hClose handle)<br /></pre>Using the <tt>Applicative</tt> instance we can easily combine an input and output file into a single resource: <pre><br />import Control.Applicative<br /><br />inAndOut :: Resource (Handle, Handle)<br />inAndOut = (,) <$> file ReadMode "file1.txt"<br /> <*> file WriteMode "out.txt"<br /></pre>... and acquire both handles in one step using <tt>runResource</tt>: <pre><br />main = runResource inAndOut $ \(hIn, hOut) -> do<br /> str <- hGetContents hIn<br /> hPutStr hOut str<br /></pre>The above program will copy the contents of <tt>file1.txt</tt> to <tt>out.txt</tt>: <pre><br />$ cat file1.txt<br />Line 1<br />Line 2<br />Line 3<br />$ ./example<br />$ cat out.txt<br />Line 1<br />Line 2<br />Line 3<br />$<br /></pre>Even cooler, we can allocate an entire list of <tt>Handle</tt>s in one fell swoop, using <tt>traverse</tt> from <tt>Data.Traversable</tt>: <pre><br />import qualified Data.Traversable as T<br />import Control.Monad<br />import System.Environment<br /><br />main = do<br /> filePaths <- getArgs<br /> let files :: Resource [Handle]<br /> files = T.traverse (file ReadMode) filePaths<br /> runResource files $ \hs -> do<br /> forM_ hs $ \h -> do<br /> str <- hGetContents h<br /> putStr str<br /><br /></pre>The above program behaves like <tt>cat</tt>, concatenating the contents of all the files passed on the command line: <pre><br />$ cat file1.txt<br />Line 1<br />Line 2<br />Line 3<br />$ cat file2.txt<br />Line 4<br />Line 5<br />Line 6<br />$ ./example file1.txt file2.txt file1.txt<br />Line 1<br />Line 2<br />Line 3<br />Line 4<br />Line 5<br />Line 6<br />Line 1<br />Line 2<br />Line 3<br />$<br /></pre>The above example is 
gratuitous because we could have acquired just one handle at a time. However, you will appreciate how useful this is if you ever need to acquire multiple managed resources in an exception-safe way without using <tt>Resource</tt>.<br /><br /><br /><h4>Conclusion</h4><br />I haven't seen this in any library on Hackage, so if there is any interest in this abstraction I can package it up into a small library. I can see this being used when you can't predict in advance how many resources you will need to acquire or as a convenient way to bundle multiple managed resources into a single data type.<br /><br /><br /><h4>Appendix</h4><br />I've included code listings for the above examples so people can experiment with them: <pre><br />-- Resource.hs<br /><br />module Resource where<br /><br />import Control.Applicative (Applicative(pure, (<*>)))<br />import Control.Exception (bracket, onException)<br /><br />newtype Resource a = Resource { acquire :: IO (a, IO ()) }<br /><br />instance Functor Resource where<br /> fmap f resource = Resource $ do<br /> (a, release) <- acquire resource<br /> return (f a, release)<br /><br />instance Applicative Resource where<br /> pure a = Resource (pure (a, pure ()))<br /> resource1 <*> resource2 = Resource $ do<br /> (f, release1) <- acquire resource1<br /> (x, release2) <- acquire resource2 `onException` release1<br /> return (f x, release2 >> release1)<br /><br />instance Monad Resource where<br /> return a = Resource (return (a, return ()))<br /> m >>= f = Resource $ do<br /> (m', release1) <- acquire m<br /> (x , release2) <- acquire (f m') `onException` release1<br /> return (x, release2 >> release1)<br /><br />runResource :: Resource a -> (a -> IO ()) -> IO ()<br />runResource resource k = bracket (acquire resource)<br /> (\(_, release) -> release)<br /> (\(a, _) -> k a)<br /></pre><pre><br />-- example.hs<br /><br />import Control.Applicative<br />import Control.Monad<br />import qualified Data.Traversable as T<br />import 
Resource<br />import System.Environment<br />import System.IO<br /><br />file :: IOMode -> FilePath -> Resource Handle<br />file mode path = Resource $ do<br /> handle <- openFile path mode<br /> return (handle, hClose handle)<br /><br />inAndOut :: Resource (Handle, Handle)<br />inAndOut =<br /> (,) <$> file ReadMode "file1.txt"<br /> <*> file WriteMode "out.txt"<br /><br />main = runResource inAndOut $ \(hIn, hOut) -> do<br /> str <- hGetContents hIn<br /> hPutStr hOut str<br /><br />{-<br />main = do<br /> filePaths <- getArgs<br /> let files :: Resource [Handle]<br /> files = T.traverse (file ReadMode) filePaths<br /> runResource files $ \hs -> do<br /> forM_ hs $ \h -> do<br /> str <- hGetContents h<br /> putStr str<br />-}<br /></pre>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-56320499826933351332013-06-23T07:00:00.000-07:002013-06-25T16:28:42.424-07:00From zero to cooperative threads in 33 lines of Haskell codeHaskell differentiates itself from most functional languages by having deep cultural roots in mathematics and computer science, which gives the misleading impression that Haskell is poorly suited to solving practical problems. However, the more you learn Haskell, the more you appreciate that theory is often the most practical solution to many common programming problems. This post will underscore this point by mixing off-the-shelf theoretical building blocks to create a pure user-land threading system.<br /><br /><br /><h4>The Type</h4><br />Haskell is a types-first language, so we will begin by choosing an appropriate type to represent threads. First we must state in plain English what we want threads to do: <ul><li> Threads must extend existing sequences of instructions <li> Threads must permit a set of operations: forking, yielding control, and terminating. 
<li> Threads should permit multiple types of schedulers </ul>Now we translate those concepts into Haskell: <ul><li> When you hear "multiple interpreters/schedulers/backends" you should think "free" (as in "free object") <li> When you hear "sequence of instructions" you should think: "monad". <li> When you qualify that with "extend" you should think: "monad transformer". </ul>Combine those words together and you get the correct mathematical solution: a "free monad transformer".<br /><br /><br /><h4>Syntax trees</h4><br />"Free monad transformer" is a fancy mathematical name for an abstract syntax tree where sequencing plays an important role. We provide it with an instruction set and it builds us a syntax tree from those instructions.<br /><br />We said we want our thread to be able to fork, yield, or terminate, so let's make a data type that forks, yields, or terminates: <pre><br />{-# LANGUAGE DeriveFunctor #-}<br /><br />data ThreadF next = Fork next next<br /> | Yield next<br /> | Done<br /> deriving (Functor)<br /></pre><tt>ThreadF</tt> represents our instruction set. We want to add three new instructions, so <tt>ThreadF</tt> has three constructors, one for each instruction: <tt>Fork</tt>, <tt>Yield</tt>, and <tt>Done</tt>.<br /><br />Our <tt>ThreadF</tt> type represents one node in our syntax tree. The <tt>next</tt> fields of the constructors represent where the children nodes should go. <tt>Fork</tt> creates two execution paths, so it has two children. <tt>Done</tt> terminates the current execution path, so it has zero children. <tt>Yield</tt> neither branches nor terminates, so it has one child. The <tt>deriving (Functor)</tt> part just tells the free monad transformer that the <tt>next</tt> fields are where the children should go.<br /><br />Now the free monad transformer, <tt>FreeT</tt>, can build a syntax tree from our instruction set. 
We will call this tree a <tt>Thread</tt>: <pre><br />import Control.Monad.Trans.Free -- from the `free` package<br /><br />type Thread = FreeT ThreadF<br /></pre>An experienced Haskell programmer would read the above code as saying "A <tt>Thread</tt> is a syntax tree built from <tt>ThreadF</tt> instructions".<br /><br /><br /><h4>Instructions</h4><br />Now we need primitive instructions. The <tt>free</tt> package provides the <tt>liftF</tt> operation which converts a single instruction into a syntax tree one node deep: <pre><br />yield :: (Monad m) => Thread m ()<br />yield = liftF (Yield ())<br /><br />done :: (Monad m) => Thread m r<br />done = liftF Done<br /><br />cFork :: (Monad m) => Thread m Bool<br />cFork = liftF (Fork False True)<br /></pre>You don't need to completely understand how that works, except to notice that the return value of each command corresponds to what we store in the child fields of the node: <ul><li> The <tt>yield</tt> command stores <tt>()</tt> as its child, so its return value is <tt>()</tt><li> The <tt>done</tt> command has no children, so the compiler deduces that it has a polymorphic return value (i.e. <tt>r</tt>), meaning that it never finishes <li> The <tt>cFork</tt> command stores boolean values as children, so it returns a <tt>Bool</tt></ul><tt>cFork</tt> gets its name because it behaves like the <tt>fork</tt> function from C, meaning that the <tt>Bool</tt> return value tells us which branch we are on after the fork. 
If we receive <tt>False</tt> then we are on the left branch and if we receive <tt>True</tt> then we are on the right branch.<br /><br />We can combine <tt>cFork</tt> and <tt>done</tt> to reimplement a more traditional Haskell-style fork, using the convention that the left branch is the "parent" and the right branch is the "child": <pre><br />import Control.Monad<br /><br />fork :: (Monad m) => Thread m a -> Thread m ()<br />fork thread = do<br /> child <- cFork<br /> when child $ do<br /> thread<br /> done<br /></pre>The above code calls <tt>cFork</tt> and then says "if I am the child, run the forked action and then stop, otherwise proceed as normal".<br /><br /><br /><h4>Free monads</h4><br />Notice that something unusual happened in the last code snippet. We assembled primitive <tt>Thread</tt> instructions like <tt>cFork</tt> and <tt>done</tt> using <tt>do</tt> notation and we got a new <tt>Thread</tt> back. This is because Haskell lets us use <tt>do</tt> notation to assemble any type that implements the <tt>Monad</tt> interface and our free monad transformer type automatically deduces the correct <tt>Monad</tt> instance for <tt>Thread</tt>. Convenient!<br /><br />Actually, our free monad transformer is not being super smart at all. When we assemble free monad transformers using <tt>do</tt> notation, all it does is connect these primitive one-node-deep syntax trees (i.e. the instructions) into a larger syntax tree. When we sequence two commands like: <pre><br />do yield<br /> done<br /></pre>... this desugars to just storing the second command (i.e. <tt>done</tt>) as a child of the first command (i.e. <tt>yield</tt>).<br /><br /><br /><h4>The scheduler</h4><br />Now we're going to write our own thread scheduler. 
This will be a naive round-robin scheduler: <pre><br />import Data.Sequence -- Queue with O(1) head and tail operations<br /><br />roundRobin :: (Monad m) => Thread m a -> m ()<br />roundRobin t = go (singleton t) -- Begin with a single thread<br />  where<br />    go ts = case (viewl ts) of<br />        -- The queue is empty: we're done!<br />        EmptyL -> return ()<br /><br />        -- The queue is non-empty: Process the first thread<br />        t :< ts' -> do<br />            x <- runFreeT t -- Run this thread's effects<br />            case x of<br />                -- New threads go to the back of the queue<br />                Free (Fork t1 t2) -> go (t1 <| (ts' |> t2))<br /><br />                -- Yielding threads go to the back of the queue<br />                Free (Yield t') -> go (ts' |> t')<br /><br />                -- Thread done: Remove the thread from the queue<br />                Free Done -> go ts'<br />                Pure _ -> go ts'<br /></pre>... and we're done! No really, that's it! That's the whole threading implementation.<br /><br /><br /><h4>User-land threads</h4><br />Let's try out our brave new threading system. We'll start off simple: <pre><br />mainThread :: Thread IO ()<br />mainThread = do<br />    lift $ putStrLn "Forking thread #1"<br />    fork thread1<br />    lift $ putStrLn "Forking thread #2"<br />    fork thread2<br /><br />thread1 :: Thread IO ()<br />thread1 = forM_ [1..10] $ \i -> do<br />    lift $ print i<br />    yield<br /><br />thread2 :: Thread IO ()<br />thread2 = replicateM_ 3 $ do<br />    lift $ putStrLn "Hello"<br />    yield<br /></pre>Each of these threads has type <tt>Thread IO ()</tt>. <tt>Thread</tt> is a "monad transformer", meaning that it extends an existing monad with additional functionality.
In this case, we are extending the <tt>IO</tt> monad with our user-land threads, which means that any time we need to call <tt>IO</tt> actions we must use <tt>lift</tt> to distinguish <tt>IO</tt> actions from <tt>Thread</tt> actions.<br /><br />When we call <tt>roundRobin</tt> we unwrap the <tt>Thread</tt> monad transformer and our threaded program collapses to a linear sequence of instructions in <tt>IO</tt>: <pre><br />>>> roundRobin mainThread :: IO ()<br />Forking thread #1<br />Forking thread #2<br />1<br />Hello<br />2<br />Hello<br />3<br />Hello<br />4<br />5<br />6<br />7<br />8<br />9<br />10<br /></pre>Moreover, this threading system is pure! We can extend monads other than <tt>IO</tt>, yet still thread effects. For example, we can build a threaded <tt>Writer</tt> computation, where <tt>Writer</tt> is one of Haskell's many pure monads: <pre><br />import Control.Monad.Trans.Writer<br /><br />logger :: Thread (Writer [String]) ()<br />logger = do<br />    fork helper<br />    lift $ tell ["Abort"]<br />    yield<br />    lift $ tell ["Fail"]<br /><br />helper :: Thread (Writer [String]) ()<br />helper = do<br />    lift $ tell ["Retry"]<br />    yield<br />    lift $ tell ["!"]<br /></pre>This time <tt>roundRobin</tt> produces a pure <tt>Writer</tt> action when we run <tt>logger</tt>: <pre><br />roundRobin logger :: Writer [String] ()<br /></pre>... and we can extract the results of that logging action purely, too: <pre><br />execWriter (roundRobin logger) :: [String]<br /></pre>Notice how the type evaluates to a pure value, a list of <tt>String</tt>s in this case. Yet, we still get real threading of logged values: <pre><br />>>> execWriter (roundRobin logger)<br />["Abort","Retry","Fail","!"]<br /></pre><br /><h4>Conclusion</h4><br />You might think I'm cheating by off-loading the real work onto the <tt>free</tt> library, but all the functionality I used from that library boils down to 12 lines of very generic and reusable code (see the Appendix).
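In fact, the whole system can be checked end to end. The sketch below inlines the Appendix's 12 lines alongside the definitions from this post and reruns the pure <tt>Writer</tt> example; the only additions are the <tt>Functor</tt> and <tt>Applicative</tt> instances that modern GHC also requires, and it uses only libraries bundled with GHC:

```haskell
{-# LANGUAGE DeriveFunctor #-}

import Control.Monad (ap, liftM, when)
import Control.Monad.Trans.Class (MonadTrans (lift))
import Control.Monad.Trans.Writer (Writer, execWriter, tell)
import Data.Sequence (ViewL (..), singleton, viewl, (<|), (|>))

-- The instruction set from earlier in the post
data ThreadF next = Yield next | Fork next next | Done
    deriving (Functor)

-- The free monad transformer from the Appendix
data FreeF f a x = Pure a | Free (f x)

newtype FreeT f m a =
    FreeT { runFreeT :: m (FreeF f a (FreeT f m a)) }

-- Modern GHC additionally requires Functor/Applicative instances
instance (Functor f, Monad m) => Functor (FreeT f m) where
    fmap = liftM

instance (Functor f, Monad m) => Applicative (FreeT f m) where
    pure a = FreeT (return (Pure a))
    (<*>)  = ap

instance (Functor f, Monad m) => Monad (FreeT f m) where
    FreeT m >>= f = FreeT $ m >>= \v -> case v of
        Pure a -> runFreeT (f a)
        Free w -> return (Free (fmap (>>= f) w))

instance MonadTrans (FreeT f) where
    lift = FreeT . liftM Pure

liftF :: (Functor f, Monad m) => f r -> FreeT f m r
liftF x = FreeT (return (Free (fmap return x)))

type Thread = FreeT ThreadF

yield :: (Monad m) => Thread m ()
yield = liftF (Yield ())

done :: (Monad m) => Thread m r
done = liftF Done

cFork :: (Monad m) => Thread m Bool
cFork = liftF (Fork False True)

fork :: (Monad m) => Thread m a -> Thread m ()
fork thread = do
    child <- cFork
    when child $ do
        _ <- thread
        done

-- The round-robin scheduler from the post
roundRobin :: (Monad m) => Thread m a -> m ()
roundRobin t = go (singleton t)
  where
    go ts = case viewl ts of
        EmptyL    -> return ()
        t' :< ts' -> do
            x <- runFreeT t'
            case x of
                Free (Fork t1 t2) -> go (t1 <| (ts' |> t2))
                Free (Yield t'')  -> go (ts' |> t'')
                Free  Done        -> go ts'
                Pure  _           -> go ts'

-- The pure Writer example
logger :: Thread (Writer [String]) ()
logger = do
    fork helper
    lift $ tell ["Abort"]
    yield
    lift $ tell ["Fail"]

helper :: Thread (Writer [String]) ()
helper = do
    lift $ tell ["Retry"]
    yield
    lift $ tell ["!"]
```

The <tt>IO</tt> examples from the post work the same way, instantiating <tt>m</tt> to <tt>IO</tt>.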
This is a recurring theme in Haskell: when we stick to the theory we get reusable, elegant, and powerful solutions in a shockingly small amount of code.<br /><br />The inspiration for this post was a computer science paper written by Peng Li and Steve Zdancewic titled <a href="http://www.cis.upenn.edu/~stevez/papers/LZ06b.pdf">A Language-based Approach to Unifying Events and Threads</a>. The main difference is that I converted their continuation-based approach to a simpler free monad approach.<br /><br />Edit: <tt>aseipp</tt> on <a href="http://www.reddit.com/r/haskell/">/r/haskell</a> just pointed out that my post is scarily similar to a pre-existing functional pearl: <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.39.8039">A Poor Man's Concurrency Monad</a> by Koen Claessen. <br /><h4>Appendix: Free monad transformer code</h4><br />The essence of a syntax tree distilled into 12 lines of code: <pre><br />data FreeF f a x = Pure a | Free (f x)<br /><br />newtype FreeT f m a =<br />    FreeT { runFreeT :: m (FreeF f a (FreeT f m a)) }<br /><br />instance (Functor f, Monad m) => Monad (FreeT f m) where<br />    return a = FreeT (return (Pure a))<br />    FreeT m >>= f = FreeT $ m >>= \v -> case v of<br />        Pure a -> runFreeT (f a)<br />        Free w -> return (Free (fmap (>>= f) w))<br /><br />instance MonadTrans (FreeT f) where<br />    lift = FreeT . 
liftM Pure<br /><br />liftF :: (Functor f, Monad m) => f r -> FreeT f m r<br />liftF x = FreeT (return (Free (fmap return x)))<br /></pre>Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0tag:blogger.com,1999:blog-1777990983847811806.post-62964721959904429532013-06-06T06:39:00.001-07:002013-06-07T17:47:54.592-07:00pipes-concurrency-1.2.0: Behaviors and broadcasts<tt>pipes-concurrency-1.2.0</tt> adds two cool new features: <ul><li> "continuous" events (what <tt>reactive-banana</tt> calls <tt>Behavior</tt>s) <li> broadcasts </ul>Both of them are very interesting, both practically and theoretically.<br /><br /><h4>Behaviors</h4><br />One of the major deficiencies of the <tt>pipes-concurrency-1.0</tt> API was the requirement that every event be handled by downstream listeners. This is the sort of interaction that <tt>reactive-banana</tt> would call an <tt>Event</tt>, where there is a one-to-one correspondence between production of events and consumption of events.<br /><br />However, this model breaks down when you have two things updating at different rates. For example, if you have a mouse and a monitor you don't want either one tied to the other's update cycle. You can't have the monitor respond to every mouse event because it cannot refresh fast enough. Similarly, you cannot have the mouse's event loop track the monitor's refresh rate because then you will develop an ever-increasing backlog of events.<br /><br /><tt>pipes-concurrency</tt> solves this by providing a new option for the <tt>spawn</tt> command: the <tt>Latest</tt> constructor. This replaces the internal buffer with a single value that simply tracks the "latest" value received from upstream, discarding old values immediately.
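The semantics of <tt>Latest</tt> are easy to model outside of <tt>pipes</tt> with a single <tt>TVar</tt> (a sketch only; <tt>newLatest</tt> is a hypothetical name, not part of the <tt>pipes-concurrency</tt> API):

```haskell
import Control.Concurrent.STM
    (STM, atomically, newTVarIO, readTVar, writeTVar)

-- A model of the `Latest` buffer: a mailbox holding exactly one
-- value.  Sending overwrites the previous value instead of queuing
-- it, and receiving never blocks -- it reads whatever is current.
newLatest :: a -> IO (a -> STM (), STM a)
newLatest a0 = do
    tv <- newTVarIO a0
    return (writeTVar tv, readTVar tv)
```

A writer can update the value at any rate while a reader polls it at its own pace; intermediate values are simply discarded.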
This completely decouples the input and output ends, which no longer need to match each other's pace: <pre><br />main = do<br />    (input, output) <- spawn (Latest 0)<br />    -- Update at any rate<br />    a1 <- async $ do runProxy $ events >-> sendD input<br />                     performGC<br />    -- Poll the latest value<br />    a2 <- async $ do runProxy $ recvS output >-> handler<br />                     performGC<br />    mapM_ wait [a1, a2]<br /></pre>The most fascinating part of this addition is how unintrusive it is. This only required adding 7 lines of code (1 for the new constructor, and 6 for the additional case branch for <tt>spawn</tt>), yet all of the advanced features of the library (like termination detection and deadlock safety) still work correctly with this new <tt>Latest</tt> constructor.<br /><br />This will most likely interest people interested in functional reactive programming (FRP), because it flies in the face of conventional thinking where <tt>Event</tt>s and <tt>Behavior</tt>s are typically treated very differently. <tt>pipes-concurrency</tt> suggests that these two concepts may actually be more similar than people suspect when viewed through the appropriate lens.<br /><br />What differentiates <tt>pipes-concurrency</tt> from typical reactive programming abstractions is that it does not reify the streams as the central component for non-deterministic concurrency. In fact, the core machinery of <tt>pipes-concurrency</tt> is entirely <tt>pipes</tt>-independent, with the <tt>pipes</tt> utilities just forming a convenient interface. Rather, the central abstraction is the mailbox and how it coordinates inbound messages with outbound messages.
When you view reactive programming through the lens of mailboxes instead of streaming processes, <tt>Behavior</tt>s and <tt>Event</tt>s differ only very slightly in their implementations and they both support the same interface.<br /><br /><br /><h4>Broadcasts</h4><br />Broadcasts are a neat little feature that arose out of a discussion with Florian Hofmann. Now <tt>Input</tt>s are <tt>Monoid</tt>s: if you want to broadcast to multiple mailboxes, just <tt>mconcat</tt> their <tt>Input</tt>s and <tt>send</tt> messages to the combined <tt>Input</tt>: <pre><br />import Control.Monad<br />import Control.Concurrent.Async<br />import Control.Proxy<br />import Control.Proxy.Concurrent<br />import Data.Monoid<br /><br />main = do<br />    -- Subscribers<br />    (inputs, outputs) <-<br />        fmap unzip $ replicateM 10 $ spawn Unbounded<br /><br />    -- Broadcast to all subscribers<br />    a <- async $ do runProxy $ stdinS >-> sendD (mconcat inputs)<br />                    performGC<br /><br />    -- Receive broadcasted messages<br />    as <- forM outputs $ \o -><br />        async $ do runProxy $ recvS o >-> stdoutD<br />                   performGC<br /><br />    mapM_ wait (a:as)<br /></pre>This shows how handy and simple-to-use the <tt>Monoid</tt> class is.<br /><br />There are also new type class instances for the <tt>Output</tt> type, such as <tt>Monad</tt> and <tt>Alternative</tt>! This is a great example of how the mailboxes make a more useful and theoretically interesting interface for concurrency than the streams.<br /><br /><br /><h4>Conclusions</h4><br />This update adds some cool new features, but is very unlikely to break code.
The main backwards-incompatible change was renaming the <tt>Size</tt> type to <tt>Buffer</tt> (since <tt>Size</tt> does not make sense for <tt>Latest</tt>), but other than that most existing code should work without any modifications.Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com2tag:blogger.com,1999:blog-1777990983847811806.post-17906934695703792332013-06-02T12:54:00.003-07:002013-06-02T12:59:55.456-07:00pipes-parse-1.0.0: Pushback, delimited parsers, resumable parsing, and lenses<tt>pipes-parse</tt> is finally out! <tt>pipes</tt> users know that <tt>pipes</tt> has lagged behind <tt>conduit</tt> and <tt>io-streams</tt> in the parsing arena and this library provides the utilities necessary to close the gap. You can find the <a href="http://hackage.haskell.org/package/pipes-parse"><tt>pipes-parse</tt> library here</a>, and I recommend reading <a href="http://hackage.haskell.org/packages/archive/pipes-parse/1.0.0/doc/html/Control-Proxy-Parse-Tutorial.html">the tutorial</a>. This post will mainly discuss the development of <tt>pipes-parse</tt> and compare it to parsing solutions from other streaming libraries.<br /><br /><br /><h4>End of Input</h4><br /><tt>pipes-parse</tt> copies both <tt>io-streams</tt> and <tt>conduit</tt> for the end of input protocol: wrap values in <tt>Just</tt> and end with a stream of <tt>Nothing</tt>s. There are two ways you can modify an input stream to obey this protocol.<br /><br />The first approach is to use the <tt>wrap</tt> function, which enforces this protocol: <pre><br />wrap :: (Monad m, Proxy p)<br /> => p a' a b' b m r -> p a' a b' (Maybe b) m s<br /></pre>Then you can just write: <pre><br />wrap . 
producer >-> consumer<br /></pre><tt>wrap</tt> proves its termination safety by having a polymorphic return value (because it ends with a never-ending stream of <tt>Nothing</tt>s).<br /><br />The second approach is to rewrite your producer as an input stream in the style of <tt>io-streams</tt> (see <a href="http://www.haskellforall.com/2013/04/pipes-and-io-streams.html">this post</a> for details): <pre><br />source' :: (Monad m, Proxy p) => () -> Session p m (Maybe a)<br /></pre>... and use <tt>request</tt> composition to connect the producer: <pre><br />source' \>\ consumer<br /></pre>This approach proves its termination safety by virtue of using <tt>request</tt> composition. The composition operator specializes to: <pre><br />(\>\) :: (Monad m, Proxy p)<br /> => (() -> Session p m (Maybe a))<br /> -> (() -> Consumer p (Maybe a) m b )<br /> -> (() -> Session p m b )<br /></pre>The composite pipe's return value only derives from the downstream pipe (i.e. <tt>consumer</tt> in this case). This is because <tt>request</tt> composition is automatically safe against termination from the upstream pipe. In the above example, <tt>source'</tt> just replaces every <tt>request</tt> within <tt>consumer</tt>, and if <tt>source'</tt> terminates, all that means is that the <tt>request</tt> completes.<br /><br />What's nice is that both approaches are 100% compatible with each other. You, the pipe writer, do not need to anticipate which way users will supply input. You just write a pipe that consumes values of type <tt>Maybe a</tt> and both of the above approaches will work with your pipe.
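For intuition, the same protocol is easy to state on plain lists (<tt>wrapList</tt> is a hypothetical illustration, not part of <tt>pipes-parse</tt>):

```haskell
-- The end-of-input protocol on lists: wrap each value in Just and
-- follow the input with a never-ending stream of Nothings, so a
-- consumer can always draw a value and detect exhaustion.
wrapList :: [a] -> [Maybe a]
wrapList xs = map Just xs ++ repeat Nothing
```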
Also, both of these approaches guarantee that you can return values directly from the downstream pipe without guarding the return value with a <tt>Maybe</tt>.<br /><br /><br /><h4>Pushback and Leftovers</h4><br /><tt>pipes</tt> implements pushback using the <tt>StateP</tt> proxy transformer: <pre><br />-- Like @request ()@, except try to use the leftovers buffer first<br />draw :: (Monad m, Proxy p)<br /> => StateP [a] p () (Maybe a) y' y m (Maybe a)<br /><br />-- Push an element back onto the leftovers buffer <br />unDraw :: (Monad m, Proxy p)<br /> => a -> StateP [a] p x' x y' y m ()<br /></pre>This is a great example of how the proxy transformer system makes it easy to extend <tt>pipes</tt> with new features without baking them into the core implementation. I can use the (newly-fixed) <tt>StateP</tt> proxy transformer to add a leftovers buffer that <tt>draw</tt> and <tt>unDraw</tt> both use.<br /><br />Pushback is where <tt>pipes-parse</tt> <b>significantly</b> improves on the competition. To motivate the <tt>pipes-parse</tt> solution, consider the type for <tt>conduit</tt>'s most general composition operator: <pre><br />(>+>) :: Monad m<br /> => Pipe l a b r0 m r1<br /> -> Pipe Void b c r1 m r2<br /> -> Pipe l a c r0 m r2<br />-- ^<br />-- |<br />-- +-- Leftovers<br /></pre>The downstream conduit cannot provide leftovers because they will be lost after composition. With <tt>pipes-parse</tt> you can save leftovers from both composed pipes very easily. To see how, imagine we have the following two pipe types: <pre><br />p1 :: (Monad m, Proxy p)<br /> => () -> Pipe (StateP [a] p) (Maybe a) (Maybe b) m r<br />p2 :: (Monad m, Proxy p)<br /> => () -> Pipe (StateP [b] p) (Maybe b) (Maybe c) m r<br />-- ^<br />-- |<br />-- +-- Leftovers<br /></pre>Each of these pipes stores a leftovers buffer equal to its input type, but we can't yet compose these pipes because their leftovers buffers don't match. 
However, <tt>pipes-parse</tt> provides lens support in the form of the <tt>zoom</tt> function so that you can easily unify two leftovers buffers in order to compose them: <pre><br />zoom _fst . p1<br /> :: (Monad m, Proxy p)<br /> => () -> Pipe (StateP ([a], [b]) p) (Maybe a) (Maybe b) m r<br />zoom _snd . p2<br /> :: (Monad m, Proxy p)<br /> => () -> Pipe (StateP ([a], [b]) p) (Maybe b) (Maybe c) m r<br /><br />zoom _fst . p1 >-> zoom _snd . p2<br /> :: (Monad m, Proxy p)<br /> => () -> Pipe (StateP ([a], [b]) p) (Maybe a) (Maybe c) m r<br /></pre>But you can do more than that! You can still access the leftovers buffers afterwards, too, again using <tt>zoom</tt>: <pre><br />example = do<br />    (zoom _fst . p1 >-> zoom _snd . p2) ()<br />    -- Draw, reusing the leftovers from @p1@<br />    ma <- zoom _fst draw<br />    -- Retrieve the leftovers from @p2@<br />    mb <- zoom _snd get<br />    ...<br /></pre>This kind of multiple-buffer management isn't possible using <tt>conduit</tt>.<br /><br /><tt>zoom</tt> is a perfect example of the <a href="http://www.haskellforall.com/2012/09/the-functor-design-pattern.html">functor design pattern</a>. We lift two existing proxies to agree on a common global state for compatibility purposes. Therefore, we expect that there should be functor laws at play: <pre><br />zoom id = id<br /><br />zoom (f . g) = zoom f . zoom g<br /></pre><br /><tt>pipes</tt> improves upon <tt>io-streams</tt> pushback, too. With <tt>io-streams</tt> all the push-back is done using <tt>IORef</tt>s, meaning that: <ul><li> It isn't pure <li> You can't easily control which streams share leftovers and which ones do not <li> None of the state is reflected in the types </ul>With <tt>pipes-parse</tt> you get pure and precise control over leftovers.
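The leftovers-plus-<tt>zoom</tt> idea can be sketched with no <tt>pipes</tt> machinery at all, using plain <tt>State</tt> and a van Laarhoven lens (<tt>unDraw</tt>, <tt>drawLeftover</tt>, and this <tt>zoom</tt> are simplified stand-ins for the <tt>StateP</tt>-based originals):

```haskell
{-# LANGUAGE RankNTypes #-}

import Control.Monad.Trans.State
    (State, get, modify, put, runState, state)
import Data.Functor.Const (Const (..))
import Data.Functor.Identity (Identity (..))

-- A leftovers buffer modeled as plain State
unDraw :: a -> State [a] ()
unDraw a = modify (a :)

drawLeftover :: State [a] (Maybe a)
drawLeftover = do
    lo <- get
    case lo of
        l : ls -> put ls >> return (Just l)
        []     -> return Nothing

-- A van Laarhoven lens, plus the two lenses from the post
type Lens s a = forall f. Functor f => (a -> f a) -> s -> f s

_fst :: Lens (a, b) a
_fst k (a, b) = fmap (\a' -> (a', b)) k a

_snd :: Lens (a, b) b
_snd k (a, b) = fmap (\b' -> (a, b')) k b

-- Run an action on the focus of a lens, writing the result back
zoom :: Lens s a -> State a r -> State s r
zoom l m = state $ \s ->
    let (r, a') = runState m (getConst (l Const s))
    in  (r, runIdentity (l (\_ -> Identity a') s))
```

<tt>zoom _fst</tt> and <tt>zoom _snd</tt> let two buffers of different types live side by side in one pair of leftovers, which is exactly how the composed pipes above share state.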
Moreover, you do not need to instrument streams to correctly forward values that you push back upstream, because <tt>StateP</tt> abstracts over that for you.<br /><br /><br /><h4>Nesting and delimiting parsers</h4><br />Like other streaming libraries, <tt>pipes-parse</tt> makes it very easy to run a parser on a subset of the stream. This was probably the #1 feature requested, followed shortly by...<br /><br /><br /><h4>Resumable parsing</h4><br /><tt>pipes-parse</tt> uses <tt>StateP</tt>, so if you want to interrupt parsing you can just use <tt>runStateK</tt> to return the current state of the leftovers for use in a later computation. Simple!<br /><br /><br /><h4>Perfect streaming</h4><br />One of the more advanced features to come out of the last wave of development was what I like to call "perfect streaming". This has a very specific meaning: grouping the input and interacting with each group as a stream without bringing more than one chunk into memory.<br /><br />For example, consider the following conduit: <pre><br />lines :: Monad m => Conduit ByteString m ByteString<br /></pre>This will load each line into memory, which means that if your file is one long line then you will load the entire file into memory, defeating the purpose of streaming! <tt>io-streams</tt> has the same problem, but, unlike <tt>conduit</tt>, <tt>io-streams</tt> can easily fix its <tt>lines</tt> utility to stream perfectly and I plan to show Greg how to do this so that <tt>io-streams</tt> users can benefit from the same trick.<br /><br /><tt>pipes-parse</tt> does not teach how to use this trick, but it does lay the groundwork for it and the upcoming <tt>pipes-bytestring</tt> library will provide examples of this idiom. 
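The shape of the trick is easiest to see on pure chunk lists first: a perfectly streaming <tt>lines</tt> hands back each line as a list of chunks and never concatenates them (<tt>lineChunks</tt> is a hypothetical illustration, not actual library code):

```haskell
-- Split a stream of chunks into lines *without* concatenating
-- chunks: each line is returned as a list of chunk fragments, so no
-- single line is ever assembled in memory.  Laziness means this
-- works even if one "line" spans an unbounded number of chunks.
lineChunks :: [String] -> [[String]]
lineChunks [] = []
lineChunks chunks = line : lineChunks rest
  where
    (line, rest) = breakLine chunks

    -- Chunks of the first line, plus the remaining chunks
    breakLine [] = ([], [])
    breakLine (c : cs) = case break (== '\n') c of
        (pre, '\n' : post) -> ([pre], if null post then cs else post : cs)
        (pre, _)           -> let (l, r) = breakLine cs in (pre : l, r)
```

Concatenating each inner list recovers the ordinary <tt>lines</tt> of the concatenated input, but no caller is forced to do so.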
If you want to see a concrete example of this trick in action, check out <a href="https://github.com/ocharles/pipes-tar/blob/master/src/Control/Proxy/Tar.hs">Oliver Charles's upcoming <tt>pipes-tar</tt> library</a> on Github to see a preview of this idiom, where he streams individual files from a TAR archive without ever loading more than one chunk in memory. His very interesting use case was the inspiration for this trick, and I also preview this idiom in this <a href="http://stackoverflow.com/a/16794672/1026598">Stack Overflow answer</a>.<br /><br />More generally, perfect streaming uses the <tt>respond</tt> category's composition operator, which has the following general type: <pre><br />(/>/) :: (Monad m, Proxy p)<br /> => (a -> p x' x b' b m a')<br /> -> (b -> p x' x c' c m b')<br /> -> (a -> p x' x c' c m a')<br /></pre>When you use <tt>respond</tt> composition, both pipes share the same upstream interface meaning that you can group the input into subsections but still allow each subsection to access the original upstream interface. 
With appropriate information hiding, you can set up pipes which behave like lenses to specific subsections of the stream and allow the user to stream from each subsection independently.<br /><br /><br /><h4>Compatibility</h4><br /><tt>pipes-parse</tt> takes great care to ensure that non-parsing pipes are completely interoperable with parsing pipes, thanks to the following compatibility functions: <pre><br />fmapPull<br /> :: (Monad m, Proxy p)<br /> => (x -> p x a x b m r)<br /> -> (x -> p x (Maybe a) x (Maybe b) m r)<br /><br />returnPull<br /> :: (Monad m, Proxy p)<br /> => x -> p x a x (Maybe a) m r<br /><br />bindPull<br /> :: (Monad m, Proxy p)<br /> => (x -> p x a x (Maybe b) m r)<br /> -> (x -> p x (Maybe a) x (Maybe b) m r)<br /></pre>These three functions define functors and monads in the category where the objects are the downstream components of each proxy interface and the morphisms are pull-based pipes.<br /><br />As you might guess, <tt>fmapPull</tt> satisfies the functor laws: <pre><br />fmapPull f >-> fmapPull g = fmapPull (f >-> g)<br /><br />fmapPull pull = pull<br /></pre>Similarly, <tt>returnPull</tt> and <tt>bindPull</tt> satisfy the monad laws: <pre><br />-- Using: f >>> g = f >-> bindPull g<br /><br />returnPull >>> f = f<br /><br />f >>> returnPull = f<br /><br />(f >>> g) >>> h = f >>> (g >>> h)<br /></pre>... which are equivalent to: <pre><br />returnPull >-> bindPull f = f<br /><br />bindPull returnPull = pull<br /><br />bindPull (f >-> bindPull g) = bindPull f >-> bindPull g<br /></pre>... and we can derive the functor from the monad: <pre><br />fmapPull f = bindPull (f >-> returnPull)<br /><br />-- i.e. fmap f = (>>= return . f)<br /></pre>These functions could be generalized beyond <tt>Maybe</tt> by parametrizing them with a type class like: <pre><br />class FunctorPull f where<br />    fmapPull<br />        :: (Monad m, Proxy p)<br />        => (x -> p x a x b m r)<br />        -> (x -> p x (f a) x (f b) m r)<br /></pre>... 
and there is a sensible instance for <tt>Either</tt>, too (in fact, that's how <tt>rightD</tt> from the <tt>pipes</tt> prelude works). However, I decided to keep them monomorphic for now, for simplicity.<br /><br /><br /><h4>Conclusion</h4><br /><tt>pipes-parse</tt>, like most <tt>pipes</tt> libraries, keeps the best spirit of Haskell programming by: <ul><li> composing features from smaller, simpler, and correct building blocks, <li> using higher-order functions to lift existing functions for compatibility, <li> isolating features from each other to statically prevent accidental complexity </ul><tt>pipes-parse</tt> is the last of the three core libraries, the other two being <tt>pipes-safe</tt> and <tt>pipes-concurrency</tt>. These libraries define the central idioms for the <tt>pipes</tt> ecosystem and they were all designed to be instructive and convention-setting in areas where there isn't a perfectly elegant solution and some pragmatic trade-offs had to be made.<br /><br />The completion of these libraries marks the point where I feel the core <tt>pipes</tt> API has proven itself to be sufficiently versatile and future-proof. The proxy transformer system makes the central API unusually stable because I don't need to bake in any new features that I want to add.<br /><br />This means I will be upgrading <tt>pipes</tt> to version <tt>4.0</tt> soon to mark the transition to a stabler API in preparation for eventual inclusion in the Haskell platform. Also, most development work will shift to derived libraries now.<br /><br />That does not mean that the derived libraries are complete yet. For example, I am currently writing up a <tt>pipes-safe-2.0.0</tt>, which will feature improved promptness guarantees and eliminate the need for unsafe finalization primitives. Similarly, I am about to release a <tt>pipes-concurrency-1.2.0</tt> at the end of this week, which will add broadcasts and continuous behaviors. 
More generally, I will only consider the derived core libraries to be mature when more code is built on top of them on the scale of what <tt>conduit</tt> has right now.<br /><br />The next library on the development docket is <tt>pipes-bytestring</tt>. Now that <tt>pipes-parse</tt> is complete, I feel much more comfortable about the stability of the <tt>pipes-bytestring</tt> API. Also, <tt>pipes</tt> now has <a href="mailto:haskell-pipes@googlegroups.com">an official mailing list</a> where you can ask questions, follow <tt>pipes</tt> development, or offer feedback and contribute to upcoming <tt>pipes</tt> libraries.Gabriel Gonzalezhttps://plus.google.com/100052931757670747819noreply@blogger.com0