Sunday, May 27, 2012

Conduit bugs

I want to preface this by saying that this post is not intended to be mean-spirited and I will offer some insights on how to fix these problems.

I often find that violations of type-class laws almost invariably lead to subtle bugs, which is why I go to so much effort to ground my pipes library in theoretically-derived type-classes and enforce their corresponding laws very strictly. In this post I'm going to illustrate the importance of this for conduit by showing several conduit bugs and demonstrating how they stem directly from type-class law violations. I based these bugs on conduit-0.4.2.

Conduits violate the Monad laws

... specifically, associativity:
printer = NeedInput (\a -> lift (print a) >> printer) (return ())

residue x = Done (Just x) ()
>>> yield 1 $$ printer
>>> yield 1 $$ (residue 2 >> residue 3) >> printer
>>> yield 1 $$ residue 2 >> (residue 3 >> printer)
The left-associative grouping accidentally drops residual input. There are two cases where conduits suffer from data loss, and one of them corresponds to violating the associativity law for monads.

The next example is a bigger problem:
mySink = NeedInput (\_ -> return ())
                   (lift $ putStrLn "Finalize Resource)
>>> yield 1 $$ (yield 1 >> (residue 3 >> mySink)) =$ return ()
-- No output
>>> yield 1 $$ ((yield >> residue 3) >> mySink) =$ return ()
Finalize Resource
The right-associative grouping accidentally drops the finalizer. Conduits don't guarantee deterministic finalization in certain corner cases, again because they violate the monad laws.

More generally, any conduit that sequences two buffered conduits is prone to the above two bugs. The source of both violations is the attempt to "push back" input by attaching residual input to the Done constructor. I can't yet offer a solution for this because I have not yet tackled the parsing problem, however removing this feature will at least prevent the above case of dropped finalizers.

Conduits violate the Category laws

I know that conduits do not advertise being a Category, but I investigated where conduits deviated from a Category and found more bugs in the process.

For example, conduits lack an identity conduit. This is apparent from the type of conduit composition:
(=$=) :: Monad m
 => Pipe a b m () <-- This needs to be fixed
 -> Pipe b c m r
 -> Pipe a c m r
The upstream conduit is constrained to return (), so you immediately lose the ability to return values from upstream like you can with pipes and you cannot form a proper upstream identity pipe. However, we could type-restrict composition to only return () and see if that forms a category, even if not necessarily as powerful as the pipes category:
(=$=) :: Monad m
 => Pipe a b m ()
 -> Pipe b c m ()
 -> Pipe a c m ()
The closest this composition comes to an identity conduit is:
idC = await >>= maybe (return ()) (\a -> yield a >> idC)
Unfortunately, idC does not serve as an identity conduit when you place it upstream of a conduit:
inject x p = case p of
    NeedInput _ p' -> yield x =$= p';
    p' -> p'

print2 = replicateM_ 2 (await >>= lift . print)
>>> yield 1 $$ inject 2 print2
Just 2
>>> yield 1 $$ inject 2 (idC $= print2)
This seemingly innocuous identity law violation actually represents a known subtle bug in conduits where if they are fed a Nothing followed by a Just x they break. Some conduit library code depends on this scenario never happening in order to guarantee safety and correctness. Michael already knows this, because he notes this in the comments of composition:
-- [...] However, it is not
-- recommended to give input to a pipe after it has
-- been told there is no more input. [...]
I find it interesting that despite not attempting to make conduits a category, the invariant Michael requires is precisely the invariant necessary to make the identity law work. This exemplifies how even when we don't program using theoretical constructs, our intuition for correct behavior exactly matches the laws for the theoretically-grounded classes.

The lack of a proper upstream identity is also a source of data loss. When you compose two pipes, the residual input for the downstream pipe is discarded under all circumstances, and there is no way to solve this without introducing a proper upstream identity conduit.

pipes has solved this issue, but the solution is precisely what necessitated the parametrized monad and I can discuss this with Michael if he is interested because I think he would not be as reluctant to use extensions to allow do notation for parametrized monads. The other advantage of switching to the pipes solution is that you gain the ability to finalize upstream early without terminating, which conduit does not presently have.

Conduits violate Monad Transformer laws

This is my fault. The first release of my library violated the monad transformer laws and while I've fixed it in version 2.0, conduit still has the original version found in pipes.

The original version violates both the identity and composition laws of monad transformers. You can see this if you just count the number of PipeM constructors generated:
count' n c = case c of
    Done i _ -> Done i n
    PipeM m h -> PipeM (liftM (count (n + 1)) m) h
    HaveOutput c' h o -> HaveOutput (count n c') h o
    NeedInput f c' -> NeedInput (fmap (count n) f) (count n c')

count = count' 0
-- lift (return x) /= return x
>>> runPipe $ count $ lift $ return (-99)
>>> runPipe $ count $ return (-99)

-- lift (m1 >> m2) /= lift m1 >> lift m2
>>> runPipe $ count $ lift $ return (-99) >> return (-99)
>>> runPipe $ count $ lift (return (-99)) >> lift (return (-99))
This is the reason I switched to using an approach based on a free monad transformer (i.e. FreeT), instead of using an ordinary free monad.

This makes the monad bind mandatory at each step, which actually leads to very little degradation of performance, so I had no problem making the switch. More importantly, though, it allows more powerful optimizations using rewrite rules than are currently possible when the monad is optional. For example, the following rewrite rule is safe once you have a correct instance for MonadTrans:
  "forever alone" forall m.  forever (lift m) = lift (forever m)
I will discuss rewrite rules later in a post about optimizing pipes and frames, but one of my big motivations for strictly enforcing all the class laws as strong as possible is that it then permits rewrites so powerful that all the pipe code completely disappears, leaving behind the hand-written loop code (think: stream fusion on steroids).

Other bugs

There are other bugs in conduits that I found in the process, mainly associated with registering finalizers. For example, conduits has a nasty habit of releasing resources multiple times, but this is hidden by the ResourceT machinery which ignores duplicate resource release requests.

However, this is quite easy for conduits to fix. All you do is remove the finalizer field from the PipeM constructor and have pipeClose ignore PipeM actions completely, only associating finalizers with HaveOutput. This fixes the issue of multiple finalizations and also fixes the issue of accidentally drawing one last chunk of data before finalizing if you aren't careful about writing the source.

Even if you fix that, though, there are still other finalization problems. For example, while Michael provides a mechanism for bidirectionally-safe finalization, he never uses it for sources and sinks, meaning that if you elevate them to conduits they won't finalize correctly, however this is easy to fix and he will know how to do it. He also never exposes any utility functions for bidirectionally-safe finalization that is an equivalent to the finallyP in pipes.


This post presents some examples where if you are close to a theoretically-grounded type-class but not quite there, the difference is most likely a bug. This means that you can identify bugs in a library rapidly just by examining where it approximates theoretical type-classes and then studying where they fail to observe the corresponding class laws.

No comments:

Post a Comment