Thursday, January 29, 2015

Use Haskell for shell scripting

Right now dynamic languages are popular in the scripting world, to the dismay of people who prefer statically typed languages for ease of maintenance.

Fortunately, Haskell is an excellent candidate for statically typed scripting for a few reasons:

  • Haskell has lightweight syntax and very little boilerplate
  • Haskell has global type inference, so all type annotations are optional
  • You can type-check and interpret Haskell scripts very rapidly
  • Haskell's function application syntax greatly resembles Bash

However, Haskell has had a poor "out-of-the-box" experience for a while, mainly due to:

  • Poor default types in the Prelude (specifically String and FilePath)
  • Useful scripting utilities being spread over a large number of libraries
  • Insufficient polish or attention to user experience (in my subjective opinion)

To solve this, I'm releasing the turtle library, which provides a slick and comprehensive interface for writing shell-like scripts in Haskell. I've also written a beginner-friendly tutorial targeted at people who don't know any Haskell.

Overview

turtle is a reimplementation of the Unix command line environment in Haskell. The best way to explain this is to show what a simple "turtle script" looks like:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = do
    cd "/tmp"
    mkdir "test"
    output "test/foo" "Hello, world!"  -- Write "Hello, world!" to "test/foo"
    stdout (input "test/foo")          -- Stream "test/foo" to stdout
    rm "test/foo"
    rmdir "test"
    sleep 1
    die "Urk!"

If you make the above file executable, you can then run the program directly as a script:

$ chmod u+x example.hs
$ ./example.hs
Hello, world!
example.hs: user error (Urk!)

The turtle library renames a lot of existing Haskell utilities to match their Unix counterparts and places them under one import. This lets you reuse your shell scripting knowledge to get up and going quickly.

Shell compatibility

You can easily invoke an external process or shell command using proc or shell:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = do
    mkdir "test"
    output "test/file.txt" "Hello!"
    proc "tar" ["czf", "test.tar.gz", "test"] empty

    -- or: shell "tar czf test.tar.gz test" empty

Even people unfamiliar with Haskell will probably understand what the above program does.
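
Both proc and shell hand back the command's exit code, so a script can react when an external command fails. The following is a minimal sketch, assuming the ExitCode constructors and the repr helper that Turtle re-exports; exact type signatures differ between turtle versions, so treat it as an illustration rather than a reference:

```haskell
#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = do
    -- proc runs the program directly (no shell) and returns its ExitCode
    code <- proc "tar" ["czf", "test.tar.gz", "test"] empty
    case code of
        ExitSuccess   -> return ()
        -- repr renders any Show-able value as Text
        ExitFailure n -> die ("tar failed with exit code: " <> repr n)
```

Ignoring the result, as the earlier script does, is also valid; the exit code is simply discarded.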

Portability

"turtle scripts" run on Windows, OS X and Linux. You can either compile scripts as native executables or interpret the scripts if you have the Haskell compiler installed.

Streaming

You can build or consume streaming sources. For example, here's how you print all descendants of the /usr/lib directory in constant memory:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import Turtle

main = view (lstree "/usr/lib")

... and here's how you count the number of descendants:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import qualified Control.Foldl as Fold
import Turtle

main = do
    n <- fold (lstree "/usr/lib") Fold.length
    print n

... and here's how you count the number of lines in all descendant files:

#!/usr/bin/env runhaskell

{-# LANGUAGE OverloadedStrings #-}

import qualified Control.Foldl as Fold
import Turtle

descendantLines = do
    file <- lstree "/usr/lib"
    True <- liftIO (testfile file)
    input file

main = do
    n <- fold descendantLines Fold.length
    print n
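
The `True <- liftIO (testfile file)` line in descendantLines acts as a filter: when the pattern match fails, that element is skipped and the stream moves on. Ordinary Haskell lists behave the same way, so the structure can be sketched with base alone. The entries table below is a hypothetical in-memory stand-in for lstree, testfile and input; this is an analogy for how the do block reads, not how turtle implements its streams:

```haskell
-- Hypothetical in-memory "file system": (path, is it a regular file?, contents)
entries :: [(FilePath, Bool, String)]
entries =
    [ ("/usr/lib/a.txt", True , "one\ntwo\n")
    , ("/usr/lib/sub"  , False, "")
    , ("/usr/lib/b.txt", True , "three\n")
    ]

-- Mirrors descendantLines: visit every entry, keep regular files, emit lines
descendantLines :: [String]
descendantLines = do
    (_path, isFile, contents) <- entries
    True <- [isFile]   -- a failed match yields [], which skips this entry
    lines contents

main :: IO ()
main = print (length descendantLines)  -- prints 3
```

Unlike this list version, the turtle script streams its elements instead of materializing them all in memory.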

Exception Safety

turtle ensures that all acquired resources are safely released in the face of exceptions. For example, if you acquire a temporary directory or file, turtle will ensure that it's safely deleted afterwards:

example = do
    dir <- using (mktempdir "/tmp" "test")
    liftIO (die "The temporary directory will still be deleted!")

However, exception safety comes at a price. turtle forces you to consume all streams in their entirety so you can't lazily consume just the initial portion of a stream. This was a tradeoff I chose to keep the API as simple as possible.
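
The guarantee described here has the same shape as bracket from Control.Exception in plain Haskell: acquire a resource, run a body, and release the resource no matter how the body exits. The sketch below uses only base, with a hypothetical in-memory log standing in for a real temporary directory. It illustrates the acquire/release pattern and is not turtle's implementation of using or mktempdir:

```haskell
import Control.Exception (ErrorCall (..), bracket, throwIO, try)
import Data.IORef (modifyIORef, newIORef, readIORef)

-- Run a failing action while holding a pretend resource and
-- return a log of the steps that actually happened.
runExample :: IO [String]
runExample = do
    logRef <- newIORef []
    let record msg = modifyIORef logRef (++ [msg])
    let body = bracket
            (record "acquire")                  -- acquire the resource
            (\_ -> record "release")            -- runs even if the body throws
            (\_ -> throwIO (ErrorCall "Urk!"))  -- the body fails
    result <- try body :: IO (Either ErrorCall ())
    case result of
        Left (ErrorCall msg) -> record ("caught: " ++ msg)
        Right ()             -> record "no exception"
    readIORef logRef

main :: IO ()
main = runExample >>= mapM_ putStrLn
```

The log shows "release" before the exception reaches the handler, which is the ordering you want for cleanup.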

Patterns

turtle supports Patterns, which are like improved regular expressions. Use Patterns as lightweight parsers to extract typed values from unstructured text:

$ ghci
>>> :set -XOverloadedStrings
>>> import Turtle
>>> data Pet = Cat | Dog deriving (Show)
>>> let pet = ("cat" *> return Cat) <|> ("dog" *> return Dog) :: Pattern Pet
>>> match pet "dog"
[Dog]
>>> match (pet `sepBy` ",") "cat,dog,cat"
[[Cat,Dog,Cat]]

You can also use Patterns as arguments to commands like sed, grep, and find, and they do the right thing:

>>> stdout (grep (prefix "c") "cat")             -- grep '^c'
cat
>>> stdout (grep (has ("c" <|> "d")) "dog")      -- grep 'c\|d'
dog
>>> stdout (sed (digit *> return "!") "ABC123")  -- sed 's/[[:digit:]]/!/g'
ABC!!!

Unlike many Haskell parsers, Patterns are fully backtracking, no exceptions.
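
To see why match returns a list and what "fully backtracking" buys you, here is a small base-only model of a list-of-successes parser. The MiniPattern type and the text and matchAll helpers are hypothetical names invented for this sketch; it is a simplified model over String, not turtle's Pattern type:

```haskell
import Control.Applicative (Alternative (..))
import Data.List (stripPrefix)

-- A pattern maps the input to every possible (result, leftover) pair
newtype MiniPattern a = MiniPattern { runMini :: String -> [(a, String)] }

instance Functor MiniPattern where
    fmap f (MiniPattern p) =
        MiniPattern (\str -> [ (f a, rest) | (a, rest) <- p str ])

instance Applicative MiniPattern where
    pure a = MiniPattern (\str -> [(a, str)])
    MiniPattern pf <*> MiniPattern pa = MiniPattern (\str ->
        [ (f a, rest') | (f, rest) <- pf str, (a, rest') <- pa rest ])

instance Alternative MiniPattern where
    empty = MiniPattern (const [])
    -- Keep the results of BOTH branches so a later failure can fall back
    MiniPattern p <|> MiniPattern q = MiniPattern (\str -> p str ++ q str)

-- Match a literal string at the start of the input
text :: String -> MiniPattern String
text lit = MiniPattern (\str -> [ (lit, rest) | Just rest <- [stripPrefix lit str] ])

-- Return every result that consumes the whole input
matchAll :: MiniPattern a -> String -> [a]
matchAll p str = [ a | (a, "") <- runMini p str ]

data Pet = Cat | Dog deriving (Eq, Show)

pet :: MiniPattern Pet
pet = (Cat <$ text "cat") <|> (Dog <$ text "dog")

main :: IO ()
main = do
    print (matchAll pet "dog")
    -- "a" matches first but then "c" fails, so the "ab" branch is tried
    print (matchAll ((text "a" <|> text "ab") *> text "c") "abc")
```

A parser that commits to the first successful alternative would reject "abc" in the second example; keeping every alternative alive is what lets the match succeed.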

Formatting

turtle supports typed printf-style string formatting:

>>> format ("I take "%d%" "%s%" arguments") 2 "typed"
"I take 2 typed arguments"

turtle even infers the number and types of arguments from the format string:

>>> :type format ("I take "%d%" "%s%" arguments")
format ("I take "%d%" "%s%" arguments") :: Int -> Text -> Text

This uses a simplified version of the Format type from the formatting library. Credit to Chris Done for the great idea.

The reason I didn't reuse the formatting library was that I spent a lot of effort keeping the types as simple as possible to improve error messages and inferred types.
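
For readers curious how a format string can determine the number and types of its arguments, here is a base-only sketch of the continuation trick behind this style of formatting. It uses String instead of Text, and the lit helper and runFormat field are hypothetical names for this illustration (turtle relies on OverloadedStrings so that plain string literals work as formats). Treat it as a simplified model rather than the library's code:

```haskell
-- A format takes a continuation expecting the rendered string and
-- returns a function that first collects any arguments it needs.
newtype Format a b = Format { runFormat :: (String -> a) -> b }

-- Concatenate two formats; the arguments of both are collected in order
(%) :: Format b c -> Format a b -> Format a c
f % g = Format (\k -> runFormat f (\s1 -> runFormat g (\s2 -> k (s1 ++ s2))))

-- A literal chunk of text: needs no arguments
lit :: String -> Format r r
lit str = Format (\k -> k str)

-- A decimal placeholder: adds one Int argument
d :: Format r (Int -> r)
d = Format (\k n -> k (show n))

-- A string placeholder: adds one String argument
s :: Format r (String -> r)
s = Format (\k str -> k str)

-- Finish with the identity continuation to get the rendered string
format :: Format String r -> r
format f = runFormat f id

main :: IO ()
main = putStrLn (format (lit "I take " % d % lit " " % s % lit " arguments") 2 "typed")
```

In this model the example format has type Format String (Int -> String -> String), which is why the compiler can report the argument list without running anything.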

Learn more

turtle doesn't try to ambitiously reinvent shell scripting. Instead, turtle just strives to be a "better Bash". Embedding shell scripts in Haskell gives you the benefits of easy refactoring and basic sanity checking for your scripts.

You can find the turtle library on Hackage or GitHub. Also, turtle provides an extensive beginner-friendly tutorial targeted at people who don't know any Haskell at all.

22 comments:

  1. Cool. :-) Looks very attractive, newbies and scripters alike should like it. Is there a story for piping like in shell-conduit?

    I made a major version bump to formatting (6.2.0) to include this simplification, I'd been meaning to drop the Holey type for a while after it became clear abstracting over the particular monoid wasn't useful. Oh, I discovered this nifty Monoid instance recently, check it out: https://github.com/chrisdone/formatting#using-more-than-one-formatter-on-the-same-argument

  2. This isn't a meaningful alternative to bash until you show equivalents to at least these bash operators:

    | (pipe)
    & (background job) and wait
    <() (create fifo pulling input from subshell)

    Most shell scripts are gluing together other commands written in a variety of different languages. It's an orchestration language with simple job control and trivial parallelization via pipes.

    At best, this currently seems to be a replacement for DOS's batch interpreter, which is scarcely even a scripting language.

    Replies
    1. <() is basically function application.

      | and & are provided by laziness.

    2. This comment has been removed by the author.

    3. I'm sure the author of Haskell's famed pipes library could provide an implementation of `|` :) As for `&`, I would do that with the async library in a way that, in my opinion, is very readable:

      main = do
          a <- async procA
          b <- async procB
          wait a
          print "At least A is done"
          wait b
          print "Both A and B are done"

      There's probably a way to abstract things so `&` works as expected.

      I can't comment on <(), don't know Bash well enough.

    4. `inshell` and `inproc` are how you embed external commands as streams within your program. Here's a contrived example:

      > stdout (inproc "cat" [] stdin)

      That will take `stdin`, divert its stream through the external `cat` command, and then return the stream back to Haskell-land where it can be consumed by `stdout`. You can chain as many of these commands as you want in a row this way:

      > stdout (inproc "cmd2" ["a", "b"] (inproc "cmd1" ["x", "y"] stdin))

      That would be analogous to:

      > cmd1 x y | cmd2 a b

      So the precise answer to "What is `|`?" is "function application", since you are just chaining functions that transform streams. These functions all have type:

      Shell Text -> Shell Text

      The answer to "What is `&`?" is the `fork` command from `Turtle.Prelude`. That lets you fork a background process and it returns a reference to the process (i.e. an `Async`) that you can use to manage it:

      fork :: IO a -> Managed (Async a)

      The answer to "What is `<()`?" is just "function application" again. In turtle you don't pass streams by reference, but rather by value. That means there is no need for process substitution in the first place. I can just pass the lazy stream directly as an argument since it's a first-class value in Haskell.

  3. Great idea! I applaud the effort. I as well would request pipe as it's a mandatory feature.

  4. What if we had a quasi-quoter from bash syntax to turtle? It wouldn't help for the interpreted case, but it would make writing compiled scripts dead easy, since your target audience presumably already knows that syntax. Then they'd have the power of Haskell within reach when they were ready for it.

  5. Why are some of turtle's functions IO actions and some Shell actions? How can they be unified? How about putting everything in Shell and writing `main = sh $ do ...`?

  6. I actually did consider wrapping all `IO` actions in `liftIO`. There are actually two ways to do this:

    A) Use the inferred type of:

    > MonadIO m => m r

    B) Specialize the type to `Shell`:

    > Shell r

    There were two problems with approach `A`:

    * I wanted concrete types to teach users with instead of type classes. One of the design constraints of the library was that it had to be teachable to absolute Haskell beginners.
    * It's a leaky abstraction. Not every `IO` action in the Haskell ecosystem is wrapped in `liftIO`, so the moment they deviate from the turtle library they are suddenly hit with `liftIO` anyway.

    There were a few problems with approach `B`:

    * There's no function of type `Shell a -> IO a`. The best you can do is `Shell a -> IO (Maybe a)` or `Shell a -> IO [a]`. That means that once you wrap an `IO` action in `Shell` the result can no longer be reused within a naked `IO` block.
    * This actually makes the teaching process rougher on new Haskell programmers because I don't have any useful `IO` actions within the library to use as examples to explain how `IO` works.

    Generally there is a tension between the "principle of least permission" (giving things the narrowest types possible) and the proliferation of too many types. I decided that I could live with teaching new users the distinction between `IO` and `Shell` (since they'll have to learn it anyway if they want to use other Haskell libraries).

    I have a lot more to say on this subject, but a good starting point is my post on the Functor Design Pattern (http://www.haskellforall.com/2012/09/the-functor-design-pattern.html), which talks about how I prefer to specialize components as much as possible and delay unifying their types until the very last moment. Vice versa, I discourage pre-unifying everything into a monolithic component framework.

    Replies
    1. OK. So how about getting rid of the Shell monad? When something like `ls` multiplies the continuation, it is counter-intuitive. Yes, I know about the list monad, but newcomers don't, and `ls` in Bash doesn't multiply the rest of a script. Maybe `IO (Stream a)` fits best?

    2. I don't think the list monad is counter-intuitive to beginners. Most of these people will be familiar with for loops or Python's list comprehensions, so they can easily make the connection. I haven't had anybody complain that they actually found it counter-intuitive. Quite the opposite: Python programmers have complimented that it's easier to use than Python's API where you have to pick between easy and non-streaming or complex and streaming.

    3. The list monad is not counter-intuitive for Pythonistas. Implicit loops are weird for Bash scripters.

  7. This comment has been removed by the author.

  8. One of the main reasons I don't use Haskell for shell scripting is that I like having standalone scripts that work out-of-the-box. I often write shell scripts for things like configuration and installation where you can't assume anything but a minimal POSIX system. Using Haskell for this is troublesome because GHC needs to be installed first, whereas Bourne shell is available on all Unix flavors. This would be fine for doing Haskell-related tasks, but asking a user to install GHC just to do something entirely unrelated is a bit too much.

    I think there is a way to get the best of both worlds though, if there is some sort of Haskell-to-shell compiler. It doesn't even have to be fancy … an EDSL would probably work.

    Replies
    1. But then you would need that Haskell-to-shell compiler to be installed on your system. Why not just install GHC instead of that Haskell-to-shell compiler?

      Also, keep in mind that you can compile an executable that you can distribute to users if they don't have GHC installed.

      The more long-term solution to this problem is to petition your distro maintainers to include GHC by default with the distro.

  9. Gabriel, you just made my day. I was using shelly for some scripting tasks, but I like your library much more. I've been playing with it for the last few hours and I already love it :-) Thank you!

  10. Dear Gabriel,

    I would like to download some files in parallel but can't get it working. The code works fine without the "T.sh . T.using . T.fork" part. With it, no separate processes are spawned. I don't know how to fix the problem. Can you help me out?

    Thanks.

    Source code is here:
    https://gist.github.com/schmidh/fa6112719e08c626db44

    Replies
    1. `fork` automatically cancels the thread once the `Managed` computation is done, so what your code was doing was creating the thread and then immediately canceling it afterwards.

      The most direct way to do what you want is to use the `async` library that `turtle` is built on top of. It provides the `mapConcurrently` function:

      http://hackage.haskell.org/package/async-2.0.2/docs/Control-Concurrent-Async.html#v:mapConcurrently

      ... which you can use like this:

      mapConcurrently (\vid -> shellStrict ...) (videos vids)

      ... and that will run all the commands in parallel.

      You can do this with `turtle`'s `fork` but it would be trickier. You would have to do something like this:

      loop :: [Video] -> Managed ()
      loop (vid:vids) = do
          async <- using (fork (shell ... vid ...))
          loop vids
          liftIO (wait async)
      loop [] = return ()

      ... and then that would make sure that they all get fired off in parallel and the computation waits for the result.

      I should also update the documentation for `fork` to explain the auto-cancellation policy when the current thread is done. It's internally implemented in terms of `withAsync` from the `async` library which does the same thing.

    2. Hi Gabriel,

      thanks a lot!!!
