Saturday, May 21, 2016

A command-line benchmark tool

I wrote a small program named bench that lets you benchmark other programs from the command line. Think of this as a much nicer alternative to the time command.

The best way to illustrate how this works is to show a few example uses of the program:

$ bench 'ls /usr/bin | wc -l'  # You can benchmark shell pipelines
benchmarking ls /usr/bin | wc -l
time                 6.756 ms   (6.409 ms .. 7.059 ms)
                     0.988 R²   (0.980 R² .. 0.995 R²)
mean                 7.590 ms   (7.173 ms .. 8.526 ms)
std dev              1.685 ms   (859.0 μs .. 2.582 ms)
variance introduced by outliers: 88% (severely inflated)

$ bench 'sleep 1'  # You have to quote multiple tokens
benchmarking sleep 1
time                 1.003 s    (1.003 s .. 1.003 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 1.003 s    (1.003 s .. 1.003 s)
std dev              65.86 μs   (0.0 s .. 68.91 μs)
variance introduced by outliers: 19% (moderately inflated)

$ bench true  # The benchmark overhead is below 1 ms
benchmarking true
time                 383.9 μs   (368.6 μs .. 403.4 μs)
                     0.982 R²   (0.971 R² .. 0.991 R²)
mean                 401.1 μs   (386.9 μs .. 418.9 μs)
std dev              54.39 μs   (41.70 μs .. 67.62 μs)
variance introduced by outliers: 87% (severely inflated)

This utility just provides a command-line API for Haskell's criterion benchmarking library. The bench tool wraps any shell command you provide in a subprocess and benchmarks that subprocess through repeated runs using criterion. The number of runs varies from 10 to 10,000 depending on how expensive the command is.
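
To give a sense of what this wrapping looks like, here is a minimal sketch in Haskell (not the actual bench source; just the core idea, assuming only the public APIs of the criterion and process libraries) that benchmarks one hard-coded pipeline:

-- A minimal sketch, not the real implementation: spawn a subprocess
-- for each iteration and let criterion handle the timing and statistics
import Criterion.Main (bench, defaultMain, whnfIO)
import System.Process (callCommand)

main :: IO ()
main = defaultMain
    [ bench "ls /usr/bin | wc -l"
        (whnfIO (callCommand "ls /usr/bin | wc -l")) ]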

This tool also threads through the same command-line options that criterion accepts for benchmark suites. You can see the full set of options using the --help flag:

$ bench --help
Command-line tool to benchmark other programs

Usage: bench COMMAND ([-I|--ci CI] [-G|--no-gc] [-L|--time-limit SECS]
             [--resamples COUNT] [--regress RESP:PRED..] [--raw FILE]
             [-o|--output FILE] [--csv FILE] [--junit FILE]
             [-v|--verbosity LEVEL] [-t|--template FILE] [-m|--match MATCH]
             [NAME...] | [-n|--iters ITERS] [-m|--match MATCH] [NAME...] |
             [-l|--list] | [--version])

Available options:
  -h,--help                Show this help text
  COMMAND                  The command line to benchmark
  -I,--ci CI               Confidence interval
  -G,--no-gc               Do not collect garbage between iterations
  -L,--time-limit SECS     Time limit to run a benchmark
  --resamples COUNT        Number of bootstrap resamples to perform
  --regress RESP:PRED..    Regressions to perform
  --raw FILE               File to write raw data to
  -o,--output FILE         File to write report to
  --csv FILE               File to write CSV summary to
  --junit FILE             File to write JUnit summary to
  -v,--verbosity LEVEL     Verbosity level
  -t,--template FILE       Template to use for report
  -m,--match MATCH         How to match benchmark names ("prefix" or "glob")
  -n,--iters ITERS         Run benchmarks, don't analyse
  -m,--match MATCH         How to match benchmark names ("prefix" or "glob")
  -l,--list                List benchmarks
  --version                Show version info
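
These options compose just as they would for a criterion benchmark suite. For instance, a hypothetical invocation that writes a CSV summary and requests more bootstrap resamples looks like this:

$ bench 'ls /usr/bin | wc -l' --resamples 10000 --csv summary.csv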

The --output option is really useful: it outputs an HTML page with a chart showing the distribution of run times. For example, the following command:

$ bench 'ls /usr/bin | wc -l' --output example.html
benchmarking ls /usr/bin | wc -l
time                 6.716 ms   (6.645 ms .. 6.807 ms)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 7.005 ms   (6.897 ms .. 7.251 ms)
std dev              462.0 μs   (199.3 μs .. 809.2 μs)
variance introduced by outliers: 37% (moderately inflated)

... also produces something like the following chart, which you can view in example.html:

[Chart: distribution of run times]

You can also increase the time limit using the --time-limit option, which will in turn increase the number of runs for better statistics. For example, criterion warned me that I had too many outliers for my benchmarks, so I increased the time limit for the above benchmark to 30 seconds:

$ bench 'ls /usr/bin | wc -l' --time-limit 30 --output example.html
benchmarking ls /usr/bin | wc -l
time                 6.937 ms   (6.898 ms .. 7.002 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 6.905 ms   (6.878 ms .. 6.935 ms)
std dev              114.9 μs   (86.59 μs .. 156.1 μs)

... which dials up the number of runs to the ~4000 range, reduces the number of outliers, and brings down the standard deviation by a factor of four:

[Chart: distribution of run times with the 30 second time limit]

Keep in mind that there are a few limitations to this tool:

  • this tool cannot accurately benchmark code that requires a warm-up phase (such as JVM programs that depend on JIT compilation for performance)
  • this tool cannot measure performance below about half a millisecond, due to the overhead of launching a subprocess and a bash interpreter

Despite those limitations, I find that this tool comes in handy in a few scenarios:

  • Preliminary benchmarking in the prototyping phase of program development
  • Benchmarking program pipelines written in multiple languages (see the example below)
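
For example, a pipeline that mixes Python and standard Unix tools (a made-up pipeline, purely for illustration) benchmarks the same way as anything else:

$ bench 'python3 -c "print(42)" | grep 42'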

You can install this tool by following the instructions in the GitHub repo.

Or if you have the Haskell stack tool installed you can just run:

$ stack update
$ stack install bench

Comments:

  1. Very nice! I've recently been benchmarking a load of programs, which plug together as pipelines in this way.

    I ended up making something which looks like a clunky version of this 'bench' command (although using environment variables rather than arguments), and a wrapper which allows toggling between that and 'time' (which allows much faster feedback for long running commands).

    With this, it looks like I can enjoy the most satisfying part of programming: deleting code which is no longer needed!

  2. Very nice, thank you. I use https://github.com/simonmichael/hledger/blob/master/tools/simplebench.hs for displaying comparative benchmarks (and for obtaining rough measurements quickly, when criterion's many iterations would be overkill). I wonder if it would fit in bench.

    Reply:

      Yeah, I think it makes a lot of sense to support multiple benchmarks.
  3. This page has some issues for mobile users; the code blocks are too wide for mobile phone screens, and the page does not allow you to scroll sidewise. Just a heads up. :)
