Monday, December 19, 2022

Nixpkgs support for incremental Haskell builds

incremental

The context for this post is that at work I recently implemented Nix ecosystem support for “incrementally” building Haskell packages. By “incrementally” I mean that these Nix builds only need to build what changed since the last full build of the package so that the package doesn’t need to be built from scratch every time.

The pull requests implementing this feature have not yet been approved or merged at the time of this writing, but I figured that I would explain the motivation, design, results, and limitations of this work to hopefully persuade people that this work should be merged.

If you’re not interested in the design then you can skip straight to the Demo section below.

Background

I work on Mercury’s Backend Development User Experience team and we support developers contributing to a large Haskell monolith consisting of 3000+ modules. That may seem like a lot but the vast majority of these modules are small and the whole codebase takes ~14 minutes to compile in CI if we disable optimizations (although we still build with optimizations enabled for deployment).

In my experience, that’s pretty good for a Haskell project of this size, thanks not only to the work of our team but also other teams who also contribute to improving the development experience. In fact, the pioneering work for this “incremental builds” feature actually originated from two engineers outside our team.

First, Harry Garrood improved GHC’s change detection algorithm so that GHC would use the hash of the file to detect changes instead of using the timestamp. In this post he explains how you can make use of this to implement incremental builds for traditional CI services (e.g. GitHub actions) where each build reuses the intermediate build products from the prior build instead of building from scratch.

That alone would not be enough for us to use this at work since we use Nix where this sort of build impurity doesn’t fly. However, Harry and Jade Lovelace prototyped using this feature in Nixpkgs so that Nix builds of Haskell packages could also reuse intermediate build products from prior builds to save work. You can find their prototype here.

The basic idea behind the prototype Nixpkgs integration is that you split a Haskell package build into two separate builds:

  • A “full build” that builds the Haskell package from scratch

    This full build exports its intermediate build products (i.e. the dist directory) which can then be reused by:

  • An “incremental build” that only builds what changed since the full build

    This incremental build imports the intermediate build products from the corresponding full build so that it doesn’t have to build the package from scratch.

So you might wonder: if that was already implemented then what work still remained for me to do?

Problem

The main issue with the initial Nixpkgs integration is that it does not provide any support for selecting which Git revision to use as the basis for the full build. The existing solutions require some out-of-band process to automatically select and lock the appropriate git revision to use for the older (full) build.

Non-solution #0: Rolling rebuilds

The first non-solution is for each revision to always reuse the build products from the previous revision. This doesn’t work well with Nix because it would create an increasingly-long chain of dependent derivations; in order to build the most recent revision you’d have to build all preceding revisions.

The dilemma here is that Nix is forcing us to confront something that other build tools gloss over: if you’re always reusing build products from the last build then you can’t accurately reproduce the most recent build from scratch without reproducing all prior builds. You’ve essentially “contaminated” the current build with all prior builds by doing things in this way.

So what we really want is something more like this:

Periodically do a full build from scratch and then make each incremental build relative to the last full rebuild.

That’s much more compatible with Nix because then we only need to do two builds of our project if we rebuild things from scratch, instead of one build for every revision in our project’s history.

There’s also another issue with rolling rebuilds when you’re not using Nix, which is that most naïve attempts to do this don’t ensure that the starting build products came from the parent commit. You can end up with contamination of build products across branches if you’re not careful, which further complicates reproducibility.

Non-solution #1: Lockfile

Okay, so suppose you periodically do a full build of the project from scratch and then each incremental build is relative to the last full build. You would need to do a full rebuild frequently enough so that the incremental builds stay quick. If you wait too long in between full rebuilds then the project will evolve to the point where the incremental builds can no longer reuse most of the build products from the last full build and in the extreme case the incremental builds degenerate into full builds if they can’t reuse any old build products.

For example, at our work we currently do a full build of our large package once a day, so we need some way to update the full build to point to the last revision from the preceding day.

One existing approach to solving this involved using Nix flakes to manage the git revision for the older build. The idea is that you periodically run nix flake update to update the revision used for the full build and you might even automate this process by having some recurring cron job generate a pull request or commit to bump this revision on the main development branch. You don’t have to use flakes for this purpose, but flakes are probably the most ergonomic solution along these lines.

However, there are a few issues with this approach:

  • It only works well for short-lived pull requests

    In other words, if you update the revision used for the full build once a day then typically only pull requests that are less than a day old will benefit from incremental builds.

    Specifically, what we’d really like is “branch-local” incremental builds. In other words if a longer-lived development branch were to deposit a few commits a day we’d like there to be a full rebuild once a day on that branch so that incremental builds against the tip of that development branch remain snappy.

  • It pollutes the git history

    If you bump the lockfile, say, once per day then that’s one junk commit that you’ve added to your git history every day.

  • It’s difficult to open source any useful automation around this

    If the solution requires out-of-band machinery (e.g. some recurring cron job) to bump the lockfile you can’t provide a great user experience for open source projects. It only really works well for proprietary projects that can tolerate that complexity.

That last point was the most important one for me. Generally, when I design something (even something intended for internal, proprietary use) I try to design it in such a way that it works well in an open source context, too. In my experience, doing things in this way tends to improve the design, quality, and user experience of software that I build.

In particular, I wanted a solution where all the automation could be implemented entirely within the Nix language. However, this is not possible in Nix’s present form!

Non-solution #2: Rollback derivation

So what I really wanted was a Nix function (which I will call “truncate”) that would take any git repository and roll it back in time to the last commit before some repeating time boundary (where the time boundary might be, say, an hour, or day, or week). For simplicity, let’s just say that the desired time interval is one day so I want to roll back the repository to the last revision from the day before.

If I had such a truncate function then it would be easy to automatically select which revision to use for the full build. I would:

  • extract the source git repository from the current Haskell package build

  • truncate that git repository to the last revision from the day before

  • Use that “truncated” revision as the source for the full build

  • Use that full build as the input to the current (incremental) build

Then if I built multiple revisions for the same day they would all share the same full build since they would all get “truncated” to the same revision from the previous day.

However, there isn’t a great way to implement this truncate function in Nix. To see why, consider the following (wrong) solution:

  • extract the source git repository from the current Haskell package build

    Let’s call the derivation for this git repository “src

  • create a new Nix derivation (“src2”) that rolls back src

    In other words, this would be a trivial Nix derivation that begins from src and runs something like:

    $ git checkout $(git rev-list -1 --before '1 day ago' HEAD)

    … and stores that as the result

  • Use src2 as the input to the full build

Do you see the problem with that approach?

The above wrong solution doesn’t allow multiple incremental builds from the same day to share the same full build from the prior day. This is because src2 depends on src and since each incremental build has a different src repository then each also have a different src2 derivation and therefore a different full build. That in turn defeats the purpose of incremental builds if we have to do a new full rebuild for each incremental build.

For this to work we would need a way to roll back a git repository to an older revision that less sensitive to the current revision.

Non-solution #3: Plain fetchGit

The builtins.fetchGit utility almost does what we want! This primitive function lets you fetch a git repository at evaluation time, like this:

nix-repl> builtins.fetchGit { url = ~/proj/turtle; revision = "837f52d2101368bc075d382774460a717904d2ab"; }
{ lastModified = 1655501878; lastModifiedDate = "20220617213758"; narHash = "sha256-Ic4N2gzm0hYsPCynkzETJv7lpAWO1KM+FO+r3ov60y0="; outPath = "/nix/store/ygznanxv6rmbxw5gkgk7axfxazhsa93z-source"; rev = "837f52d2101368bc075d382774460a717904d2ab"; revCount = 566; shortRev = "837f52d"; submodules = false; }

The above result is the same no matter what revision I currently have checked out at ~/proj/turtle because Nix’s fetchGit function produces a content-addressed derivation. In other words, if two invocations of fetchGit generate the same final repository state then they share the same outPath. This is exactly the behavior we want: we need the source repository for the full build to be content-addressed so that multiple incremental builds can share the same full build.

However, the problem is that I don’t exactly know which revision I want. What I really want to be able to say is “get me the last revision from the day before this other revision”. fetchGit does not expose any way to do something like that.

That brings us to the actual solution:

Solution

The solution I went with was the following two pull requests:

  • Add optional date argument to builtins.fetchGit

    This amends builtins.fetchGit to allow a date specification, which can either be a relative date (e.g. 1 day ago) or an absolute date (e.g. 2020-01-01T00:00:00 or a Unix timestamp like 1671388622). Basically, this argument accepts anything git accepts as a date specification (which is a lot since git is pretty flexible in this regard).

    The cool thing about this change is that it doesn’t compromise the purity of builtins.fetchGit. If a given fetchGit specification was pure then adding a date specification preserves that purity.

  • Add haskell.lib.incremental utility

    This pull request actually does two separate things:

    • This polishes and upstreams the prototype support for incremental builds

      In other words, this upstreams Harry and Jade’s work to split a Haskell build into two builds: a full build and incremental build

    • This uses the fetchGit patch to automate the full build selection

      There’s a new pkgs.haskell.lib.incremental utility which uses builtins.fetchGit to automatically update the full build for you and it has all the desired behaviors (including branch-local incrementalism).

    I could have split this into two separate pull request (and I still might) but for internal testing purposes it was easier to do everything on one branch. I’m waiting for a decision on the other pull request before deciding whether or not to split up this branch.

Demo

I’ll use my turtle package as the running example for the demo. If you clone the gabriella/incremental branch of my turtle repository:

$ git clone --branch gabriella/incremental \
    https://github.com/Gabriella439/turtle.git
$ cd turtle

… you’ll find the following default.nix file making use of the Nixpkgs support for incremental Haskell builds:

{ interval ? 24 * 60 * 60 }:

let
  nixpkgs = builtins.fetchTarball {
    url    = "https://github.com/MercuryTechnologies/nixpkgs/archive/696e0820b03e8ea7ad6a9ba21a00a79c91efc580.tar.gz";
    sha256 = "1k3swii3absl154154lmk6zjw11vzzqx8skaiw1250armgfyv9v8";
  };

  # We need GHC 9.4 or newer for this feature to work
  compiler ="ghc94";

  overlay = self: super: {
    haskell = super.haskell // {
      packages = super.haskell.packages // {
        "${compiler}" =
          super.haskell.packages."${compiler}".override (old: {
            overrides =
              self.lib.fold
                self.lib.composeExtensions
                (old.overrides or (_: _: { }))
                [ (self.haskell.lib.packageSourceOverrides {
                    turtle = ./.;
                  })

                  (hself: hsuper: {
                    turtle-incremental =
                      self.haskell.lib.compose.incremental
                        { inherit interval;

                          makePreviousBuild =
                            truncate: (import (truncate ./.) { }).turtle;
                        }
                        hsuper.turtle;
                  })
                ];
          });
      };
    };
  };

  pkgs = import nixpkgs { config = { }; overlays = [ overlay ]; };

in
  { inherit (pkgs.haskell.packages."${compiler}")
      turtle
      turtle-incremental
    ;
  }

However, that alone is not enough to make use of incremental builds. If you attempt to build that (at the time of this writing) you’ll get an error message like this:

$ nix build --file ./default.nix turtle-incremental
error: evaluation aborted with the following error message:
'pkgs.haskell.lib.incremental requires Nix version 2.12.0pre20221128_32c182b or
newer'
(use '--show-trace' to show detailed location information)

The Nixpkgs support for incremental builds depends on a matching change to the Nix interpreter, so you actually have to run:

$ nix run github:Gabriella439/nix/gabriella/fetchGit -- \
    build --file ./default.nix turtle-incremental

… or if you don’t yet have flakes enabled, then use this pedantically complete command:

$ nix --option extra-experimental-features 'nix-command flakes' \
    run github:Gabriella439/nix/gabriella/fetchGit -- \
    build --file ./default.nix turtle-incremental

… and that will definitely work.

Once the build is complete you can inspect the logs and you should see something like the following buildPhase:

$ nix log ./result
…
@nix { "action": "setPhase", "phase": "buildPhase" }
building
Preprocessing library for turtle-1.6.1..
Building library for turtle-1.6.1..
Preprocessing test suite 'regression-broken-pipe' for turtle-1.6.1..
Building test suite 'regression-broken-pipe' for turtle-1.6.1..
[2 of 2] Linking dist/build/regression-broken-pipe/regression-broken-pipe [Libr>
Preprocessing test suite 'regression-masking-exception' for turtle-1.6.1..
Building test suite 'regression-masking-exception' for turtle-1.6.1..
[2 of 2] Linking dist/build/regression-masking-exception/regression-masking-exc>
Preprocessing test suite 'tests' for turtle-1.6.1..
Building test suite 'tests' for turtle-1.6.1..
[2 of 2] Linking dist/build/tests/tests [Library changed]
Preprocessing test suite 'system-filepath-tests' for turtle-1.6.1..
Building test suite 'system-filepath-tests' for turtle-1.6.1..
[2 of 2] Linking dist/build/system-filepath-tests/system-filepath-tests [Librar>
Preprocessing test suite 'cptree' for turtle-1.6.1..
Building test suite 'cptree' for turtle-1.6.1..
[2 of 2] Linking dist/build/cptree/cptree [Library changed]
…

This is shows that the incremental builds are indeed working. We still have to re-link some executables (for reasons that are still not clear to me), but none of the Haskell modules needed to be rebuilt since nothing has changed (yet) since the last rebuild.

Now let’s test that by making a small whitespace change to one of the Turtle modules:

$ echo >> src/Turtle/Prelude.hs 

Then if we rebuild the package we’ll see the following build phase:

$ nix --option extra-experimental-features 'nix-command flakes' \
    run github:Gabriella439/nix/gabriella/fetchGit -- \
    build --file ./default.nix --print-build-logs
…
turtle> building
turtle> Preprocessing library for turtle-1.6.1..
turtle> Building library for turtle-1.6.1..
turtle> [ 7 of 10] Compiling Turtle.Prelude   ( src/Turtle/Prelude.hs, dist/build/Turtle/Prelude.o, dist/build/Turtle/Prelude.dyn_o ) [Source file changed]
turtle> src/Turtle/Prelude.hs:319:1: warning: [-Wunused-imports]
turtle>     The import of ‘Data.Monoid’ is redundant
turtle>       except perhaps to import instances from ‘Data.Monoid’
turtle>     To import instances alone, use: import Data.Monoid()
turtle>     |
turtle> 319 | import Data.Monoid ((<>))
turtle>     | ^^^^^^^^^^^^^^^^^^^^^^^^^
turtle> Preprocessing test suite 'regression-broken-pipe' for turtle-1.6.1..
turtle> Building test suite 'regression-broken-pipe' for turtle-1.6.1..
turtle> [2 of 2] Linking dist/build/regression-broken-pipe/regression-broken-pipe [Library changed]
turtle> Preprocessing test suite 'regression-masking-exception' for turtle-1.6.1..
turtle> Building test suite 'regression-masking-exception' for turtle-1.6.1..
turtle> [2 of 2] Linking dist/build/regression-masking-exception/regression-masking-exception [Library changed]
turtle> Preprocessing test suite 'tests' for turtle-1.6.1..
turtle> Building test suite 'tests' for turtle-1.6.1..
turtle> [2 of 2] Linking dist/build/tests/tests [Library changed]
turtle> Preprocessing test suite 'system-filepath-tests' for turtle-1.6.1..
turtle> Building test suite 'system-filepath-tests' for turtle-1.6.1..
turtle> [2 of 2] Linking dist/build/system-filepath-tests/system-filepath-tests [Library changed]
turtle> Preprocessing test suite 'cptree' for turtle-1.6.1..
turtle> Building test suite 'cptree' for turtle-1.6.1..
turtle> [2 of 2] Linking dist/build/cptree/cptree [Library changed]
…

Our package only built the “diff” (the Turtle.Prelude module we just changed)!

Benchmarks

For the turtle package the speed-up is not a huge deal because the package doesn’t take long time to compile, but the benefit for our main project at work is dramatic!

As I mentioned in the introduction, our work project normally takes ~14 minutes to build and after this change builds can be as fast as ~3.5 minutes. In fact, they could even be faster except for the presence of a Paths_* module that is rebuilt each time and triggers a large number of gratuitous downstream rebuilds (we’re working on fixing that).

Limitations

There is one major issue with this work, which is that it does not work well with flakes.

Specifically, if you try to turn the above default.nix into the equivalent flake the build will fail because Nix’s flake mechanism will copy the project into the /nix/store but without the .git history, so builtins.fetchGit will fail to to fetch the current repository’s history necessary to truncate the build to the previous day.

I believe this can be fixed with a change to flakes to support something like a ?shallow=false or ?allRefs=true addendum to git URLs, but I have not implemented that, yet.

6 comments:

  1. Nice!

    Have you considered using an import-from-derivation derivation to calculate the desired commit (via `git rev-list -1 --before '1 day ago' HEAD`), and then use that with fetchGit?

    Essentially your Non-Solution #2, but using import-from-derivation to break the unwanted dependency on the dirty `.git` directoy?

    ReplyDelete
    Replies
    1. Yes. I tried that and left out that solution, but there were a couple of reasons I went the "patch fetchGit route":

      - You would have to disable the sandbox for the solution based on import-from-derivation
      - fetchGit has a few niceties that I didn't want to reinvent (like the git-aware caching logic and better support for private repositories)

      Delete
  2. Where is the Paths_* work happening? And what's the approach to fixing it that you have in mind? (It also seems to be an issue with ghc-nix.)

    ReplyDelete
    Replies
    1. It ended up being a red herring for us. Even after fixing our code to not depend on a `Paths_*` module the problem still persisted. We're still narrowing down the problem to a minimal reproduction, but all we know is that at least one necessary component of the problem is having test modules that use either the `TemplateHaskell` or `TemplateHaskellQuotes` language extensions, because those will get rebuilt even if there are no changes to the library stanza. However, there seem to be other hidden requirements to trigger this issue because I can't reproduce it on smaller packages like `turtle`

      Delete
    2. Interesting. I've been experiencing (with ghc-nix) is full recompilation of the test suite on any change. I initially thought it was a preprocessor issue (hspec-discover), though removing that alone didn't fix things. It didn't occur to me that it might be QQ/TH. It's a bit strange that we wouldn't see this when just using plain cabal though.

      Delete
    3. I came back to this today, and think I found the reason. Cabal calls GHC twice on the test suite, as described starting here: https://github.com/haskell/cabal/issues/1177#issuecomment-22097657 (all the way to the end). For me at least, it seems this is the problem, since each call is isolated from the other, and so forces an entire rebuild.

      Delete