Monday, November 27, 2017

Compare Nix derivations using nix-diff

I'm announcing a small nix-diff utility I wrote for comparing Nix derivations. This post will walk through two use cases for how you might use this utility.

Background

This section provides some required background for understanding this post if you're new to Nix.

There are three stages to a Nix build:

  • Nix source code (i.e. *.nix files)
    • This corresponds to a source distribution in a typical package manager
  • Nix derivations (i.e. /nix/store/*.drv files)
    • This is the stage that caching works at
  • Nix build products (i.e. /nix/store/* files that are not derivations)
    • This corresponds to a binary distribution in a typical package manager

You can convert between these stages using the following command-line tools:

  • nix-instantiate converts Nix source code to Nix derivations
    • i.e. *.nix → /nix/store/*.drv
  • nix-store --realise converts Nix derivations to Nix build products
    • i.e. /nix/store/*.drv → /nix/store/*
  • nix-build is a convenience utility which combines the two preceding steps to go straight from source code to build products
    • i.e. *.nix → /nix/store/*

Nix supports caching binary build products so if you try to build the same derivation twice then the second build will reuse the result of the first build (i.e. a "cache hit"). If the derivation changes in any way, you get a "cache miss" and you need to build the derivation.

Carefully note that caching works at the level of Nix derivations and not at the level of Nix source code. For example, the following two Nix files differ at the source code level:

$ cat example0.nix 
let
  pkgs = import <nixpkgs> { };

in
  pkgs.hello

$ cat example1.nix 
(import <nixpkgs> { }).hello

... but they produce the exact same derivation file:

$ nix-instantiate example0.nix 
/nix/store/ajypjz54a8rn1qxsnhyr8m87w6hd7ghp-hello-2.10.drv

$ nix-instantiate example1.nix 
/nix/store/ajypjz54a8rn1qxsnhyr8m87w6hd7ghp-hello-2.10.drv

... which means that if you try to build both example0.nix and example1.nix the build will only occur once since they share the same derivation.

You can think of the derivation file as a language-independent description of how to build something:

$ fold /nix/store/ajypjz54a8rn1qxsnhyr8m87w6hd7ghp-hello-2.10.drv 
Derive([("out","/nix/store/1ijp0xy3s6idns5047lxky6nlj4lrcap-hello-2.10","","")],
[("/nix/store/3ma3q2qf60gvsqs4w0k2krcyikr1pvhf-bash-4.4-p12.drv",["out"]),("/nix
/store/8g65wh5ng9dc68mz07wlznzg4f2zqhlh-stdenv-darwin.drv",["out"]),("/nix/store
/gqwk1j05s2zfykfj4y9k15gs4zl0lynr-hello-2.10.tar.gz.drv",["out"])],["/nix/store/
9krlzvny65gdc8s7kpb6lkx8cd02c25b-default-builder.sh","/nix/store/z347hsajryw593h
802ggb63lbr3gpv2b-standard-sandbox.sb"],"x86_64-darwin","/nix/store/axikcsz4wh2q
pi5zmlfsmm4jx8wm8s1g-bash-4.4-p12/bin/bash",["-e","/nix/store/9krlzvny65gdc8s7kp
b6lkx8cd02c25b-default-builder.sh"],[("__impureHostDeps","/System/Library/Framew
orks/CoreFoundation.framework/CoreFoundation /dev/zero /dev/random /dev/urandom 
/bin/sh"),("__propagatedImpureHostDeps",""),("__propagatedSandboxProfile",""),("
__sandboxProfile","(allow file-read* (literal \"/usr/lib/libncurses.5.4.dylib\")
)\n(import \"/nix/store/z347hsajryw593h802ggb63lbr3gpv2b-standard-sandbox.sb\")\
n"),("buildInputs",""),("builder","/nix/store/axikcsz4wh2qpi5zmlfsmm4jx8wm8s1g-b
ash-4.4-p12/bin/bash"),("configureFlags",""),("doCheck","1"),("name","hello-2.10
"),("nativeBuildInputs",""),("out","/nix/store/1ijp0xy3s6idns5047lxky6nlj4lrcap-
hello-2.10"),("propagatedBuildInputs",""),("propagatedNativeBuildInputs",""),("s
rc","/nix/store/3x7dwzq014bblazs7kq20p9hyzz0qh8g-hello-2.10.tar.gz"),("stdenv","
/nix/store/dl508ngmyfglplp338np4lnx98prwsbd-stdenv-darwin"),("system","x86_64-da
rwin")])

These *.drv files use the ATerm file format and are Nix-independent. Conceptually, Nix is just a domain-specific language for generating these ATerm files. That means, for example, that you could replace Nix with any front-end language or tool that can generate these ATerm files. In fact, this is how Guix works, by replacing Nix with Guile Scheme as the front-end language.

Understanding how Nix derivations work is fundamental to understanding the Nix ecosystem. nix-diff is one tool that aids this learning process as the following sections will illustrate.

Cache misses

nix-diff is a tool that I wish I had back when Awake Security first adopted Nix. We frequently ran into cache misses when using Nix because of subtle differences in Nix derivations in different development environments.

We can understand why we got cache misses by referring back to the three stages of a Nix build:

  • Nix source code (i.e. *.nix files)
  • Nix derivations (i.e. /nix/store/*.drv files)
  • Nix build products (i.e. /nix/store/* files that are not derivations)

For production we prefer to distribute Nix build products (i.e. binary distributions), but internally for development we distribute Nix source code. We prefer Nix code internally because this gives developers complete control over all of their transitive dependencies. For example, a developer can easily patch the systemd executable used on the virtual machine that runs their integration tests.

However, this flexibility comes at a price: if you don't know what you are doing you can easily accidentally change the derivation. This is because Nix and Nixpkgs are customizable to a fault and they have all sorts of "impure" defaults that change depending on the development environment. If you trip over one of these pitfalls you end up with a cache miss, which is a poor user experience.

The most common pitfalls we ran into early on in our Nix adoption were:

  • Not pinning nixpkgs
    • Note: We publicly shared our recipe for pinning nixpkgs here
  • Not pinning the system field for a derivation
    • This field defaults to the impure builtins.currentSystem in many cases
  • Impure surprises in nixpkgs

Let's motivate this with a real example. Suppose that I have the following derivation to build the Glasgow Haskell compiler (ghc):

$ cat example0.nix
let
  pkgs = import <nixpkgs> { };

in
  pkgs.ghc

This Nix expression is "impure" because the expression depends on the ambient nixpkgs channel that the user has installed. Compare this to the following expression which pins nixpkgs to a specific revision protected by a hash:

$ cat example1.nix
let
  # https://nixos.wiki/wiki/How_to_fetch_Nixpkgs_with_an_empty_NIX_PATH
  fetchNixpkgs = import ./fetchNixpkgs.nix;

  nixpkgs = fetchNixpkgs {
     rev = "76d649b59484607901f0c1b8f737d8376a904019";
     sha256 = "01c2f4mj4ahir0sxk9kxbymg2pki1pc9a3y6r9x6ridry75fzb8h";
  };

  pkgs = import nixpkgs { };

in
  pkgs.ghc

Let's instantiate the two expressions to compute their derivations:

$ nix-instantiate example0.nix 
/nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv
$ nix-instantiate example1.nix 
/nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv

Note that you may get a different result for the first derivation depending on what version of the nixpkgs channel you have installed.

Visually comparing the two derivation files is tedious and time-consuming:

$ fold /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv | head
Derive([("doc","/nix/store/x3hcyy01kb980yiirjjb3svzrdb0pqdy-ghc-8.0.2-doc","",""
),("man","/nix/store/l1ws9nypjg4xh8jj47dapx71cmgfb97a-ghc-8.0.2-man","",""),("ou
t","/nix/store/76b5ryd9wsc0iimlfz6f4n8kgawf8cli-ghc-8.0.2","","")],[("/nix/store
/1ncnhkd9r4k3wmlwbymccfhlqp3bk2cp-python2.7-Sphinx-1.6.5.drv",["out"]),("/nix/st
ore/2zdlq3dj3mk91ccya7k9z6d5i7lag912-clang-wrapper-4.0.1.drv",["out"]),("/nix/st
ore/3ma3q2qf60gvsqs4w0k2krcyikr1pvhf-bash-4.4-p12.drv",["out"]),("/nix/store/5mp
3qjkbzvmi4yvin1dbfdr1bkzgq9dl-perl-5.24.3.drv",["out"]),("/nix/store/8g65wh5ng9d
c68mz07wlznzg4f2zqhlh-stdenv-darwin.drv",["out"]),("/nix/store/9z3ykw788f50yhi4f
nn3s1ldyyg5s99x-ncurses-5.9.drv",["dev","out"]),("/nix/store/hw59y7rf8w28s123b51
ac57kbd0azjvh-coreutils-8.28.drv",["out"]),("/nix/store/km0zhgg5ykpnwnrczinggxs5

$ fold /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv | head
Derive([("doc","/nix/store/qlg3a9923hbcb1vhhaka90c33vrfgbrv-ghc-8.0.2-doc","",""
),("out","/nix/store/69spfrh96hc6y3hcb7w4i0l6s25pslkd-ghc-8.0.2","","")],[("/nix
/store/0ci2jv8sygw63hyl48ac6caw7fn3jrd7-ncurses-5.9.drv",["dev","out"]),("/nix/s
tore/1ksvs625n8lwjhjxld446gn9ql23v5k8-bash-4.4-p5.drv",["out"]),("/nix/store/dqj
rkys7d0c2z4ggny27a0vzpbzvz8y2-ghc-8.0.2-src.tar.xz.drv",["out"]),("/nix/store/dw
srl4iqnc3ij79h2xfn8fl3xnnk2zrg-gmp-6.1.1.drv",["dev","out"]),("/nix/store/gk2ng3
j3ixx6diq5s4xmysj670k62lly-perl-5.22.3.drv",["out"]),("/nix/store/i00ja8b4y0yv9b
aj7qd0caj6az0c8phj-ghc-7.10.3.drv",["out"]),("/nix/store/k82idwsbgby27nkjrwr9bhq
64c95irgf-coreutils-8.26.drv",["out"]),("/nix/store/nmkqpzlahvmpsnn0s5knc6wspy6b
305l-stdenv-darwin.drv",["out"]),("/nix/store/qv0cpl2g4bk5nn5l2hx5fyc2dw6xdjc9-c

If we use nix-diff, then we can pull out the differences immediately:

$ nix-diff /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv 
- /nix/store/fx0xn9djgvvw3h5jdmwybg0ga5qk844d-ghc-8.0.2.drv:{out}
+ /nix/store/9shbgc70h32f99nasdd6f8fd7cf9c645-ghc-8.0.2.drv:{out}The set of outputs do not match:
    + {man}The builders do not match
    - /nix/store/hsk82g493i7r496ghs0y61m6yvknxcml-bash-4.4-p5/bin/bash
    + /nix/store/axikcsz4wh2qpi5zmlfsmm4jx8wm8s1g-bash-4.4-p12/bin/bash
• The set of input names do not match:
    - bash-4.4-p5
    - clang-wrapper-3.7.1
    - coreutils-8.26
    - gmp-6.1.1
    - perl-5.22.3
    - python2.7-Sphinx-1.5.2
    + bash-4.4-p12
    + clang-wrapper-4.0.1
    + coreutils-8.28
    + gmp-6.1.2
    + perl-5.24.3
    + python2.7-Sphinx-1.6.5

Now we can see at a glance that the versions of several dependencies changed and GHC has split out its man pages into a new man output for better granularity of the build graph.

Note that these are not the only differences between the two derivations. However, all of the other differences are downstream of the above differences. For example, the two derivations have different out paths, but we expect them to differ for any two derivations that are not identical so there's no point including that in the diff. nix-diff makes an effort to highlight the root cause of the difference.

Understanding differences

Nix is more than just a package manager. You can use Nix to build and deploy an entire machine, which is how NixOS (the Nix operating system) works. The machine configuration is a Nix expression that you can instantiate and build like any other Nix expression.

This means that we can also use nix-diff to compare two machine configurations and understand how they differ. For example, when we change our production systems at Awake Security we sometimes run the change through nix-diff during code review to ensure that reviewers understand every change being made to the system.

We can illustrate this with a small example comparing two NixOS system specifications. The first system specification is a mostly blank system:

$ cat example0.nix
let
  nixos = import <nixpkgs/nixos> {
    system = "x86_64-linux";

    configuration = {
      boot.loader.grub.devices = [ "/dev/sda" ];

      fileSystems."/" = {
        device = "/dev/sda";
      };
    };
  };

in
  nixos.system

... and the second specification enables Kafka on the system:

$ cat example1.nix
let
  nixos = import <nixpkgs/nixos> {
    system = "x86_64-linux";

    configuration = {
      boot.loader.grub.devices = [ "/dev/sda" ];

      fileSystems."/" = {
        device = "/dev/sda";
      };

      services.apache-kafka.enable = true;
    };
  };

in
  nixos.system

We can differentiate the two derivations in one step like this:

$ nix-diff $(nix-instantiate example0.nix) $(nix-instantiate example1.nix)
- /nix/store/6z9nr5pzs4j1v9mld517dmlcz61zy78z-nixos-system-nixos-18.03pre119245.
5cfd049a03.drv:{out}
+ /nix/store/k05ibijg0kknvwrgfyb7dxwjrs8qrlbj-nixos-system-nixos-18.03pre119245.
5cfd049a03.drv:{out}The input named `etc` differs
  - /nix/store/05c0v10pla0v8rfl44rs744m6wr729jy-etc.drv:{out}
  + /nix/store/8waqvzjg7bazzfzr49m89q299kz972wv-etc.drv:{out}The input named `dbus-1` differs
    - /nix/store/a16j2snzz25dhh96jriv3p6cgkc0vhxr-dbus-1.drv:{out}
    + /nix/store/mliabzdkqaayya67xiwfhwkg4gs9k0cg-dbus-1.drv:{out}The input named `system-path` differs
      - /nix/store/jcf6q7na01j8k9xcmqxykl62k4x6zwiv-system-path.drv:{out}
      + /nix/store/kh4kgsms24d02bxlrxb062pgsbs3riws-system-path.drv:{out}The set of input names do not match:
          + apache-kafka-2.12-0.10.2.0
  • The input named `system-path` differs
    • These two derivations have already been compared
  • The input named `system-units` differs
    - /nix/store/yqnqdajd4664rvycrnwxwaj0mxp7602c-system-units.drv:{out}
    + /nix/store/2p5c4arwqphdz5wsvz6dbrgv0vhgf5qh-system-units.drv:{out}The set of input names do not match:
        + unit-apache-kafka.service
  • The input named `user-units` differs
    - /nix/store/x34dqw5y34dq6fj5brj2b5qf0nvglql9-user-units.drv:{out}
    + /nix/store/4iplnk260q2dpr8b8ajrjkrn44yk06aq-user-units.drv:{out}The input named `unit-dbus.service` differs
      - /nix/store/fd6j972zn1hfvqslxc8c64xxaf1wg475-unit-dbus.service.drv:{out}
      + /nix/store/s7rpgwbald9qx8rwlw4v276wj2x3ld8r-unit-dbus.service.drv:{out}The input named `dbus-1` differs
        • These two derivations have already been compared
• The input named `system-path` differs
  • These two derivations have already been compared
• The input named `users-groups.json` differs
  - /nix/store/x6c7pqx40wfdzwf96jfi1l0hzxjgypri-users-groups.json.drv:{out}
  + /nix/store/gk5yyjw579hgyxgwbrh1kzb3hbdbzgbq-users-groups.json.drv:{out}The environments do not match:
      text=''
          {"groups":[{"gid":55,"members":[],"name":"adm"},{"gid":17,"members":[]
,"name":"audio"},{"gid":24,"members":[],"name":"cdrom"},{"gid":27,"members":[],"
name":"dialout"},{"gid":6,"members":[],"name":"disk"},{"gid":18,"members":[],"na
me":"floppy"},{"gid":174,"members":[],"name":"input"},{"gid":96,"members":[],"na
me":"keys"},{"gid":2,"members":[],"name":"kmem"},{"gid":20,"members":[],"name":"
lp"},{"gid":4,"members":[],"name":"messagebus"},{"gid":30000,"members":["nixbld1
","nixbld10","nixbld11","nixbld12","nixbld13","nixbld14","nixbld15","nixbld16","
nixbld17","nixbld18","nixbld19","nixbld2","nixbld20","nixbld21","nixbld22","nixb
ld23","nixbld24","nixbld25","nixbld26","nixbld27","nixbld28","nixbld29","nixbld3
","nixbld30","nixbld31","nixbld32","nixbld4","nixbld5","nixbld6","nixbld7","nixb
ld8","nixbld9"],"name":"nixbld"},{"gid":65534,"members":[],"name":"nogroup"},{"g
id":0,"members":[],"name":"root"},{"gid":62,"members":[],"name":"systemd-journal
"},{"gid":110,"members":[],"name":"systemd-journal-gateway"},{"gid":152,"members
":[],"name":"systemd-network"},{"gid":153,"members":[],"name":"systemd-resolve"}
,{"gid":154,"members":[],"name":"systemd-timesync"},{"gid":25,"members":[],"name
":"tape"},{"gid":3,"members":[],"name":"tty"},{"gid":100,"members":[],"name":"us
ers"},{"gid":29,"members":[],"name":"utmp"},{"gid":19,"members":[],"name":"uucp"
},{"gid":26,"members":[],"name":"video"},{"gid":1,"members":[],"name":"wheel"}],
"mutableUsers":true,"users":[{"createHome":false,"description":"→Apache Kafka 
daemon user","group":"nogroup","hashedPassword":null,"home":"/tmp/kafka-logs","i
nitialHashedPassword":null,"initialPassword":null,"isSystemUser":false,"name":"a
pache-kafka","password":null,"passwordFile":null,"shell":"/run/current-system/sw
/bin/nologin","uid":169},{"createHome":false,"description":"→D-Bus system mess
...

However, this doesn't do the diff justice because the output is actually colorized, like this:

From the diff we can see that:

  • This change adds Kafka executables to the system PATH
  • This change adds a new apache-kafka systemd service
  • This change adds a new apache-kafka user to the system

Note how nix-diff does more than diffing the two root derivations. If the two derivations differ on a shared input then nix-diff will descend into that input and diff that and repeat the process until the root cause of the change is found. This works because Nix's dependency graph is complete and reachable from the root derivation.

Conclusion

You can find the nix-diff utility on Hackage or GitHub if you would like to use this in your own development workflow. Hopefully nix-diff will help you better understand how Nix works under the hood and also help you pin Nix derivations more robustly.

Friday, November 3, 2017

Semantic integrity checks are the next generation of semantic versioning

The Dhall configuration language just added support for "semantic integrity checks". This post explains what "semantic integrity check" means, motivates the new feature, and compares to semantic versioning.

The problem

I added this feature in response to user concerns about code injection in Dhall configuration files.

We'll illustrate the problem using the following example.dhall configuration file which derives a summary of student information from a list of students:

    -- Example of an expression imported by URL
    let map = http://prelude.dhall-lang.org/List/map

    -- Example of an expression imported by path
in  let students = ./students.dhall

in  let getName = λ(student : { name : Text, age : Natural })  student.name

in  { classSize = List/length { name : Text, age : Natural } students
    , names     = map { name : Text, age : Natural } Text getName students
    }

This configuration imports a helper function named map from the Dhall Prelude by URL:

    let map = http://prelude.dhall-lang.org/List/map

in  ...

... and that URL currently hosts a text file encoding the following Dhall function:

$ curl -L http://prelude.dhall-lang.org/List/map
{-
Tranform a list by applying a function to each element

Examples:

./map Natural Bool Natural/even ([+2, +3, +5] : List Natural)
= [True, False, False] : List Bool

./map Natural Bool Natural/even ([] : List Natural)
= [] : List Bool
-}
let map : ∀(a : Type) → ∀(b : Type) → (a → b) → List a → List b
    =   λ(a : Type)
    →   λ(b : Type)
    →   λ(f : a → b)
    →   λ(xs : List a)
    →   List/build
        b
        (   λ(list : Type)
        →   λ(cons : b → list → list)
        →   List/fold a xs list (λ(x : a) → cons (f x))
        )

in  map

Similarly, our example configuration imports student data from another configuration file by path:

...

in  let students = ./students.dhall

...

... and we'll assume that file contains the following list of student records:

[ { name = "Jane Doe"    , age = +19 }
, { name = "John Rivera" , age = +18 }
, { name = "Alice O'Hare", age = +19 }
]

Values, functions, and types are all Dhall expressions, so we can inject all of them in our code via URLs or paths. When we interpret a Dhall configuration file these imports get substituted with their contents and then we evaluate the fully resolved configuration file as an expression in a functional language:

$ dhall <<< './example.dhall'  | dhall-format
{ classSize : Natural, names : List Text }

{ classSize = +3
, names     = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

Users were concerned that these imports could be compromised, resulting in malicious code injection

The solution

The latest release of Dhall added support for import integrity checks to address user concerns about malicious tampering. We can use these integrity checks to "freeze" our imports by adding a SHA-256 hash after each import.

First, we ask the dhall-hash utility to compute the current hash for our imports:

$ dhall-hash <<< 'http://prelude.dhall-lang.org/List/map'
sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb
$ dhall-hash <<< './students.dhall' 
sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

... and then we append the hash after each import to freeze the import:

    let map =
          http://prelude.dhall-lang.org/List/map sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb

in  let students =
          ./students.dhall sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

in  let getName = λ(student : { name : Text, age : Natural })  student.name

in  { classSize = length { name : Text, age : Natural } students
    , names     = map { name : Text, age : Natural } Text getName students
    } 

Once you add these integrity checks the Dhall interpreter will enforce them when resolving imports. In this case, the example configuration still successfully evaluates to the same result after adding the integrity checks:

$ dhall <<< './example.dhall'  | dhall-format
{ classSize : Natural, names : List Text }

{ classSize = +3
, names     = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

The integrity check passes because we haven't yet modified any of our imports.

Semantic integrity

Once you freeze an import with a hash, Dhall guarantees that the meaning of the import never changes. These are semantic hashes, not textual hashes.

For example, suppose that we modify ./students.dhall to add a comment, reorder record fields, and modify the formatting, like this:

-- Class of 2017

[ { age = +19, name = "Jane Doe" },
  { name = "John Rivera" , age = +18 },
  { name = "Alice O'Hare", age = +19 } ]

These changes do not affect the computed hash of the file and the interpreter still accepts the ./students.dhall import that we protected with an integrity check:

$ dhall <<< './example.dhall'  | dhall-format  # Still succeeds
{ classSize : Natural, names : List Text }

{ classSize = +3
, names     = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

The Dhall interpreter accepted the import of ./students.dhall because the semantic hash never changed:

$ dhall-hash <<< './students.dhall' 
sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

However, now suppose we try to change the substance of the file by modifying John's age:

-- Class of 2017

[ { age = +19, name = "Jane Doe" },
  { name = "John Rivera" , age = +20 },
  { name = "Alice O'Hare", age = +19 } ]

Now the semantic integrity check fails:

$ dhall <<< './example.dhall'

Error: Import integrity check failed

Expected hash:

↳ 6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

Actual hash:

↳ 808d921914de5349f50ac656bed93c2894dfe35401991e1ca0c89861834023fb

Dhall recognizes that this is no longer the same expression and rejects the import. Only an import that represents the same value can pass the check.

This means, for example, that malicious users cannot tamper with our imports, even if we were to distribute the imported code over an insecure channel. The worst that an attacker can do is cause our configuration to reject the import, but they cannot trick the configuration into silently accepting the wrong expression.

Refactoring

We can use these integrity checks to do more than just secure code. We can also repurpose these checks to assert that our code refactors are safe and behavior-preserving.

For example, suppose that we change the student list to:

-- Class of 2017

    let double = λ(x : Natural)  x * +2

in  [ { name = "Jane Doe"    , age = +19       }
    , { name = "John Rivera" , age = double +9 }
    , { name = "Alice O'Hare", age = +19       }
    ]

This will still pass the integrity check because the student list still evaluates to the same expected result.

We can also refactor our project layout, too. For example, we could modify the student list to import the double function from another file:

-- Class of 2017

[ { name = "Jane Doe"    , age = +19               }
, { name = "John Rivera" , age = ./double.dhall +9 }
, { name = "Alice O'Hare", age = +19               }
]

... where ./double.dhall has the following contents:

λ(x : Natural)  x * +2

... and the integrity check would still pass.

I originally introduced semantic integrity checks to protect against malicious code modification then later realized that they can also be used to protect against non-malicious modifications (such as a refactor gone wrong).

Textual hashes

The semantic hash provides a more information than a textual hash of the import. For example, suppose we changed our ./double.dhall function to triple the argument:

λ(x : Natural)  x * +3

A textual hash of the ./students.dhall import would not detect this change because the real change took place in the text of another file that ./students.dhall imported. However, A semantic hash can follow these imports to detect transitive changes to dependencies.

The semantic hash is also more flexible than a textual hash because the semantic hash does not change when we make cosmetic changes like refactoring, reformatting, or commenting code.

Caveats

Dhall's semantic versioning can reject some behavior-preserving changes to functions. Dhall only attempts to detect if two functions are β-equivalent (i.e. the same if fully β-reduced).

For example, the following two functions are equivalent, but will not produce the same hash:

λ(x : Bool)  x
λ(x : Bool)  if x then True else False

Similarly, Dhall's semantic hash cannot detect that these two functions are the same:

λ(x : Natural)  x * +2
λ(x : Natural)  x + x

On the other hand, Dhall will (almost) never give two semantically distinct expressions the same hash. Only an astronomically improbable hash collision can cause this and at the time of this writing there is no known vulnerability in the SHA-256 hash algorithm.

Dhall will support other hash algorithms should SHA-256 ever be broken. This is why Dhall prefixes the hash with the algorithm to leave the door open for new hash algorithms.

Semantic versioning

You might wonder how semantic integrity checks compare to semantic versioning. I like to think of semantic integrity checks and semantic versions as two special cases of the following abstract interface:

  • a package publishes a version string for each official release
  • you can compare two version strings to detect a breaking change to the package

Semantic versioning is one special case of that abstract interface where:

  • the version string has a major number and minor number
  • a difference in major version numbers signals a breaking change

Some variations on semantic versioning propose independently versioning each exported function/value/type instead of versioning the package as a whole. Also, some languages (like Elm) mechanically enforce semantic versioning by detecting API changes programmatically and forcing a major version bump if there is a breaking change.

A semantic integrity check is another special case of that abstract interface where:

  • the version string is a SHA-256 hash
  • if two hashes are different then that signals a breaking change

The key difference between semantic versioning and semantic integrity checks is how we define "a breaking change". Semantic version numbers (usually) treat changes to types as breaking changes whereas semantic integrity checks treat changes to values as breaking changes. (To be totally pedantic: semantic integrity checks treat changes to expressions as breaking changes, and in a language like Dhall everything is an expression, including types).

This does not imply that semantic integrity checks are better than semantic version numbers. Sometimes you want to automatically pick up small changes or improvements from your dependencies without adjusting a hash. In cases like those you want the expected type to be the contract with your dependency and you don't want to pin the exact value.

For example, we could "simulate" semantic versioning in Dhall by attaching a type annotation to our ./students.dhall import like this:

    let map =
          http://prelude.dhall-lang.org/List/map sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb

in  let students =
          ./students.dhall : List { name : Text, age : Natural }

in  let getName = λ(student : { name : Text, age : Natural })  student.name

in  { classSize = List/length { name : Text, age : Natural } students
    , names     = map { name : Text, age : Natural } Text getName students
    } 

... and now we can add or remove students from our imported list without breaking anything. We've used the type system as a coarser integrity check to state that certain changes to our configuration file's meaning are okay.

Conclusion

You can think of a semantic integrity check as a "value annotation" (i.e. the term-level equivalent of a type annotation). Instead of declaring an expected type we declare an expected value summarized as a hash.

This is why the title of this post declares that "semantic integrity checks are the next generation of semantic versioning". If you think of a semantic version as a concise summary of an imported package's type, then a semantic integrity check is a concise summary of an imported package's value.