Friday, November 3, 2017

Semantic integrity checks are the next generation of semantic versioning

The Dhall configuration language just added support for "semantic integrity checks". This post explains what "semantic integrity check" means, motivates the new feature, and compares to semantic versioning.

The problem

I added this feature in response to user concerns about code injection in Dhall configuration files.

We'll illustrate the problem using the following example.dhall configuration file which derives a summary of student information from a list of students:

    -- Example of an expression imported by URL
    let map = http://prelude.dhall-lang.org/List/map

    -- Example of an expression imported by path
in  let students = ./students.dhall

in  let getName = λ(student : { name : Text, age : Natural }) → student.name

in  { classSize = List/length { name : Text, age : Natural } students
    , names     = map { name : Text, age : Natural } Text getName students
    }

This configuration imports a helper function named map from the Dhall Prelude by URL:

    let map = http://prelude.dhall-lang.org/List/map

in  ...

... and that URL currently hosts a text file encoding the following Dhall function:

$ curl -L http://prelude.dhall-lang.org/List/map
{-
Tranform a list by applying a function to each element

Examples:

./map Natural Bool Natural/even ([+2, +3, +5] : List Natural)
= [True, False, False] : List Bool

./map Natural Bool Natural/even ([] : List Natural)
= [] : List Bool
-}
let map : ∀(a : Type) → ∀(b : Type) → (a → b) → List a → List b
    =   λ(a : Type)
    →   λ(b : Type)
    →   λ(f : a → b)
    →   λ(xs : List a)
    →   List/build
        b
        (   λ(list : Type)
        →   λ(cons : b → list → list)
        →   List/fold a xs list (λ(x : a) → cons (f x))
        )

in  map

Similarly, our example configuration imports student data from another configuration file by path:

...

in  let students = ./students.dhall

...

... and we'll assume that file contains the following list of student records:

[ { name = "Jane Doe"    , age = +19 }
, { name = "John Rivera" , age = +18 }
, { name = "Alice O'Hare", age = +19 }
]

Values, functions, and types are all Dhall expressions, so we can inject all of them in our code via URLs or paths. When we interpret a Dhall configuration file these imports get substituted with their contents and then we evaluate the fully resolved configuration file as an expression in a functional language:

$ dhall <<< './example.dhall'  | dhall-format
{ classSize : Natural, names : List Text }

{ classSize = +3
, names     = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

Users were concerned that these imports could be compromised, resulting in malicious code injection

The solution

The latest release of Dhall added support for import integrity checks to address user concerns about malicious tampering. We can use these integrity checks to "freeze" our imports by adding a SHA-256 hash after each import.

First, we ask the dhall-hash utility to compute the current hash for our imports:

$ dhall-hash <<< 'http://prelude.dhall-lang.org/List/map'
sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb
$ dhall-hash <<< './students.dhall' 
sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

... and then we append the hash after each import to freeze the import:

    let map =
          http://prelude.dhall-lang.org/List/map sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb

in  let students =
          ./students.dhall sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

in  let getName = λ(student : { name : Text, age : Natural }) → student.name

in  { classSize = length { name : Text, age : Natural } students
    , names     = map { name : Text, age : Natural } Text getName students
    }

Once you add these integrity checks the Dhall interpreter will enforce them when resolving imports. In this case, the example configuration still successfully evaluates to the same result after adding the integrity checks:

$ dhall <<< './example.dhall'  | dhall-format
{ classSize : Natural, names : List Text }

{ classSize = +3
, names     = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

The integrity check passes because we haven't yet modified any of our imports.

Semantic integrity

Once you freeze an import with a hash, Dhall guarantees that the meaning of the import never changes. These are semantic hashes, not textual hashes.

For example, suppose that we modify ./students.dhall to add a comment, reorder record fields, and modify the formatting, like this:

-- Class of 2017

[ { age = +19, name = "Jane Doe" },
  { name = "John Rivera" , age = +18 },
  { name = "Alice O'Hare", age = +19 } ]

These changes do not affect the computed hash of the file and the interpreter still accepts the ./students.dhall import that we protected with an integrity check:

$ dhall <<< './example.dhall'  | dhall-format  # Still succeeds
{ classSize : Natural, names : List Text }

{ classSize = +3
, names     = [ "Jane Doe", "John Rivera", "Alice O'Hare" ] : List Text
}

The Dhall interpreter accepted the import of ./students.dhall because the semantic hash never changed:

$ dhall-hash <<< './students.dhall' 
sha256:6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

However, now suppose we try to change the substance of the file by modifying John's age:

-- Class of 2017

[ { age = +19, name = "Jane Doe" },
  { name = "John Rivera" , age = +20 },
  { name = "Alice O'Hare", age = +19 } ]

Now the semantic integrity check fails:

$ dhall <<< './example.dhall'

Error: Import integrity check failed

Expected hash:

↳ 6c4205ed51c0201abcccd1d90be4d7cd4c492246176ab404c35886a03d9dfc06

Actual hash:

↳ 808d921914de5349f50ac656bed93c2894dfe35401991e1ca0c89861834023fb

Dhall recognizes that this is no longer the same expression and rejects the import. Only an import that represents the same value can pass the check.

This means, for example, that malicious users cannot tamper with our imports, even if we were to distribute the imported code over an insecure channel. The worst that an attacker can do is cause our configuration to reject the import, but they cannot trick the configuration into silently accepting the wrong expression.

Refactoring

We can use these integrity checks to do more than just secure code. We can also repurpose these checks to assert that our code refactors are safe and behavior-preserving.

For example, suppose that we change the student list to:

-- Class of 2017

    let double = λ(x : Natural) → x * +2

in  [ { name = "Jane Doe"    , age = +19       }
    , { name = "John Rivera" , age = double +9 }
    , { name = "Alice O'Hare", age = +19       }
    ]

This will still pass the integrity check because the student list still evaluates to the same expected result.

We can also refactor our project layout, too. For example, we could modify the student list to import the double function from another file:

-- Class of 2017

[ { name = "Jane Doe"    , age = +19               }
, { name = "John Rivera" , age = ./double.dhall +9 }
, { name = "Alice O'Hare", age = +19               }
]

... where ./double.dhall has the following contents:

λ(x : Natural) → x * +2

... and the integrity check would still pass.

I originally introduced semantic integrity checks to protect against malicious code modification then later realized that they can also be used to protect against non-malicious modifications (such as a refactor gone wrong).

Textual hashes

The semantic hash provides a more information than a textual hash of the import. For example, suppose we changed our ./double.dhall function to triple the argument:

λ(x : Natural) → x * +3

A textual hash of the ./students.dhall import would not detect this change because the real change took place in the text of another file that ./students.dhall imported. However, A semantic hash can follow these imports to detect transitive changes to dependencies.

The semantic hash is also more flexible than a textual hash because the semantic hash does not change when we make cosmetic changes like refactoring, reformatting, or commenting code.

Caveats

Dhall's semantic versioning can reject some behavior-preserving changes to functions. Dhall only attempts to detect if two functions are β-equivalent (i.e. the same if fully β-reduced).

For example, the following two functions are equivalent, but will not produce the same hash:

λ(x : Bool) → x

λ(x : Bool) → if x then True else False

Similarly, Dhall's semantic hash cannot detect that these two functions are the same:

λ(x : Natural) → x * +2

λ(x : Natural) → x + x

On the other hand, Dhall will (almost) never give two semantically distinct expressions the same hash. Only an astronomically improbable hash collision can cause this and at the time of this writing there is no known vulnerability in the SHA-256 hash algorithm.

Dhall will support other hash algorithms should SHA-256 ever be broken. This is why Dhall prefixes the hash with the algorithm to leave the door open for new hash algorithms.

Semantic versioning

You might wonder how semantic integrity checks compare to semantic versioning. I like to think of semantic integrity checks and semantic versions as two special cases of the following abstract interface:

a package publishes a version string for each official release
you can compare two version strings to detect a breaking change to the package

Semantic versioning is one special case of that abstract interface where:

the version string has a major number and minor number
a difference in major version numbers signals a breaking change

Some variations on semantic versioning propose independently versioning each exported function/value/type instead of versioning the package as a whole. Also, some languages (like Elm) mechanically enforce semantic versioning by detecting API changes programmatically and forcing a major version bump if there is a breaking change.

A semantic integrity check is another special case of that abstract interface where:

the version string is a SHA-256 hash
if two hashes are different then that signals a breaking change

The key difference between semantic versioning and semantic integrity checks is how we define "a breaking change". Semantic version numbers (usually) treat changes to types as breaking changes whereas semantic integrity checks treat changes to values as breaking changes. (To be totally pedantic: semantic integrity checks treat changes to expressions as breaking changes, and in a language like Dhall everything is an expression, including types).

This does not imply that semantic integrity checks are better than semantic version numbers. Sometimes you want to automatically pick up small changes or improvements from your dependencies without adjusting a hash. In cases like those you want the expected type to be the contract with your dependency and you don't want to pin the exact value.

For example, we could "simulate" semantic versioning in Dhall by attaching a type annotation to our ./students.dhall import like this:

    let map =
          http://prelude.dhall-lang.org/List/map sha256:3063e9b34fd4235165a7a46e3ee3e0d0d7cded5da16f5572cc9e459ed5452fbb

in  let students =
          ./students.dhall : List { name : Text, age : Natural }

in  let getName = λ(student : { name : Text, age : Natural }) → student.name

in  { classSize = List/length { name : Text, age : Natural } students
    , names     = map { name : Text, age : Natural } Text getName students
    }

... and now we can add or remove students from our imported list without breaking anything. We've used the type system as a coarser integrity check to state that certain changes to our configuration file's meaning are okay.

Conclusion

You can think of a semantic integrity check as a "value annotation" (i.e. the term-level equivalent of a type annotation). Instead of declaring an expected type we declare an expected value summarized as a hash.

This is why the title of this post declares that "semantic integrity checks are the next generation of semantic versioning". If you think of a semantic version as a concise summary of an imported package's type, then a semantic integrity check is a concise summary of an imported package's value.

16 comments:

AnonymousNovember 27, 2017 at 2:58 AM
Hello, Gabriel! I hit an interesting problem: migration of config file. It's like migration in DB when you change schema but in config file. Problem here is that I need to switch FORWARD from version to version as well as BACKWARD (to old versions). IMHO to support this I have options: 1) to have some "isomorphism" between config schemes; 2) to keep config in DB and to make traditional migration with ALTER SQL-statements (sometimes they are not enough). I found your Dhall, so my questions are: 1) can DHall be used to solve such problems (as for me, it looks little bit complex, more close to Haskell then to JSON) 2) I suppose you check other solutions before to write DHall, so do you know another Haskell libs to solve configuration files with migration/multiple version supporting?

Currently I solved it with: unrelated to versions config data (I called it semantic) and versioning config data, so I install default config then apply semantic config. Sure, modification of config made by user is lost after switch to version with new schema. I'm hope you understand what problem I'm trying to describe :) More interesting even is to listen you thinks about this topic, IMHO you have a lot of them if you decide to create DHall :)

And thank you a lot for what do you do for Haskell community!
ReplyDelete
Replies
AnonymousNovember 27, 2017 at 9:47 AM
Cool! Gabriel, what were the reasons to have lambdas in data definition language? To have self-sufficient language like XML+XSD+XSLT (data+schema+transformation) ? Or something else?
ReplyDelete
Replies
UnknownJanuary 27, 2018 at 1:13 PM
Sounds like you could also use this for caching - if a piece of code's semantic hash hasn't changed, and you've cached the result of the previous computation - then you can just reuse the cached value.

Of course the assumption there would be the piece of code in question doesn't involve any side effects. (but it might also require your hashing algorithm to take into account of code dependencies).
ReplyDelete
Replies
UnknownJanuary 27, 2018 at 2:48 PM
Yes indeed that's pretty useful - thanks for sharing!
ReplyDelete
Replies

Add comment

Haskell for all