Haskell for all: The CAP theorem for software engineering

The CAP theorem says that distributed computing systems cannot simultaneously guarantee all three of:

Consistency - Every read receives the most recent write or an error

Availability - Every request receives a (non-error) response - without the guarantee that it contains the most recent write

Partition tolerance - The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

Source: CAP theorem - Wikipedia

Since we cannot guarantee all three, we must typically sacrifice at least one of those guarantees (e.g. sacrifice consistency or sacrifice availability).

However, what if we were to squint and apply the CAP theorem to another distributed system: a team of software engineers working towards a common goal?

In particular:

What if our data store were a distributed version control system?
What if our “nodes” were software developers instead of machines?

If we view engineering through this lens, we can recognize many common software engineering tradeoffs as special cases of CAP theorem tradeoffs. In other words, many architectural patterns also require us to sacrifice at least one of consistency, availability, or partition tolerance among developers.

Before we get into examples, I’d like to review two points that come up in most discussions of the CAP theorem:

Partition tolerance

What does it mean to sacrifice partition tolerance? In the context of machines, this would require us to rule out the possibility of any sort of network failure.

Now replace machines with developers. We’d have to assume that people never miscommunicate, lose internet access, or fetch the wrong branch.

The possibility of partitions is what makes a system a distributed system, so we’re usually not interested in the option of sacrificing partition tolerance. That would be like a computing system with only one machine or a team with only one developer.

Instead, we’ll typically focus on sacrificing either availability or consistency.
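To make the choice between consistency and availability concrete before we move on, here is a minimal sketch of a toy two-node store during a partition. Everything in it (`Cluster`, `Node`, `Unavailable`, the `"CP"`/`"AP"` mode strings) is a hypothetical name invented for illustration, and it greatly simplifies how real systems behave:

```python
from dataclasses import dataclass, field


class Unavailable(Exception):
    """Raised when a node refuses to answer rather than risk a stale read."""


@dataclass
class Node:
    name: str
    value: str = "v0"


@dataclass
class Cluster:
    """Two copies of one value; `partitioned` means they cannot sync."""

    mode: str  # "CP" prefers consistency, "AP" prefers availability
    primary: Node = field(default_factory=lambda: Node("primary"))
    replica: Node = field(default_factory=lambda: Node("replica"))
    partitioned: bool = False

    def write(self, value: str) -> None:
        # Simplifying assumption: writes always land on the primary and
        # reach the replica only while the link between them is up.
        self.primary.value = value
        if not self.partitioned:
            self.replica.value = value

    def read_from_replica(self) -> str:
        if self.partitioned and self.mode == "CP":
            # Consistency first: an error beats a possibly stale answer.
            raise Unavailable("replica cannot confirm it has the latest write")
        # Availability first: always answer, even if the answer is stale.
        return self.replica.value


ap = Cluster(mode="AP")
ap.write("v1")
ap.partitioned = True
ap.write("v2")                      # never reaches the replica
stale = ap.read_from_replica()      # answers, but with the older "v1"
```

In `"CP"` mode the same sequence of calls ends with `read_from_replica()` raising `Unavailable` instead of returning the stale value.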

Spectrums of tradeoffs

In most systems you are sacrificing all three of consistency, availability, and partition tolerance to some degree if you look closely enough. For example, even a healthy machine is not 100% available, because even the fastest network request still has an irreducible delay of around a few hundred microseconds on today’s machines.

In practice, we ignore these vanishingly small inconsistencies or unavailabilities, but they still illustrate a general pattern: we can think of system health, availability, and consistency as spectrums rather than boolean options.

For example, if we say we choose availability over consistency, we really mean that we choose to make our system’s unavailability vanishingly small and that our system could be consistent, but not all the time. Indeed, if our hardware and network were both fast and extremely reliable we could enjoy both high consistency and high availability, but when things fail we need to decide which of consistency or availability we sacrifice to accommodate that failure.

We can also choose to sacrifice a non-trivial amount of both availability and consistency. Sometimes exclusively prioritizing one or the other is not the right engineering choice!

With those caveats out of the way, let’s view some common software engineering tradeoffs through the lens of the CAP theorem.

Monorepo vs. Polyrepo

In revision control systems, a monorepo (syllabic abbreviation of monolithic repository) is a software development strategy where code for many projects are stored in the same repository

Source: Monorepo - Wikipedia

A “polyrepo” is the opposite software development strategy where each project gets a different source repository. In a monorepo, projects depend on each other by their relative paths. In a polyrepo, a project can depend on another project by referencing a specific release/revision/build of the dependency.

The tradeoff between a monorepo and a polyrepo is a tradeoff between consistency and availability. A monorepo prioritizes consistency over availability. Conversely, a polyrepo prioritizes availability over consistency.

To see why, let’s pretend that project A depends on project B and we wish to make a breaking change to project B that requires matching fixes to project A. Let’s also assume that we have some sort of continuous integration that ensures that the master branch of any repository must build and pass tests.

In a polyrepo, we can make the breaking change to the master branch of project B before we are prepared to make the matching fix to project A. The continuous integration that we run for project B’s repository does not check that other “downstream” projects that depend on B will continue to build if they incorporate the change. In this scenario, we’ve deferred the work of integrating the two projects together and left the system in a state where the master branch of project B is not compatible with the master branch of project A. (Note: the master branch of project A may still build and pass tests, but only because it depends on an older version of project B).

In a monorepo, we must bundle the breaking change to project B and the fix to project A in a single logical commit to the master branch of our monorepo. The continuous integration for the monorepo prevents us from leaving the master branch in a state where some of the projects don’t build. Fixing these package incompatibilities up-front will delay merging work into the master branch (i.e. sacrificing availability of our work product) but imposing this restriction ensures that the entire software engineering organization has a unified view of the codebase (i.e. preserving consistency).
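The two scenarios above can be sketched as a small model. This is only an illustration under simplifying assumptions: a project's API is reduced to a set of exported names, "builds" means "every name A uses is still exported by B", and all class and function names here are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProjectB:
    exports: frozenset  # names that B's public API provides


@dataclass(frozen=True)
class ProjectA:
    uses: frozenset  # names that A calls from B


def builds(a: ProjectA, b: ProjectB) -> bool:
    """A builds against B only if B still exports everything A uses."""
    return a.uses <= b.exports


class Polyrepo:
    """Separate repositories; A pins a specific revision of B."""

    def __init__(self, a: ProjectA, b: ProjectB):
        self.a_master = a
        self.b_revisions = [b]  # history of B's master branch
        self.a_pin = 0          # index of the B revision that A depends on

    def merge_to_b(self, new_b: ProjectB) -> bool:
        # B's CI checks B in isolation, so a breaking change still merges.
        self.b_revisions.append(new_b)
        return True

    def a_ci_passes(self) -> bool:
        # A's CI builds against the pinned (possibly old) revision of B.
        return builds(self.a_master, self.b_revisions[self.a_pin])

    def masters_compatible(self) -> bool:
        return builds(self.a_master, self.b_revisions[-1])


class Monorepo:
    """One repository; CI builds every project at every commit."""

    def __init__(self, a: ProjectA, b: ProjectB):
        self.a, self.b = a, b

    def merge(self, new_a: ProjectA, new_b: ProjectB) -> bool:
        # The commit is rejected unless all projects build together.
        if not builds(new_a, new_b):
            return False
        self.a, self.b = new_a, new_b
        return True


old_a, old_b = ProjectA(frozenset({"old_name"})), ProjectB(frozenset({"old_name"}))
new_a, new_b = ProjectA(frozenset({"new_name"})), ProjectB(frozenset({"new_name"}))

poly = Polyrepo(old_a, old_b)
poly.merge_to_b(new_b)              # available: the change lands right away
drifted = not poly.masters_compatible()

mono = Monorepo(old_a, old_b)
rejected = not mono.merge(old_a, new_b)  # consistent: B alone cannot land
accepted = mono.merge(new_a, new_b)      # both changes land together
```

In the polyrepo model, A's own CI keeps passing only because of the pin, which is the situation described in the note above.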

To make the analogy precise, let’s revisit the original definitions of consistency, availability, and partition tolerance:

Consistency - Every read receives the most recent write or an error

Availability - Every request receives a (non-error) response - without the guarantee that it contains the most recent write

Partition tolerance - The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes

… and change them to reflect the metaphor of a distributed team of developers collaborating via GitHub/GitLab:

Consistency - Every git pull receives the latest versions of all dependencies

Availability - Every pull request succeeds - without the guarantee that it contains the latest versions of all dependencies

Partition tolerance - git operations continue to work even if GitHub/GitLab is unavailable

Trunk-based development vs. Long-lived branches

Our previous scenario assumed that each repository was using “trunk-based development”, defined as:

A source-control branching model, where developers collaborate on code in a single branch called “trunk” [and] resist any pressure to create other long-lived development branches by employing documented techniques.

Source: trunkbaseddevelopment.com

The opposite of trunk-based development is the use of “long-lived branches” other than the master branch (i.e. the “trunk” branch).

Here are some examples of long-lived branches you’ll commonly find in the wild:

A develop branch that is used as the base branch of pull requests. This develop branch is periodically merged into master (typically at release boundaries)
Release branches that are supported for months or years (i.e. long-term support releases)
Feature branches that people work on for an extended period of time before merging their work into master

The choice between trunk-based development and long-lived branches is a choice between consistency and availability. Trunk-based development prioritizes consistency over availability. Long-lived branches prioritize availability over consistency.

To see why, imagine merging a feature branch over a year old back into the master branch. You’ll likely run into a large number of merge conflicts because up until now you sacrificed consistency by basing your work on an old version of the master branch. However, perhaps you would have slowed down your iteration speed (i.e. sacrificing availability of your local work product) if you had to ensure that each of your commits built against the latest master.
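As a rough sketch of why older branches tend to conflict more, we can approximate a "conflict" as a file edited on both the feature branch and master since the branch point. This is an assumption made for illustration: real git conflicts are detected at a finer granularity than whole files, and `conflicting_files` and the file names below are hypothetical.

```python
def conflicting_files(branch_changes, master_commits):
    """Return the files changed on both sides since the branch point.

    branch_changes: set of file names edited on the feature branch
    master_commits: list of sets, one per commit that landed on master
                    after the feature branch was created
    """
    changed_on_master = set().union(*master_commits)
    return branch_changes & changed_on_master


# Hypothetical history of master after the branch point, oldest first.
master = [{"api.py"}, {"db.py"}, {"api.py", "cli.py"}, {"auth.py"}]
feature = {"auth.py", "db.py", "docs.md"}

# Merging after one master commit versus after all four.
merged_early = conflicting_files(feature, master[:1])
merged_late = conflicting_files(feature, master)
```

The longer the branch lives, the more of master's history it has to reconcile with, so the overlap can only stay the same or grow.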

You might notice that trunk-based development vs. long-lived branches closely parallels monorepo vs. polyrepo. Indeed, organizations that prefer monorepos also tend to prefer trunk-based development because they both reflect the same preference for developers sharing a unified view of the codebase. Conversely, organizations that prefer polyrepos also tend to prefer long-lived branches because both choices emerge from the same preference to prioritize availability of developers’ work product. These are not perfect correlations, though.

Continuous integration vs. Test team

Continuous Integration (CI) is a development practice that requires developers to integrate code into a shared repository several times a day. Each check-in is then verified by an automated build, allowing teams to detect problems early.

Source: thoughtworks.com - Continuous Integration

So far we’ve been assuming the use of continuous integration to ensure that master stays “green”, but not all organizations operate that way. Some don’t use continuous integration and instead rely on a test team to identify integration issues.

You can probably guess where this is going:

Continuous integration prioritizes consistency over availability
Use of a test team prioritizes availability over consistency

The more you rely on continuous integration, the more you need to catch (and fix) errors up front since the error-detection process is automated. The more you rely on a test team, the more developers tend to defer detection of errors and bugs, leaving the system in a potentially buggy state for features not covered by automated tests.

Organizations that use a test team prioritize availability of developers’ work product, but at the expense of possibly deferring consistency between components system-wide. Conversely, organizations that rely on continuous integration prioritize consistency of the fully integrated system, albeit sometimes at the expense of the progress of certain components.

Spectrums

Remember that each one of these choices is really a spectrum. For example:

Many monorepos are partially polyrepos, too, if you count their third-party dependencies. The only true monorepo is one with no external dependencies.
The distinction between trunk-based development and long-lived branches is a matter of degree. There isn’t a bright line that separates a short-lived branch from a long-lived one.
Many organizations use a mix of continuous integration (to catch low-level issues) and a test team (to catch high-level issues). Also, every organization has an implicit test team: their customers, who will report bugs that even the best automation will miss.

Conclusion

This post is not an exhaustive list of software engineering tradeoffs that mirror the CAP theorem. I’ll wager that as you read this several other examples came to mind. Once you recognize the pattern you will begin to see this tension between consistency and availability everywhere (even outside of software engineering).

Hopefully this post can help provide a consistent language for talking about these choices so that people can frame these discussions in terms of their organization’s core preference for consistency vs. availability. For example, maybe in the course of reading this you noticed that your organization prefers availability in some cases but consistency in others. Maybe that’s a mistake you need to correct or maybe it’s an inevitability since we can never truly have 100% availability or 100% consistency.

You might be interested in what happens if you take availability or consistency to their logical conclusion. For example, Kent Beck experiments with an extreme preference for consistency over availability in test && commit || revert. Or to put it more humorously:

“Don’t say you practice Continuous Integration if your editor isn’t autosaving to production”
— Epic beard dude (@davigoli) January 20, 2019

On the other hand, if you prioritize availability over consistency at all costs you get … the open source ecosystem.

This is not the first post exploring the relationship between the CAP theorem and software development. For example, Jessica Kerr already explored this idea of treating teams as distributed systems in Tradeoffs in Coordination Among Teams.

2 comments:

Derek Elkins, June 16, 2019 at 1:48 PM
The PACELC (https://en.wikipedia.org/wiki/PACELC_theorem) formulation may be closer to what you want, particularly the trade-off between latency and consistency even without partitions.

There's a discrepancy in the description of CAP from Wikipedia. There is no restriction to reads for anything. All operations (reads or writes) need to succeed. This arguably doesn't make a difference for Consistency, though you can have operations like increment which do combined reads and writes. For availability, it's a bit more important.

The major issue here is your argument for why we can't have 100% available systems. You talk about minimum network latencies, but this is irrelevant. A Highly Available (i.e. AP) system should work even when every network link is down. That is, every valid operation should succeed even if the network is down. (You can imagine a partition that separates every node from every other node.) This means any Highly Available system needs to work "offline". The way this is accomplished is generally by having some local copy that you can operate against. This is a place where the distinction in PACELC comes in. Even if there isn't a network partition, we could still decide to operate from a local copy to reduce latency. (This is also where it's important that Availability talks about all operations and not just read operations. I should be able to make updates without errors to this local copy too.)

initialed85, June 23, 2019 at 5:36 PM
I really enjoyed your various real-world analogies for CAP theorem, awesome article

Sunday, June 16, 2019