Wednesday, May 19, 2021

Module organization guidelines for Haskell projects

modules

This post collects a random assortment of guidelines I commonly share for how to organize Haskell projects.

Organize modules “vertically”, not “horizontally”

The glib summary of this rule is: don’t create a “Types” or “Constants” module.

“Vertically” organized modules are modules that group related functionality within the same module. For example, vertically-oriented modules for a simple interpreter might be:

  • A Syntax module

    … which contains the concrete syntax tree type and utilities for traversing, viewing, or editing that tree.

  • A Parsing module

    … which contains (or imports/re-exports) the parser type, a parser for the syntax tree, and error messages specific to parsing.

  • An Infer module

    … which contains the the type inference logic, exception types, and error messages specific to type-checking.

  • An Evaluation module

    … which logic for evaluating an expression, including possibly a separate data structure for a fully-evaluated abstract syntax tree.

  • A Pretty module

    … which specifies how to pretty-print or otherwise format expressions for display.

“Horizontally” organized modules mean that you organize code into modules based on which language features or imports the code relies upon. For example, horizontally-oriented modules for the same interpreter might be:

  • A Types module

    … which contains almost all types, including the concrete syntax tree, abstract syntax tree, parsing-related types, and exception types.

  • A Lib module

    … which contains almost all functions, including the parsers, type inference code, evaluator, and prettyprinter.

  • A Constants module

    … which contains almost all constants (including all error messages, timeouts, and help text).

  • An App module

    … which contains the main entry point for the program.

There are a few reasons I prefer vertical module organization over horizontal module organization:

  • Vertically-organized modules are easier to split into smaller packages

    For example, in a vertically-organized project I could separate out the Syntax, Parser, and Pretty modules into a separate standalone package, which could be used by other people to create an automatic code formatter for my language without having to depend on the type-checking or evaluation logic.

    Conversely, there would be little benefit in separating out a Types module from the equivalent horizontally-organized package. Typically, horizontal modules are so tightly coupled to one another that no subset of the modules is useful in isolation.

    The separability of vertical modules is an even bigger feature for proprietary projects that aspire to eventually open source their work. Vertically-organized projects are easier to open source a few modules at a time while keeping the proprietary pieces internal.

  • Vertically-organized modules tend to promote more granular and incremental build graphs

    In a horizontally-organized project, each new type you add to the Types module forces a rebuild of the entire package. In a vertically-organized project, if I completely rewrite the type-checking logic then only other modules that depend on type-checking will rebuild (and typically very few depend on type-checking).

  • Vertically-organized modules tend to group related changes

    A common issue in a horizontally-organized project is that every change touches almost every module, making new contributions harder and making related functionality more difficult to discover. In a vertically-organized project related changes tend to fall within the same module.

Naming conventions

I like to use the convention that the default module to import is the same as the package name, except replacing - with . and capitalizing words.

For example, if your package name is foo-bar-baz, then the default module the user imports to use your package is Foo.Bar.Baz.

Packages following this module naming convention typically do not have module names beginning with Control. or Data. prefixes (unless the package name happens to begin with a control- or data- prefix).

There are a few reasons I suggest this convention:

  • Users can easily infer which module to import from the package name

  • It tends to lead to shorter module names

    For example, the prettyprinter package recently switched to this idiom, which changed the default import from Data.Text.Prettyprint.Doc to Prettyprinter.

  • It reduces module naming clashes between packages

    The reasoning is that you are already claiming global namespace when naming a package, so should why not also globally reserve the module of the same name, too?

    However, this won’t completely eliminate the potential for name clashes for other non-default modules that your package exports.

The “God” library stanza

This is a tip for proprietary projects only: put all of your project’s code into one giant library stanza in your .cabal file, including the entrypoint logic (like your command-line interface), tests, and benchmarks. Then every other stanza in the .cabal file (i.e. the executables, test suites, and benchmarks) should just be a thin wrapper around something exported from one of your “library” modules.

For example, suppose that your package is named foo which builds an executable named bar. Your foo.cabal file would look like this (with only the relevant parts shown):

name: foo
…

library
  hs-source-dirs: src
  exposed-modules:
    Foo.Bar
  …

executable bar
  hs-source-dirs: bar
  main-is: Main.hs
  …

… where src/Foo/Bar.hs would look like this:

-- ./src/Foo/Bar.hs
module Foo.Bar where



main :: IO ()
main =-- Your real `main` goes here

… and bar/Main.hs is a trivial wrapper around Foo.Bar.main:

-- ./bar/Main.hs

module Main where

import qualified Foo.Bar

main :: IO ()
main = Foo.Bar.main

This tip specifically works around cabal repl’s poor support for handling changes that span multiple project stanzas (both cabal v1-repl and cabal v2-repl appear to have the problem).

To illustrate the issue, suppose that you use cabal repl to load the executable logic for the project like this:

$ cabal repl exe:bar

*Main>

Now if you change the Foo.Bar module and :reload the REPL, the REPL will not reflect the changes you made. This is a pain whenever you need to test changes that span your library and an executable, your library and a test suite, or your library and a benchmark.

Also, if you load an executable / test suite / benchmark into the REPL that depends on a separate library stanza then cabal repl will force you to generate object code for the library stanza, which is slow. Contrast that with using cabal repl to only load the library stanza, which will be faster because it won’t generate object code.

Moreover, ghcid uses cabal repl to power its fast type-checking loop, which means that ghcid also does not work well if you need to quickly switch between changes to the library stanza and other project stanzas.

The fix to all of these problems is: put all of your project’s logic into the library stanza and use only the library stanza as basis for your interactive development. Everything else (your executables, your test suites, and your benchmarks) is just a trivial wrapper around something exported from the library.

I don’t recommend this solution for open source projects, though. If you do this for a public package then your end users will hate you because your package’s library section will depend on test packages or benchmarking packages that can’t be disabled. In contrast, proprietary codebases rarely care about gratuitous dependencies (in my experience).

Conclusion

Those are all of the tips I can think of at the moment. Leave a comment if you think I missed another common practice.

Wednesday, May 5, 2021

The trick to avoid deeply-nested error-handling code

either-trick

This post documents a small trick that I use to avoid deeply-nested error-handling code. This trick is a common piece of Haskell folklore that many people either learn from others or figure out on their own, but I’m not sure what the official name of this trick is (so I had difficulty searching for prior art explaining this trick). However, I’ve taught this trick to others enough times that I think it merits a blog post of its own.

This post assumes some familiarity with Haskell’s Either type and do notation, but the Appendix at the end of the post will walk through all of the details using equational reasoning if you’re having trouble following along with how things work.

The motivating example

Let’s begin from the following contrived Either-based example that uses deeply nested error-handling:

{-# LANGUAGE NamedFieldPuns #-}

import Text.Read (readMaybe)

data Person = Person { age :: Int, alive :: Bool } deriving (Show)

example :: String -> String -> Either String Person
example ageString aliveString = do
    case readMaybe ageString of
        Nothing -> do
            Left "Invalid age string"

        Just age -> do
            if age < 0
                then do
                    Left "Negative age"

                else do
                    case readMaybe aliveString of
                        Nothing -> do
                            Left "Invalid alive string"

                        Just alive -> do
                            return Person{ age, alive }

… which we can use like this:

>>> example "24" "True"
Right (Person {age = 24, alive = True})

>>> example "24" "true"
Left "Invalid alive string"

>>> example "" "True"
Left "Invalid age string"

>>> example "-5" "True"
Left "Negative age"

Notice how the above coding style increases the nesting / indentation level every time we parse or validate the input in some way. We could conserve indentation by using 2-space indents or indenting only once for each level of nesting:

{-# LANGUAGE NamedFieldPuns #-}

import Text.Read (readMaybe)

data Person = Person { age :: Int, alive :: Bool } deriving (Show)

example :: String -> String -> Either String Person
example ageString aliveString = case readMaybe ageString of
  Nothing  -> Left "Invalid age string"
  Just age -> if age < 0
    then Left "Negative age"
    else case readMaybe aliveString of
      Nothing    -> Left "Invalid alive string"
      Just alive -> return Person{ age, alive }

However, I think most people writing code like that would prefer to keep the indentation level stable, no matter how many validations the code requires.

The trick

Fortunately, we can avoid nesting the code with the following change:

{-# LANGUAGE NamedFieldPuns #-}

import Text.Read (readMaybe)

data Person = Person { age :: Int, alive :: Bool } deriving (Show)

example :: String -> String -> Either String Person
example ageString aliveString = do
    age <- case readMaybe ageString of
        Nothing  -> Left "Invalid age string"
        Just age -> return age

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe aliveString of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

Here we make use of several useful properties:

  • return in Haskell does not short-circuit or exit from the surrounding code

    In fact, some Haskell programmers prefer to use pure (a synonym for return) to better convey the absence of short-circuiting behavior:

    {-# LANGUAGE NamedFieldPuns #-}
    
    import Text.Read (readMaybe)
    
    data Person = Person { age :: Int, alive :: Bool } deriving (Show)
    
    example :: String -> String -> Either String Person
    example ageString aliveString = do
        age <- case readMaybe ageString of
            Nothing  -> Left "Invalid age string"
            Just age -> pure age
    
        if age < 0
            then Left "Negative age"
            else pure ()
    
        alive <- case readMaybe aliveString of
            Nothing    -> Left "Invalid alive string"
            Just alive -> pure alive
    
        pure Person{ age, alive }
  • Left does short-circuit from the surrounding code

    In fact, we can define a synonym for Left named throw if we want to better convey the presence of short-circuiting behavior::

    {-# LANGUAGE NamedFieldPuns #-}
    
    import Text.Read (readMaybe)
    
    data Person = Person { age :: Int, alive :: Bool } deriving (Show)
    
    example :: String -> String -> Either String Person
    example ageString aliveString = do
        age <- case readMaybe ageString of
            Nothing  -> throw "Invalid age string"
            Just age -> pure age
    
        if age < 0
            then throw "Negative age"
            else pure ()
    
        alive <- case readMaybe aliveString of
            Nothing    -> throw "Invalid alive string"
            Just alive -> pure alive
    
        pure Person{ age, alive }
    
    throw :: String -> Either String a
    throw = Left
  • Left / throw “return” any type of value

    If you look at the type of throw, the “return” type is a polymorphic type a because throw short-circuits (and therefore doesn’t need to return a real value of that type).

    This is why the type checker doesn’t complain when we do this:

    age <- case readMaybe ageString of
        Nothing  -> throw "Invalid age string"
        Just age -> pure age

    … because both branches of the case expression type-check as an expression that “return”s an Int. The left branch type-checks as a branch that returns an Int because it cheats and never needs to return anything and the right branch returns a bona-fide Int (the age).

    We can make this explicit by giving both branches of the case expression an explicit type annotation:

    age <- case readMaybe ageString of
        Nothing  -> throw "Invalid age string" :: Either String Int
        Just age -> pure age                   :: Either String Int

    This means that both branches of a case expression will always share the same return type if at least one branch is a Left / throw. Similarly, both branches of an if expression will share the same return type if at least one branch is a Left / throw:

    if age < 0
        then throw "Negative age" :: Either String ()
        else pure ()              :: Either String ()
  • We can return a value from a case expression

    New Haskell programmers might not realize that case expressions can return a value (just like any other expression), which means that their result can be stored as a new value within the surrounding scope:

    age <- case readMaybe ageString of
        Nothing  -> throw "Invalid age string" :: Either String Int
        Just age -> pure age                   :: Either String Int

    The type-checker doesn’t mind that only the second branch returns a valid age because it knows that the outer age is unreachable if the first branch short-circuits. This understanding is not built into the compiler, but is rather an emergent property of how do notation works for Either. See the Appendix for a fully-worked equational reasoning example showing why this works.

Combinators

You can build upon this trick by defining helpful utility functions to simplify things further. For example, I sometimes like to define the following helper function:

orDie :: Maybe a -> String -> Either String a
Just a  `orDie` _      = return a
Nothing `orDie` string = Left string

{- Equivalent, more explicit, implementation:

maybe `orDie` string =
    case maybe of
        Nothing -> Left string
        Just a  -> return a
-}

… which you can use like this:

{-# LANGUAGE NamedFieldPuns #-}

import Text.Read (readMaybe)

data Person = Person { age :: Int, alive :: Bool } deriving (Show)

orDie :: Maybe a -> String -> Either String a
Just a  `orDie` _      = Right a
Nothing `orDie` string = Left string

example :: String -> String -> Either String Person
example ageString aliveString = do
    age <- readMaybe ageString `orDie` "Invalid age string"

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- readMaybe aliveString `orDie` "Invalid alive string"

    return Person{ age, alive }

… which is even more clear (in my opinion).

Conclusion

That is the entirety of the trick. You can return values from case expressions to avoid deeply-nesting your Either code, or you can define utility functions (such as orDie) which do essentially the same thing.

This trick applies equally well to any other Monad that supports some notion of short-circuiting on failure, such as ExceptT (from transformers / mtl). The only essential ingredient is some throw-like function that short-circuits and therefore has a polymorphic return type.

Appendix

You can build a better intuition for why this works using equational reasoning. We’ll begin from an example usage of our function and at each step of the process we will either desugar the code or substitute one subexpression with another equal subexpression.

In all of the examples, we’ll begin from this definition for the example function:

example :: String -> String -> Either String Person
example ageString aliveString = do
    age <- case readMaybe ageString of
        Nothing  -> Left "Invalid age string"
        Just age -> return age

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe aliveString of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

Let’s first begin with the example that fails the most quickly:

example "" "True"

-- Substitute `example` with its definition:
--
-- … replacing `ageString` with `""`
--
-- … replacing `aliveString` with `"True"`
do  age <- case readMaybe "" of
        Nothing  -> Left "Invalid age string"
        Just age -> return age

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- Definition of `readMaybe` (not shown):
--
-- (readMaybe "" :: Maybe Int) = Nothing
do  age <- case Nothing of
        Nothing  -> Left "Invalid age string"
        Just age -> return age

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- Simplify the first `case` expression
do  age <- Left "Invalid age string"

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- Desugar `do` notation one step
--
-- (do age <- m; n) = (m >>= \age -> n)
Left "Invalid age string" >>= \age ->
    do  if age < 0
            then Left "Negative age"
            else return ()

        alive <- case readMaybe "True" of
            Nothing    -> Left "Invalid alive string"
            Just alive -> return alive

        return Person{ age, alive }

-- Definition of `>>=` for `Either`
--
-- (Left x >>= _) = (Left x)
Left "Invalid age string"

… and we’re done! The key step is the last bit where we simplify (Left … >>= _) to (Left …), which is how Either short-circuits on failure. Because this simplification does not use the downstream code everything type-checks just fine even though we never supply a valid age.

For completeness, let’s also walk through the example where everything succeeds:

example "24" "True"

-- Substitute `example` with its definition:
--
-- … replacing `ageString` with `"24"`
--
-- … replacing `aliveString` with `"True"`
do  age <- case readMaybe "24" of
        Nothing  -> Left "Invalid age string"
        Just age -> return age

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- Definition of `readMaybe` (not shown):
--
-- (readMaybe "24" :: Maybe Int) = (Just 24)
do  age <- case Just 24 of
        Nothing  -> Left "Invalid age string"
        Just age -> return age

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- Simplify the first `case` expression
do  age <- return 24

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- return = Right
do  age <- Right 24

    if age < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age, alive }

-- Desugar `do` notation one step
--
-- (do age <- m; n) = (m >>= \age -> n)
Right 24 >>= \age ->
    do  if age < 0
            then Left "Negative age"
            else return ()

        alive <- case readMaybe "True" of
            Nothing    -> Left "Invalid alive string"
            Just alive -> return alive

        return Person{ age, alive }

-- Definition of `>>=` for `Either`
--
-- (Right x >>= f) = (f x)
--
-- This means that we substitute `age` with `24`
do  if 24 < 0
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- (24 < 0) = False
do  if False
        then Left "Negative age"
        else return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- (if False then l else r) = r
do  return ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- return = Right
do  Right ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- (do m; n) = (do _ <- m; n)
do  _ <- Right ()

    alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- Desugar `do` notation one step
--
-- (do age <- m; n) = (m >>= \age -> n)
Right () >>= _ -> 
    do  alive <- case readMaybe "True" of
            Nothing    -> Left "Invalid alive string"
            Just alive -> return alive

        return Person{ age = 24, alive }

-- Definition of `>>=` for `Either`
--
-- (Right x >>= f) = (f x)
--
-- This means that we substitute `_` (unused) with `()`
do  alive <- case readMaybe "True" of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- Definition of `readMaybe` (not shown):
--
-- (readMaybe "True" :: Maybe Bool) = (Just True)
do  alive <- case Just True of
        Nothing    -> Left "Invalid alive string"
        Just alive -> return alive

    return Person{ age = 24, alive }

-- Simplify the `case` expression
do  alive <- return True

    return Person{ age = 24, alive }

-- return = Right
do  alive <- Right True

    return Person{ age = 24, alive }

-- Desugar `do` notation one step
--
-- (do age <- m; n) = (m >>= \age -> n)
Right True >>= \alive ->
    do  return Person{ age = 24, alive }

-- Definition of `>>=` for `Either`
--
-- (Right x >>= f) = (f x)
--
-- This means that we substitute `alive` with `True`
do  return Person{ age = 24, alive = True }

-- Desugar `do` notation
--
-- do m = m
return Person{ age = 24, alive = True }

-- return = Right
Right Person{ age = 24, alive = True }

As an exercise, you can try to extrapolate between those two examples to reason through what happens when we evaluate the remaining two examples which fail somewhere in between:

>>> example "24" "true"

>>> example "-5" "True"