Saturday, December 6, 2014

A very general API for relational joins

Maps and tuples are useful data types for modeling relational operations. For example, suppose we have the following table, indexed by the Id column:

| Id | First Name | Last Name |
|----|------------|-----------|
|  0 | Gabriel    | Gonzalez  |
|  1 | Oscar      | Boykin    |
|  2 | Edgar      | Codd      |

We can model that as a Map where the key is the Id column and the value is a tuple of FirstName and LastName:

import Data.Map (Map) -- from the `containers` library
import qualified Data.Map as Map

type Id        = Int
type FirstName = String
type LastName  = String

names :: Map Id (FirstName, LastName)
names = Map.fromList
    [ (0, ("Gabriel", "Gonzalez"))
    , (1, ("Oscar"  , "Boykin"  ))
    , (2, ("Edgar"  , "Codd"    ))
    ]

Now suppose we have another table containing Twitter handles, also indexed by Id:

| Id | Twitter Handle |
|----|----------------|
| 0  |  GabrielG439   |
| 1  |  posco         |
| 3  |  avibryant     |

We can also encode that as a Map:

type TwitterHandle = String

handles :: Map Id TwitterHandle
handles = Map.fromList
    [ (0, "GabrielG439")
    , (1, "posco"      )
    , (3, "avibryant"  )
    ]

One of the nice properties of Maps is that you can join them together:

import Control.Applicative

-- I'm using `join2` to avoid confusion with `Control.Monad.join`
join2 :: Map k v1 -> Map k v2 -> Map k (v1, v2)
join2 = ... -- Implementation elided for now

What if we could generalize join2 to work on types other than Map. Perhaps we could use the Applicative interface to implement join2:

join2 :: Applicative f => f a -> f b -> f (a, b)
join2 = liftA2 (,)

However, the Map type cannot implement Applicative in its current form. The reason why is that there is no sensible definition for pure:

pure :: v -> Map k v
pure = ???

This would require a Map that was defined for every key, which we cannot encode. Or can we?

Tables

Well, who says we need to use the Map type from containers? What if I were to encode my Map this way:

import Prelude hiding (lookup)

-- | A map encoded as a lookup function
newtype Table k v = Table { lookup :: k -> Maybe v }

-- | Encode a traditional map as a lookup function
from :: Ord k => Map k v -> Table k v
from m = Table (\k -> Map.lookup k m)

This new type of Map only permits a single operation: lookup, but because we constrain our API to this single operation we can now implement the full Applicative interface:

instance Functor (Table k) where
    fmap f (Table g) = Table (\k -> fmap f (g k))
           -- Same as: Table (fmap (fmap f) g)

instance Applicative (Table k) where
    pure v            = Table (\k -> Just v)
            -- Same as: Table (pure (pure v))

    Table f <*> Table x = Table (\k -> f k <*> x k)
              -- Same as: Table (liftA2 (<*>) f x)

We can promote conventional Maps to this new Table type using the above from function:

names' :: Table Id (FirstName, LastName)
names' = from names

handles' :: Table Id TwitterHandle
handles' = from handles

... and now the more general Applicative join2 will work on these two tables:

>>> let table = join2 names' handles'
>>> :type table
table :: Table Id ((FirstName, LastName), TwitterHandle)
>>> lookup table 0
Just (("Gabriel","Gonzalez"),"GabrielG439")
>>> lookup table 1
Just (("Oscar","Boykin"),"posco")
>>> lookup table 2
Nothing

However, in its present form we can't dump the table's contents because we don't know which keys are present in the table. Let's fix that by adding an additional field to the Table type listing the keys. We will treat functions as being defined for all keys:

import Data.Set (Set)
import qualified Data.Set as Set 

data Keys k = All | Some (Set k)

instance Ord k => Num (Keys k) where
    fromInteger 0         = Some Set.empty
    fromInteger n | n > 0 = All

    All     + _       = All
    _       + All     = All
    Some s1 + Some s2 = Some (Set.union s1 s2)

    All     * ks      = ks
    ks      * All     = ks
    Some s1 * Some s2 = Some (Set.intersection s1 s2)

-- | A map encoded as a lookup function
data Table k v = Table
    { keys   :: Keys k
    , lookup :: k -> Maybe v
    }

-- | Encode a traditional map as a lookup function
from :: Ord k => Map k v -> Table k v
from m = Table
    { keys   = Some (Set.fromList (Map.keys m))
    , lookup = \k -> Map.lookup k m
    }

Even after extending Table with keys we can still implement the Applicative interface:

instance Functor (Table k) where
    fmap f (Table ks g) = Table ks (fmap (fmap f) g)

instance Ord k => Applicative (Table k) where
    pure v =
        Table 1 (pure (pure v))

    Table s1 f <*> Table s2 x =
        Table (s1 * s2) (liftA2 (<*>) f x)

... and now we can add a Show instance, too!

instance (Show k, Show v) => Show (Table k v) where
    show (Table ks f) = case ks of
        All    -> "<function>"
        Some s -> unlines (do
            k <- Set.toList s
            let Just v = f k
            return (show (k, v)) )

Let's give it a test drive:

>>> names'
(0,("Gabriel","Gonzalez"))
(1,("Oscar","Boykin"))
(2,("Edgar","Codd"))

>>> handles'
(0,"GabrielG439")
(1,"posco")
(3,"avibryant")

>>> join2 names' handles'
(0,(("Gabriel","Gonzalez"),"GabrielG439"))
(1,(("Oscar","Boykin"),"posco"))

So far, so good!

Alternative

However, we need to support more than just inner joins. We'd also like to support left, right, and outer joins, too.

Conceptually, a left join is one in which values from the right table may be optionally present. One way we could implement this would be to define a function that converts a finite map to a function defined on all keys. This function will return Nothing for keys not present in the original finite map and Just for keys that were present:

optional :: Table k v -> Table k (Maybe v)
optional (Table ks f) =
    Table All (\k -> fmap Just (f k) <|> pure Nothing)

Then we could define leftJoin in terms of join2 and optional:

leftJoin :: Table k a -> Table k b -> Table k (a, Maybe b)
leftJoin t1 t2 = join2 t1 (optional t2)

However, if we try to compile the above code, the compiler will give us a really interesting error message:

Ambiguous occurrence ‘optional’
It could refer to either ‘Main.optional’,
                      or ‘Control.Applicative.optional’

Apparently, Control.Applicative has an optional function, too. Let's pause to check out the type of this mysterious function:

optional :: Alternative f => f v -> f (Maybe v)

Wow! That type signature is suprisingly similar to the one we wrote. In fact, if Table k implemented the Alternative interface, the types would match.

Alternative is a type class (also provided by Control.Applicative) that greatly resembles the Monoid type class:

class Applicative f => Alternative f where
    empty :: f a

    (<|>) :: f a -> f a -> f a

... and the core Alternative laws are identical to the Monoid laws:

x <|> empty = x

empty <|> x = x

(x <|> y) <|> z = x <|> (y <|> z)

Let's dig even deeper and see how optional uses this Alternative type class:

optional v = fmap Just v <|> pure Nothing

Even the implementation is eerily similar! This strongly suggests that we should find a way to make our Table type implement Alternative. Perhaps something like this would work:

instance Ord k => Alternative (Table k) where
    empty =
        Table 0 (pure empty)

    Table ks1 f1 <|> Table ks2 f2 =
        Table (ks1 + ks2) (liftA2 (<|>) f1 f2)

Compare this to our Applicative instance (duplicated here):

instance Ord k => Applicative (Table k) where
    pure v =
        Table 1 (pure (pure v))

    Table s1 f <*> Table s2 x =
        Table (s1 * s2) (liftA2 (<*>) f x)

The Alternative instance mirrors our Applicative instance except that we:

  • replace pure v with empty
  • replace (<*>) with (<|>)
  • replace 1 with 0
  • replace (*) with (+)

... and what's really neat is that now the optional function from Control.Applicative behaves just like the original optional function we wrote. (Exercise: Use equational reasoning to verify this)

Derived joins

Armed with this Alternative instance, we can now implement leftJoin in terms of the optional provided by Control.Applicative:

leftJoin t1 t2 = join2 t1 (optional t2)

Sure enough, it works:

>>> leftJoin names' handles'
(0,(("Gabriel","Gonzalez"),Just "GabrielG439"))
(1,(("Oscar","Boykin"),Just "posco"))
(2,(("Edgar","Codd"),Nothing)

Let's check out the type that the compiler infers for leftJoin:

>>> :type leftJoin
leftJoin :: Alternative f => f a -> f b -> f (a, Maybe b)

Notice how there's no longer anything Table-specific about leftJoin. It works for anything that implements the Alternative interface. I could leftJoin two Maybes if I really wanted to:

>>> leftJoin (Just 1) (Just 2)
Just (1,Just 2)
>>> leftJoin (Just 1) Nothing
Just (1,Nothing)
>>> leftJoin Nothing (Just 1)
Nothing

... or two lists:

>>> leftJoin [1, 2] [3, 4]
[(1,Just 3),(1,Just 4),(1,Nothing),(2,Just 3),(2,Just 4),(2,
Nothing)]

In fact, I don't even really need specialized leftJoin or rightJoin functions. optional is sufficiently light-weight that I could inline a right join on the fly:

>>> join2 (optional names') handles'
(0,(Just ("Gabriel","Gonzalez"),"GabrielG439"))
(1,(Just ("Oscar","Boykin"),"posco"))
(3,(Nothing,"avibryant"))

What happens if I try to do an "outer join"?

>>> -- DISCLAIMER: Technically not the same as an SQL outer join
>>> let o = join2 (optional names') (optional handles')
>>> o
<function>

The above "outer join" is defined for all keys (because both sides are optional), so we get back a function! While we can't list the Table (because it's conceptually infinite), we can still perform lookups on it:

>>> lookup o 0
Just (Just ("Gabriel","Gonzalez"),Just "GabrielG439")
>>> lookup o 2
Just (Just ("Edgar","Codd"),Nothing)
>>> lookup o 3
Just (Nothing,Just "avibryant")
>>> lookup o 4
Just (Nothing, Nothing)

... and if we were to join our "infinite" table against a finite table, we get back a finite table (Exercise: Try it! Define a new finite table to join against o and see what happens)

What's nice about optional is that we can easily left-join or right-join in multiple tables at a time. If I had four tables of types:

t1 :: Table k a
t2 :: Table k b
t3 :: Table k c
t4 :: Table k d

... I could left join t2, t3, and t4 into t1 by just writing:

liftA4 (,,,) t1 (optional t2) (optional t3) (optional t4)
    :: Table k (a, Maybe b, Maybe c, Maybe d)

Now that I think about it, I don't even really need to provide join2/join3/join4/join5 since they are not much shorter than using the liftA family of functions in Control.Applicative:

-- Exercise: What would `join1` be?
join2 = liftA2 (,)
join3 = liftA3 (,,)
join4 = liftA4 (,,,)
join5 = liftA5 (,,,,)

In other words, I can implement almost any imaginable join just by using liftA{n} and some permutation of optionals. I don't even know what I'd call this join:

liftA5 (,,,,) t1 (optional t2) t3 (optional t4) t5

... but the beauty is that I don't have to give a name for it. I can easily write anonymous joins on the fly using the Control.Applicative module. Moreover, the above code will work for anything that implements the Alternative interface.

Conclusion

Control.Applicative provides a very general API for relational joins: the Alternative type class (which includes Applicative, since Applicative is a super-class of Alternative). Perhaps Control.Applicative could be improved slightly by providing the join{n} family of functions listed above, but it's still highly usable in its present state.

Note that this trick only works for relational abstractions embedded within Haskell. This API can be generalized for external relational data stores (i.e. Postgres), which I will cover in a subsequent post.

2 comments:

  1. I've made something very similar to these before to describe the interface to a database system. I called the composition of a function with Maybe a "partial function" (which unfortunately is strongly associated with computability) and a function with known keys an "enumerable partial function". The known keys can be viewed as a meet semi-lattice with an absorbing element bottom. This doesn't cover All keys, which I handled by considering any bounded lattice where we can test for bottom. A nice thing about working with lattices is when implementing the database you can define generalized indexed search trees elegantly in terms of a lattice equipped with a measure. In a way you are breaking each relation into the product of a lattice that can be reasoned about and a function that can't.

    What I didn't realize, and your post demonstrates very nicely, is that the bounded lattice gives rise to an Alternative instance.

    ReplyDelete