Wednesday, May 6, 2015

Haskell content spinner

Recently somebody posted a template for generating blog comment spam, so I thought: "What sillier way to show how elegant Haskell is than generating comment spam?!"

The first "stanza" of the template looks like this:

{I have|I've} been {surfing|browsing} online 
more than {three|3|2|4} hours today, yet
I never found any interesting article like yours. 
{It's|It is}
pretty worth enough for me. {In my 
opinion|Personally|In my view},
if all {webmasters|site owners|website owners|web 
owners} and bloggers made good content as you 
did, the {internet|net|web} will be {much more|a 
lot more} useful than ever before.|
I {couldn't|could not} {resist|refrain from} 
commenting.
{Very well|Perfectly|Well|Exceptionally well} 
written!|
{I will|I'll} {right away|immediately} {take 
hold of|grab|clutch|grasp|seize|snatch} your 
{rss|rss feed} as I {can not|can't} {in 
finding|find|to find} your {email|e-mail} 
subscription {link|hyperlink} or 
{newsletter|e-newsletter} service.
Do {you have|you've} any? {Please|Kindly} 
{allow|permit|let} me 
{realize|recognize|understand|recognise|know} {so 
that|in order that} I {may
just|may|could} subscribe. Thanks.|
{It is|It's} {appropriate|perfect|the best} 
time to
make some plans for the future and {it is|it's} 
time to be happy. 

Anything of the form {x|y|z} represents a choice between alternative text fragments x, y, and z. The above template has four large alternative comments to pick from, each with their own internal variations. The purpose of these alternatives is to evade simple spam detection algorithms, much like how some viruses evade the immune system by mutating antigens.

I wanted to write a Haskell program that selected a random template from one of the provided alternatives and came up with this:

{-# LANGUAGE OverloadedStrings #-}

import Control.Foldl (random)  -- Requires `foldl-1.0.10` or higher
import Turtle

main = do
    x <- foldIO spam random
    print x

spam :: Shell Text
spam =  -- 1st major template
       ""
     * ("I have" + "I've")
     *  " been "
     * ("surfing" + "browsing")
     *  " online more than "
     * ("three" + "3" + "2" + "4")
     *  " hours today, yet I never found any interesting article like yours. "
     * ("It's" + "It is")
     *  " pretty worth enough for me. "
     * ("In my opinion" + "Personally" + "In my view")
     *  ", if all "
     * ("webmasters" + "site owners" + "website owners" + "web owners")
     *  " and bloggers made good content as you did, the "
     * ("internet" + "net" + "web")
     *  " will be "
     * ("much more" + "a lot more")
     *  " useful than ever before."

       -- 2nd major template
   +   " I "
     * ("couldn't" + "could not")
     *  " "
     * ("resist" + "refrain from")
     *  " commenting. "
     * ("Very well" + "Perfectly" + "Well" + "Exceptionally well")
     *  " written!"

        -- 3rd major template
   +    " "
     * ("I will" + "I'll")
     *  " "
     * ("right away" + "immediately")
     *  " "
     * ("take hold of" + "grab" + "clutch" + "grasp" + "seize" + "snatch")
     *  " your "
     * ("rss" + "rss feed")
     *  " as I "
     * ("can not" + "can't")
     *  " "
     * ("in finding" + "find" + "to find")
     *  " your "
     * ("email" + "e-mail")
     *  " subscription "
     * ("link" + "hyperlink")
     *  " or "
     * ("newsletter" + "e-newsletter")
     *  " service. Do "
     * ("you have" + "you've")
     *  " any? "
     * ("Please" + "Kindly")
     *  " "
     * ("allow" + "permit" + "let")
     *  " me "
     * ("realize" + "recognize" + "understand" + "recognise" + "know")
     *  " "
     * ("so that" + "in order that")
     *  " I "
     * ("may just" + "may" + "could")
     *  " subscribe. Thanks."

        -- 4th major template
   +    " "
     * ("It is" + "It's")
     *  " "
     * ("appropriate" + "perfect" + "the best")
     *  " time to make some plans for the future and "
     * ("it is" + "it's")
     *  " time to be happy."

Conceptually, all I did to embed the template in Haskell was to:

  • add a quote to the beginning of the template: "
  • replace all occurences of { with "*(" (including quotes)
  • replace all occurences of } with ")*" (including quotes)
  • replace all occurences of | with "+" (including quotes)
  • add a quote to the end of the template: "

In fact, I mechanically transformed the template to Haskell code using simple sed commands within vi and then just formatted the result to be more readable.

Before explaining why this works, let's try our program out to verify that it works:

$ ghc -O2 spam.hs
$ ./spam
Just " I will right away grab your rss feed as I 
can not find your email subscription hyperlink or 
newsletter service. Do you've any? Please let me 
recognise in order that I could subscribe. 
Thanks."
$ ./spam
Just " I'll immediately seize your rss as I can not find 
your email subscription link or e-newsletter service. Do 
you have any? Please allow me realize in order that I may 
subscribe. Thanks."

You might wonder: how does the above program work?

Types

Let's begin from the type of the top-level utility named foldIO:

foldIO
    :: Shell a       -- A stream of `a`s
    -> FoldM IO a b  -- A fold that reduces `a`s to a single `b`
    -> IO b          -- The result (a `b`)

foldIO connects a producer of as (i.e. a Shell) to a fold that consumes as and produces a single b (i.e. a FoldM). For now we will ignore how they are implemented. Instead we will play type tetris to see how we can connect things together.

The first argument we supply to foldIO is spam, whose type is:

spam :: Shell Text

Think of a Shell as a stream, and spam is a stream whose elements are Text values. Each element of this stream corresponds to one possible alternative for our template. For example, a template with exactly one alternative would be a stream with one element.

When we supply spam as the first argument to foldIO, the compiler infers that the first a in the type of foldIO must be Text

foldIO :: Shell a -> FoldM IO a b -> IO b
                ^
                |
                |
                |
spam   :: Shell Text

... therefore, the second a must also be Text:

foldIO :: Shell a -> FoldM IO a b -> IO b
                ^             ^
                |             |
                +-------------+
                |
spam   :: Shell Text

... so in this context foldIO has the more specialized type:

foldIO :: Shell Text -> FoldM IO Text b -> IO b

... and when we apply foldIO to spam we get the following narrower type:

foldIO spam :: FoldM IO Text b -> IO b

Now all we need to do is to provide a fold that can consume a stream of Text elements. We choose the random fold, which uses reservoir sampling to pick a random element from the stream. The type of random is:

random :: FoldM IO a (Maybe a)

In other words, given an input stream of as, this fold reduces the stream to a single Maybe a. The Maybe is either Nothing if the stream is empty or Just some random element from the stream if the stream is non-empty.

When we supply random as the second argument to foldIO, the compiler infers that the a in random must be Text:

foldIO spam :: FoldM IO Text        b -> IO b
                        |
                        |
                        |
                        v
random      :: FoldM IO a    (Maybe a)

... therefore the second a must also be Text:

foldIO spam :: FoldM IO Text        b -> IO b
                        |
                        +-----------+
                        |           |
                        v           v
random      :: FoldM IO a    (Maybe a)

So the specialized type of random becomes:

foldIO spam :: FoldM IO Text        b     -> IO b

random      :: FoldM IO Text (Maybe Text)

Now we can apply type inference in the opposite direction! The compiler infers that the b in the type of foldIO must be Maybe Text:

foldIO spam :: FoldM IO Text        b     -> IO b
                                    ^
                                    |
                                    |
                                    |
random      :: FoldM IO Text (Maybe Text)

... therefore the other b must also be Maybe Text:

foldIO spam :: FoldM IO Text        b     -> IO b
                                    ^           ^
                                    |           |
                                    +-----------+
                                    |
random      :: FoldM IO Text (Maybe Text)

... so we specialize foldIO's type even further to:

foldIO spam :: foldM IO Text (Maybe Text) -> IO (Maybe Text)

... and when we apply that to random the type simplifies down to:

foldIO spam random :: IO (Maybe Text)

The end result is a subroutine that loops over the stream using reservoir sampling, selects a random element (or Nothing if the stream is empty), and then returns the result.

All that's left is to print the result:

main = do
    x <- foldIO spam random
    print x

So that explains the top half of the code, but what about the bottom half? What is up with the addition and multiplication of strings?

Overloading

The first trick is that the strings are actually not strings at all! Haskell lets you overload string literals using the OverloadedStrings extension so that they type-check as any type that implements the IsString type class. The Shell Text type is one such type. If you provide a string literal where the compiler expects a Shell Text then the compiler will instead build a 1-element stream containing just that string literal.

The second trick is that Haskell lets you overload numeric operatoins to work on any type that implements the Num type class. The Shell Text type implements this type class, so you can add and multiply streams of text elements.

The behavior of addition is stream concatenation. In our template, when we write:

"Very well" + "Perfectly" + "Well" + "Exceptionally well"

... we are really concatenating four 1-element streams into a combined 4-element stream representing the four alternatives.

The behavior of multiplication is to sequence two templates. Either template may be a stream of multiple alternatives so when we sequence them we take the "cartesian product" of both streams. When we multiply two streams we concatenate every alterntaive from the first stream with every alternative from the second stream and return all possible combinations.

For example, when we write:

("couldn't " + "could not ") * ("resist" + "refrain from")

This reduces to four combinations:

  "couldn't resist"
+ "couldn't refrain from"
+ "could not resist"
+ "could not refrain from"

You can actually derive this using the rules of addition and multiplication:

("couldn't " + "could not ") * ("resist" + "refrain from")

-- Multiplication distributes over left addition
  "couldn't "  * ("resist" + "refrain from")
+ "could not " * ("resist" + "refrain from")

-- Multiplication distributes again
  "couldn't "   * "resist"
+ "couldn't "   * "refrain from"
+ "could not "  * "resist"
+ "could not "  * "refrain from"

-- Multiplying 1-element templates just sequences them
  "couldn't resist"
+ "couldn't refrain from"
+ "could not resist"
+ "could not refrain from"

Notice how if we sequence two 1-element templates

"I have " * "been"

... it's identical to string concatenation:

"I have been"

And that's it! We build the template using arithmetic and then we fold the results using random to select one template at random. That's the complete program! Or did we?

Weighting

Actually there's one catch: our spam generator is very heavily biased towards the third major template. This is because the generator weights all alternatives equally, but each major template has a different number of alternatives:

  • 1st template: 2304 alternatives
  • 2nd template: 16 alternatives
  • 3rd template: 829440
  • 4th template: 12 alternatives

As a result we're 360 times more likely to get the 3rd template than the next most common template (the 1st one). How can we weight each template to undo this bias?

The answer is simple: we can weight each template by using multiplication, scaling each template by the appropritae numeric factor.

In this case, the weights we will apply are:

  • 1st template: Increase frequency by 360x
  • 2nd template: Increase frequency by 51840x
  • 3rd template: Keep frequency the same (1x)
  • 4th template: Increase frequency by 69120x

Here's the implementation:

spam =  -- 1st major template
        360
     *  ""
     * ("I have" + "I've")
     *  " been "
     * ("surfing" + "browsing")
     *  " online more than "
     * ("three" + "3" + "2" + "4")
     *  " hours today, yet I never found any interesting article like yours. "
     * ("It's" + "It is")
     *  " pretty worth enough for me. "
     * ("In my opinion" + "Personally" + "In my view")
     *  ", if all "
     * ("webmasters" + "site owners" + "website owners" + "web owners")
     *  " and bloggers made good content as you did, the "
     * ("internet" + "net" + "web")
     *  " will be "
     * ("much more" + "a lot more")
     *  " useful than ever before."

        -- 2nd major template
   +    51840
     *  " I "
     * ("couldn't" + "could not")
     *  " "
     * ("resist" + "refrain from")
     *  " commenting. "
     * ("Very well" + "Perfectly" + "Well" + "Exceptionally well")
     *  " written!"

        -- 3rd major template
   +    1
     *  " "
     * ("I will" + "I'll")
     *  " "
     * ("right away" + "immediately")
     *  " "
     * ("take hold of" + "grab" + "clutch" + "grasp" + "seize" + "snatch")
     *  " your "
     * ("rss" + "rss feed")
     *  " as I "
     * ("can not" + "can't")
     *  " "
     * ("in finding" + "find" + "to find")
     *  " your "
     * ("email" + "e-mail")
     *  " subscription "
     * ("link" + "hyperlink")
     *  " or "
     * ("newsletter" + "e-newsletter")
     *  " service. Do "
     * ("you have" + "you've")
     *  " any? "
     * ("Please" + "Kindly")
     *  " "
     * ("allow" + "permit" + "let")
     *  " me "
     * ("realize" + "recognize" + "understand" + "recognise" + "know")
     *  " "
     * ("so that" + "in order that")
     *  " I "
     * ("may just" + "may" + "could")
     *  " subscribe. Thanks."

        -- 4th major template
   +    69120
     *  " "
     * ("It is" + "It's")
     *  " "
     * ("appropriate" + "perfect" + "the best")
     *  " time to make some plans for the future and "
     * ("it is" + "it's")
     *  " time to be happy."

Now this produces a fairer distribution between the four major alternatives:

$ ./spam
Just "I have been surfing online more than three 
hours today, yet I never found any interesting 
article like yours. It's pretty worth enough for 
me. In my view, if all web owners and bloggers 
made good content as you did, the internet will 
be a lot more useful than ever before."
$ ./spam
Just " It's the best time to make some plans for 
the future and it's time to be happy."
$ ./spam
Just " I will right away clutch your rss feed as 
I can't in finding your e-mail subscription link 
or newsletter service. Do you have any? Kindly 
let me understand so that I may just subscribe. 
Thanks."
$ ./spam
Just " I could not refrain from commenting. Exceptionally well written!"

Remember how we said that Shell Text implements the Num type class in order to get addition and multiplication? Well, you can also use the same Num class to overload integer literals. Any time the compiler sees an integer literal where it expects a Shell Text it will replace that integer with a stream of empty strings whose length is the given integer.

For example, if you write the number 3, it's equivalent to:

-- Definition of 3
3 = 1 + 1 + 1

-- 1 = ""
3 = "" + "" + ""

So if you write:

3 * "some string"

... that expands out to:

("" + "" + "") * "some string"

... and multiplication distributes to give us:

("" * "some string") + ("" * "some string") + ("" * "some string")

... which reduces to three copies of "some string":

"some string" + "some string" + "some string"

This trick works even when multiplying a number by a template with multiple alternatives:

2 * ("I have" + "I've")

-- 2 = 1 + 1
= (1 + 1) * ("I have" + "I've")

-- 1 = ""
= ("" + "") * ("I have" + "I've")

-- Multiplication distributes
= ("" * "I have") + ("" * "I've") + ("" * "I have") + ("" * "I've")

-- Simplify
= "I have" + "I've" + "I have" + "I've"

Arithmetic

In fact, Shell Text obeys all sort of arithmetic laws:

-- `0` is the identity of addition
0 + x = x
x + 0 = x

-- Addition is associative
(x + y) + z = x + (y + z

-- `1` is the identity of multiplication
1 * x = 1
x * 1 = 1

-- Multiplication is associative
(x * y) * z = x * (y * z)

-- Multiplication right-distributes over addition
(x + y) * z = (x * z) + (y * z)

-- 1-element streams left-distribute over addition
"string" * (x + y) = ("string" * x) + ("string" * y)

I'm not sure what the mathematical name is for this sort of structure. I usually call this a "semiring", but that's technically not correct because in a semiring we expect addition to commute, but here it does not because Shell Text preserves the ordering of results. For the case of selecting a random element ordering does not matter, but there are other operations we can perform on these streams that are order-dependent.

In fact, the laws of arithmetic enforce that you weigh all fine-grained alternatives equally, regardless of the frequencies of the top-level coarse-grained alternatives. If you weighted things based on the relative frequencies of top-level alternatives you would get very inconsistent behavior.

For example, if you tried to be more "fair" for outer alternatives, then addition stops being associative, meaning that these templates would no longer behave the same:

{x|{y|z}}
{{x|y}|z}
{x|y|z}

Conclusion

The Haskell template generator was so concise for two main reasons:

  • We embedded the template directly within Haskell, so we skipped having to parse the template
  • We reuse modular and highly generic components (like random, foldIO, (+), and (*) and numeric literals) instead of writing our own custom code

Also, our program is easily modifiable. For example, if we want to collect all the templates, we just replace random with vector:

import Control.Foldl (vector)
import Data.Vector (Vector)

main = do
    x <- foldIO spam vector
    print (x :: Vector Text)

That efficiently builds a vector in place using mutation that stores all the results, then purifies the result.

However, these sorts of tricks come at a cost. Most of the awesome tricks like this are not part of the standard library and instead exist in libraries, making them significantly harder to discover. Worse, you're never really "done" learning the Haskell language. The library ecosystem is a really deep rabbit hole full of jewels and at some point you just have to pick a point to just stop digging through libraries and build something useful ...

... or useless, like a comment spam generator.

3 comments:

  1. I could not refrain from commenting. Exceptionally well written!

    ReplyDelete
  2. Great post!

    In regards to the name of the structure in question, I have heard it referred to as a "semi near-ring" or a "near semiring" in the past.

    ReplyDelete