Haskell for all: Forward and reverse proxies explained

This post explains the difference between a forward proxy and a reverse proxy. It will likely be most useful for:

engineers designing on-premises enterprise software to forward traffic in restrictive networking environments
authors of forward and reverse proxy packages / frameworks
people who want to learn more about networking in general

I’m writing this mainly because I had to puzzle through all of this when I was designing part of our product’s network architecture at work and I frequently explain what I learned to my coworkers. Most of the existing explanations of proxies were not helpful to me when I learned this for the first time, so I figured I would try to explain things in my own words so that I can reference this post when teaching others.

This post will not touch upon the use cases for forward and reverse proxies and instead will mostly focus on the architectural differences between them. If you are interested in learning more about how they are used in practice then I recommend reading the Wikipedia article on proxies:

Wikipedia - Proxy server

Also, this post assumes some familiarity with the HTTP and HTTPS protocols.

The difference

The simplest difference between the two types of proxies is that:

A reverse proxy is a proxy where the proxy selects the origin server
A forward proxy is a proxy where the client selects the origin server

… where “origin server” means the server that the HTTP resource originates from, which is also the server that the proxy forwards the HTTP request to.

There are a few alternative definitions for forward proxy and reverse proxy, but in my experience the above definitions promote the correct intuition.
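
To make that definition concrete, here is a minimal Python sketch where the only difference between the two kinds of proxy is who supplies the origin server. The function names and the hard-coded `google.com` are assumptions made up for illustration; they don't come from any real proxy package.

```python
from typing import Optional

# Hypothetical fixed upstream, chosen by whoever configured the reverse proxy.
REVERSE_PROXY_ORIGIN = "google.com"


def reverse_proxy_origin(client_requested: Optional[str]) -> str:
    """A reverse proxy ignores whatever origin server the client had in mind."""
    return REVERSE_PROXY_ORIGIN


def forward_proxy_origin(client_requested: Optional[str]) -> str:
    """A forward proxy forwards to the origin server that the client names."""
    if client_requested is None:
        raise ValueError("a forward proxy needs the client to name an origin server")
    return client_requested
```

Asking both for `github.com` gives `google.com` from the reverse proxy and `github.com` from the forward proxy.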

`curl` examples

I sometimes explain the difference between forward and reverse proxies in terms of curl commands.

For example, if I host a reverse proxy at reverse.example.com that forwards requests to google.com, then a sample HTTP request going through the reverse proxy might look like the following curl command:

$ curl https://reverse.example.com

In a reverse proxy, the client (e.g. curl) does not select the origin server (e.g. google.com). In this particular example the origin server is hard-coded into the reverse proxy and there would be no way for the client to specify that the HTTP request should be forwarded to github.com instead.

The data flow for such a request looks something like this:

┌──────────────────────────────────┐
│                                  │
│            google.com            │
│                                  │
└──────────────────────────────────┘
         ↑                ↓
 "GET / HTTP/1.1" "HTTP/1.1 200 OK"
         ↑                ↓
┌──────────────────────────────────┐
│                                  │
│        reverse.example.com       │
│                                  │
└──────────────────────────────────┘
         ↑                ↓
 "GET / HTTP/1.1" "HTTP/1.1 200 OK"
         ↑                ↓
┌──────────────────────────────────┐
│                                  │
│               curl               │
│                                  │
└──────────────────────────────────┘

In other words:

The client (e.g. curl) sends an HTTP request to the reverse proxy
The proxy (e.g. reverse.example.com) forwards the HTTP request to the origin server
The origin server (e.g. google.com) responds to the HTTP request
The proxy forwards the HTTP response back to the client

Now suppose that I host a forward proxy at forward.example.com. A sample HTTP request destined for google.com going through the forward proxy might look like this:

$ curl --proxy https://forward.example.com https://google.com

In a forward proxy, the client (e.g. curl) selects the origin server (e.g. google.com) and could have potentially selected a different origin server, such as github.com, like this:

$ curl --proxy https://forward.example.com https://github.com

… and then the forward proxy would forward the request to github.com instead¹.

The data flow for a request going through a forward proxy depends on whether the client connects to the origin server using HTTP or HTTPS.

If the client uses HTTP then the data flow for a forward proxy typically looks like this:

┌──────────────────────────────────────────────────┐
│                                                  │
│                    google.com                    │
│                                                  │
└──────────────────────────────────────────────────┘
                ↑                         ↓
        "GET / HTTP/1.1"          "HTTP/1.1 200 OK"
                ↑                         ↓
┌──────────────────────────────────────────────────┐
│                                                  │
│                forward.example.com               │
│                                                  │
└──────────────────────────────────────────────────┘
                ↑                         ↓
 "GET http://google.com HTTP/1.1" "HTTP/1.1 200 OK"
                ↑                         ↓
┌──────────────────────────────────────────────────┐
│                                                  │
│                       curl                       │
│                                                  │
└──────────────────────────────────────────────────┘

In other words:

The client (e.g. curl) sends an HTTP request to the proxy with the origin server in the request line
The proxy (e.g. forward.example.com) forwards the request to the origin server specified on the request line
The origin server (e.g. google.com) responds to the HTTP request
The proxy forwards the HTTP response back to the client

The main difference from the previous example is the HTTP request line that curl sends to forward.example.com. Forward proxies receive an absolute URI from the client (e.g. http://google.com) instead of a relative URI (like /). This is how the forward proxy knows where to forward the client’s request.
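
As a rough sketch of that step, here is how a forward proxy might pull the origin server out of the request line using Python's standard `urllib.parse`. The names `Upstream` and `parse_forward_request_line` are hypothetical, and the sketch assumes the default ports of 80 for `http` and 443 for `https`.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


class Upstream(NamedTuple):
    method: str
    host: str
    port: int
    path: str  # path plus query string, as it would be sent to the origin server


def parse_forward_request_line(line: str) -> Optional[Upstream]:
    """Return where to forward the request, or None if the URI is not absolute."""
    parts = line.strip().split(" ")
    if len(parts) != 3:
        return None
    method, target, _version = parts
    uri = urlsplit(target)
    # A relative URI like "/" has no scheme or host, so there is nothing to forward to.
    if uri.scheme not in DEFAULT_PORTS or not uri.hostname:
        return None
    port = uri.port or DEFAULT_PORTS[uri.scheme]
    path = uri.path or "/"
    if uri.query:
        path += "?" + uri.query
    return Upstream(method, uri.hostname, port, path)
```

With this sketch, `GET http://google.com HTTP/1.1` yields an upstream of `google.com` on port 80 with path `/`, whereas `GET / HTTP/1.1` yields `None`, because a relative URI doesn't tell the proxy where to go.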

Now contrast that with the data flow for a forward proxy when the client uses HTTPS:

                             ┌────────────┐
                             │            │
                             │ google.com │
                             │            │
                             └────────────┘
                                    ↑
                                   TCP
                                    │
┌───────────────────────────────────│─┐
│                                   │ │
│          forward.example.com      │ │
│                                   │ │
└───────────────────────────────────│─┘
                ↑                   │
 "CONNECT google.com:443 HTTP/1.1" TCP
                ↑                   ↓
┌─────────────────────────────────────┐
│                                     │
│                curl                 │
│                                     │
└─────────────────────────────────────┘

This is a different flow of information:

The client (e.g. curl) sends a CONNECT request to the proxy
The proxy (e.g. forward.example.com) opens a TCP connection to the origin server (e.g. google.com)
The proxy forwards the rest of the client’s connection as raw TCP traffic directly to the origin server
- This TCP connection is encrypted, so by default the proxy cannot intercept or modify HTTPS traffic
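
For comparison, the only thing the proxy needs to read from a CONNECT request is the host and port to open the TCP connection to. Here is a minimal sketch of that parsing step (the function name is an assumption for illustration):

```python
from typing import Optional, Tuple


def parse_connect_request_line(line: str) -> Optional[Tuple[str, int]]:
    """Return (host, port) for a CONNECT request line, or None for anything else."""
    parts = line.strip().split(" ")
    if len(parts) != 3 or parts[0] != "CONNECT":
        return None
    # Split on the last colon so that bracketed IPv6 literals like [::1]:443 work.
    host, separator, port = parts[1].rpartition(":")
    if not separator or not host or not port.isdigit():
        return None
    return host.strip("[]"), int(port)
```

Given `CONNECT google.com:443 HTTP/1.1` this returns `("google.com", 443)`. Note that there is no path, query, or scheme to parse, since everything after the tunnel is established is opaque to the proxy.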

However, despite the difference in data flow, both HTTP forward proxies and HTTPS forward proxies let the client select the origin server. This is why we still call them both forward proxies even though they are architecturally different server types.

`nc` examples

If you’re curious, you can verify the difference in curl’s behavior when using HTTP vs HTTPS forward proxies on the command line. If you set up nc to listen on port 8000:

$ nc -l 8000

… and then ask curl to use localhost:8000 as a forward proxy, you’ll see different results depending on whether the origin server URI uses an HTTP or HTTPS scheme. For example, if you run:

$ curl --proxy http://localhost:8000 --data 'A secret!' http://example.com

… then nc will print the following incoming request from curl:

$ nc -l 8000
POST http://example.com/ HTTP/1.1
Host: example.com
User-Agent: curl/7.64.1
Accept: */*
Proxy-Connection: Keep-Alive
Content-Length: 9
Content-Type: application/x-www-form-urlencoded

A secret!

Note that the request line has an absolute URI specifying the origin server to connect to. Also, the HTTP forward proxy has complete access to the contents of the request (including the headers and payload) and can tamper with them before sending the request further upstream.

Contrast that with an HTTPS request that goes through a forward proxy:

$ curl --proxy http://localhost:8000 --data 'A secret!' https://example.com

… where now nc will print an incoming request that looks like this:

$ nc -l 8000
CONNECT example.com:443 HTTP/1.1
Host: example.com:443
User-Agent: curl/7.64.1
Proxy-Connection: Keep-Alive

This time the HTTP method is CONNECT (regardless of what the original method was) and the client only divulges enough information to the proxy to establish the TCP tunnel to example.com.
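
If you don't have nc handy, a few lines of Python can play the same role. This is only a rough stand-in, not a reimplementation of nc: the hypothetical `capture_request_head` helper accepts a single connection and returns whatever the client sent, at least up to the blank line that ends the headers.

```python
import socket


def capture_request_head(server: socket.socket, limit: int = 65536) -> bytes:
    """Accept one connection and return the bytes the client sent, reading
    until the blank line that terminates the HTTP headers has arrived."""
    conn, _address = server.accept()
    with conn:
        conn.settimeout(5.0)
        data = b""
        while b"\r\n\r\n" not in data and len(data) < limit:
            chunk = conn.recv(4096)
            if not chunk:
                break
            data += chunk
    return data


# Example usage (left as comments so that importing this sketch never blocks):
#
#     with socket.create_server(("127.0.0.1", 8000)) as server:
#         print(capture_request_head(server).decode("latin-1"))
```

Running the commented-out usage and then one of the curl commands above should print the same request line that nc shows. curl will then wait for a reply that never comes, exactly as it does with nc.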

Blurring the line

Sometimes people configure reverse proxies to behave like a “poor person’s forward proxy”. For example, you could imagine a reverse proxy configured to select the origin server based on the value of an HTTP header:

$ curl --header 'Origin: google.com' https://reverse.example.com

… or based on the URI:

$ curl https://reverse.example.com/?url=google.com

… or based on a subdomain (one per supported origin server):

$ # Yes, I have actually seen this in the wild
$ curl https://google.reverse.example.com

This sort of thing is possible, and there are some constrained use cases for doing things this way, but you should err on the side of using a forward proxy if you intend to let the client select the origin server. Forward proxy software (like squid) will better support this use case than reverse proxy software (like haproxy) and HTTP clients (like your browser or curl) will also better support forward proxies for this use case.
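
For illustration only, here is a sketch of what the routing logic for such a “poor person’s forward proxy” might look like, covering the three variations above. Everything here is hypothetical: the `pick_origin` function, the `SUBDOMAIN_ORIGINS` table, and the assumption that the headers arrive as a plain dictionary with an exact `Origin` key.

```python
from typing import Dict, Optional
from urllib.parse import parse_qs, urlsplit

PROXY_DOMAIN = "reverse.example.com"

# One entry per supported origin server, keyed by subdomain label.
SUBDOMAIN_ORIGINS = {"google": "google.com", "github": "github.com"}


def pick_origin(host: str, target: str, headers: Dict[str, str]) -> Optional[str]:
    """Choose an origin server from client-supplied hints, or None if there are none."""
    # 1. An HTTP header, as in: curl --header 'Origin: google.com' ...
    from_header = headers.get("Origin")
    if from_header:
        return from_header

    # 2. A query parameter, as in: curl https://reverse.example.com/?url=google.com
    query = parse_qs(urlsplit(target).query)
    if "url" in query:
        return query["url"][0]

    # 3. A subdomain, as in: curl https://google.reverse.example.com
    suffix = "." + PROXY_DOMAIN
    if host.endswith(suffix):
        return SUBDOMAIN_ORIGINS.get(host[: -len(suffix)])

    return None
```

Each variation needs its own ad hoc convention that the client has to know about, which is part of why a real forward proxy is usually the better fit when the client selects the origin server.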

Example pseudocode

To further clarify the difference, here is some example pseudocode for how one would implement the various types of proxies.

The request handler for a hand-written reverse proxy might look something like the following Python pseudocode:

def handleRequest(request):
    # Forward the incoming request further upstream to some predefined origin
    # server (e.g. google.com), typically with some changes to the HTTP headers
    # that we won't cover in this post.
    response = httpRequest(
        host = "https://google.com",
        method = request.method,
        headers = fixRequestHeaders(request.headers),
        path = request.path,
        query = request.query,
        body = request.body
    )

    # Now forward the origin server's response back to the client
    respond(
        headers = fixResponseHeaders(response.headers),
        statusCode = response.statusCode,
        body = response.body
    )
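
If you want something you can actually run, here is a minimal sketch of the same idea using only Python's standard library. It makes several simplifying assumptions that a real reverse proxy would not: it only handles GET, it copies only the `Content-Type` header, and `urllib` follows redirects on the proxy's behalf. The name `make_reverse_proxy` is made up for this example.

```python
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Ignore proxy-related environment variables so we talk to the origin directly.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def make_reverse_proxy(origin_base_url: str, listen_port: int = 0) -> ThreadingHTTPServer:
    """Build a server that forwards every GET to one fixed origin server."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            # The origin is fixed; only the path and query come from the client.
            try:
                with _opener.open(origin_base_url + self.path, timeout=10) as upstream:
                    status, body = upstream.status, upstream.read()
                    content_type = upstream.headers.get("Content-Type")
            except urllib.error.HTTPError as error:
                # The origin answered with an error status; pass it along.
                status, body = error.code, error.read()
                content_type = error.headers.get("Content-Type")
            except urllib.error.URLError:
                # The origin could not be reached at all.
                status, body, content_type = 502, b"Bad gateway", "text/plain"

            self.send_response(status)
            if content_type:
                self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the sketch quiet

    return ThreadingHTTPServer(("127.0.0.1", listen_port), Handler)
```

Calling `make_reverse_proxy("https://google.com", 8080).serve_forever()` and then running `curl http://localhost:8080` should behave like the reverse proxy example from earlier, minus the TLS on the client side.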

The request handler for an HTTP forward proxy would look something like this:

def handleRequest(request):
    # A request to an HTTP forward proxy has a request line with an absolute
    # URI:
    #
    #     ${METHOD} ${SCHEME}://${HOST}${PATH} HTTP/1.1
    #
    # … which is how the proxy knows where to forward the request further
    # upstream.
    upstreamURI = request.absoluteURI

    if upstreamURI is not None:
        # Other than obtaining the origin server from the request line, the rest
        # is similar to a reverse proxy.
        response = httpRequest(
            host = upstreamURI.host,
            method = request.method,
            headers = fixRequestHeaders(request.headers),
            path = upstreamURI.path,
            query = upstreamURI.query,
            body = request.body,
        )

        respond(
            headers = fixResponseHeaders(response.headers),
            statusCode = response.statusCode,
            body = response.body
        )
    else:
        respond(
            statusCode = 400,
            body = "The request line did not have an absolute URI",
        )

On the other hand, an HTTPS forward proxy would look something like this:

def handleRequest(request):
    # A request to an HTTPS forward proxy has a request line of the form:
    #
    #     CONNECT ${HOST}:${PORT} HTTP/1.1
    if request.connect is not None:
        forwardTcp(
            host = request.connect.host,
            port = request.connect.port,
            socket = request.socket
        )
    else:
        respond(
            statusCode = 400,
            body = "The HTTP method was not CONNECT"
        )
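
The `forwardTcp` function above is left undefined, so here is one way it could be sketched with Python's `socket` and `threading` modules. This is an illustration under a few assumptions rather than a production implementation: it uses one thread per direction, it does no access control, and it replies with the conventional `200 Connection established` status before tunneling begins (the proxy has to send some 2xx response to tell the client that the tunnel is ready).

```python
import socket
import threading


def _pipe(source: socket.socket, destination: socket.socket) -> None:
    """Copy bytes from source to destination until source stops sending."""
    try:
        while True:
            chunk = source.recv(65536)
            if not chunk:
                break
            destination.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            # Signal end-of-stream to the other side without closing our read half.
            destination.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def forward_tcp(host: str, port: int, client: socket.socket) -> None:
    """Open a TCP connection to host:port and relay raw bytes in both directions."""
    upstream = socket.create_connection((host, port), timeout=10)
    upstream.settimeout(None)
    # Tell the client that the tunnel is ready.
    client.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")

    uplink = threading.Thread(target=_pipe, args=(client, upstream), daemon=True)
    uplink.start()
    _pipe(upstream, client)
    uplink.join()
    upstream.close()
    client.close()
```

The proxy never parses anything after the CONNECT request; it just shuffles bytes, which is why it can't read or modify the encrypted traffic passing through it.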

Combining a forward and reverse proxy

This section illustrates a scenario where you might combine a forward proxy and a reverse proxy to further reinforce the difference between the two.

Suppose that we wish to create a forward proxy reachable at https://both.example.com using squid. The problem is that squid does not provide out-of-the-box support for listening to HTTPS connections to squid itself.

Carefully note that squid can easily forward an HTTPS connection (using a CONNECT tunnel, as explained above), but squid does not support encrypting the client’s initial HTTP request to squid itself requesting the establishment of the CONNECT tunnel.

However, we can run a reverse proxy that can handle TLS termination (such as haproxy) listening on https://both.example.com. The reverse proxy’s sole job is to accept encrypted HTTPS requests sent to https://both.example.com and then forward the decrypted HTTP request to the squid service running on the same machine. squid then forwards the request further depending on the origin server specified in the HTTP request line.

In this scenario, squid is still acting as a forward proxy, in the sense that squid uses the HTTP request line provided by the client to determine where to forward the request. Conversely, haproxy is acting as a reverse proxy because haproxy always forwards all incoming requests to squid regardless of what the client specifies.

This scenario illustrates one reason why I don’t define forward or reverse proxies in terms of which network or machine the proxy runs on. In this example, both the forward proxy and reverse proxy are running on the same network and machine.

Related reading

There are other types of proxies that this post didn’t cover, but if you want to learn more you should also check out:

Transparent proxies

These are proxies where the client doesn’t specify the proxy at all because it intercepts all client traffic. I personally view transparent proxies as a third type of forward proxy (the client still selects the origin server), although transparent proxies are architecturally very different from non-transparent forward proxies.

SSL Forward Proxy

This is a variation on a forward proxy that can decrypt HTTPS traffic. However, this requires the client to opt into this decryption in some way.

¹ Some forward proxies do not forward all HTTP requests. Instead, a forward proxy might specify which hostnames, protocols, and ports to permit in an access control list. This means that the example from the text would only work if the forward proxy’s access control lists permitted requests to google.com and github.com.

3 comments:

Senthil Kumaran, September 1, 2021 at 4:44 PM
Wow, this is simplest and very clear explanation of forward and reverse proxy that I have ever read.

Sudsy, September 1, 2021 at 6:29 PM
Are you quite sure you mean "origin" server? In all cases in your explanation, replacing "origin" with "destination" makes more sense.

Wednesday, September 1, 2021
