Another round of documentation updates

Moved the reverse proxy examples out of the README, into `docs/deploying.md`. Also wrote the reverse proxy section for the deployment docs.

Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
2025-03-10 09:18:49 +01:00 · 2025-01-17 01:28:17 +01:00 · fe7cf44f26
commit fe7cf44f26
parent 7b2e472e35
2 changed files with 116 additions and 45 deletions
--- a/README.md
+++ b/README.md
@@ -18,23 +18,7 @@ This is a tarpit, modeled after [Nepenthes](https://zadzmo.org/code/nepenthes/),

 `iocaine` does not try to slow crawlers. It does not try to waste their time that way - that is left up to the reverse proxy. `iocaine` is *purely* about generating garbage.

-To give you an idea, here's a simple Caddy configuration:
-
-<a id="caddy-example"></a>
-``` caddyfile
-(poison-ai) {
-    @ai {
-        header_regexp user-agent (?i:GPTBot|ChatGPT|CCBot|Claude)
-    }
-    reverse_proxy @ai 127.0.0.1:42069
-}
-blog.example.com {
-    import poison-ai
-    reverse_proxy [::1]:8080
-}
-```
-
-If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md), or visit the [demo][demo:url] to see it in action.
+To get an idea of what this looks like in practice, check the [demo][demo:url], or peek into the [deployment documentation](docs/deploying.md#configuring-the-reverse-proxy). If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md).

 ## Warning

@@ -42,40 +26,13 @@ This is deliberately malicious software, intended to cause harm. Do not deploy i

 Entrapment is done by the reverse proxy. Anything that ends up being served by `iocaine` will be trapped there: there are no outgoing links. Be careful what you route towards it.

-## Usage
-
-Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. An example for Caddy has been [shown above](#caddy-example), for `nginx`, something similar might look like this:
-
-``` nginx
-map $http_user_agent $badagent {
-  default   0;
-  ~*gptbot  1;
-  ~*chatgpt 1;
-  ~*ccbot   1;
-  ~*claude  1;
-}
-server {
-  server_name blog.example.com;
-  if ($badagent) {
-    rewrite ^ /ai;
-  }
-  location /ai {
-    proxy_pass 127.0.0.1:42069;
-  }
-  location / {
-    proxy_set_header Host $host;
-    proxy_pass http://[::1]:8080;
-  }
-}
-```
-
 ## Installation

 `cargo install --path .`

 Or, if you prefer Docker, an [image][oci:url] is available. If you're on NixOS, this repository is a flake, and provides a NixOS module to help deploying it. See [here](https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) for how to use that.

-For more detailed instructions, see [docs/deploying.md](docs/deploying.md).
+Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. See the [deployment documentation](docs/deploying.md) for details.

 ## Configuration

--- a/docs/deploying.md
+++ b/docs/deploying.md
@@ -109,3 +109,117 @@ Deploying under NixOS is made simple by using the nixosModule provided by this r
 ```

 </details>
+
+## Configuring the reverse proxy
+
+While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
+
+As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
+
+### nginx
+
+Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:
+
+```nginx
+server {
+  server_name blog.example.com;
+
+  location / {
+    proxy_set_header Host $host;
+    proxy_pass http://[::1]:8080;
+  }
+}
+```
+
+To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents will evaluate to a truthy value, while unmatched agents will default to a false-y one. We can do this with a `map` outside of the `server` block:
+
+``` nginx
+map $http_user_agent $badagent {
+  default   0;
+  ~*gptbot  1;
+  ~*chatgpt 1;
+  ~*ccbot   1;
+  ~*claude  1;
+}
+```
+
+Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:
+
+
+```nginx
+server {
+  server_name blog.example.com;
+  if ($badagent) {
+    rewrite ^ /ai;
+  }
+  location /ai {
+    proxy_set_header Host $host;
+    proxy_pass http://127.0.0.1:42069;
+  }
+  location / {
+    proxy_set_header Host $host;
+    proxy_pass http://[::1]:8080;
+  }
+}
+```
+
+### Caddy
+
+Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `Caddy`. Normally, that would look something like this:
+
+```caddyfile
+blog.example.com {
+  reverse_proxy [::1]:8080
+}
+```
+
+To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
+
+```caddyfile
+blog.example.com {
+  @ai {
+    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
+  }
+  reverse_proxy @ai 127.0.0.1:42069
+  reverse_proxy [::1]:8080
+}
+```
+
+We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but we can mitigate that with a snippet. Let's start with that:
+
+```caddyfile
+(ai-bots) {
+  header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
+}
+```
+
+This is essentially the same thing as the `@ai` matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
+
+```caddyfile
+blog.example.com {
+  rate_limit {
+    zone ai-bots {
+      match {
+        import ai-bots
+      }
+      key {remote_host}
+      events 16
+      window 1m
+    }
+  }
+  
+  @ai {
+    import ai-bots
+  }
+  @not-ai {
+    not {
+      import ai-bots
+    }
+  }
+  
+  reverse_proxy @ai 127.0.0.1:42069
+  reverse_proxy @not-ai [::1]:8080
+}
+```
+
+This does two things: it routes AI user-agents to `iocaine`, and applies a rate limit of 16 requests per minute to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a `Retry-After` header, to encourage them to come back to our little maze.
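
The routing decision that both the nginx `map` and the Caddy `@ai` matcher make can be sketched outside of either proxy. The snippet below is only an illustration, not part of `iocaine` or of either configuration: `pick_upstream` is a hypothetical helper, and the two upstream addresses are simply the ones used in the examples. It assumes the pattern is applied unanchored and case-insensitively to the `User-Agent` header, and can be handy for checking which user agents a pattern would send to the tarpit before deploying it.

```python
import re

# Same alternation as in the nginx map and the Caddy matcher,
# matched case-insensitively anywhere in the User-Agent string.
AI_AGENTS = re.compile(r"(?i:gptbot|chatgpt|ccbot|claude)")

# Upstream addresses copied from the examples; adjust to your setup.
IOCAINE = "127.0.0.1:42069"
SITE = "[::1]:8080"


def pick_upstream(user_agent: str) -> str:
    """Hypothetical helper: return the upstream a request would be proxied to."""
    return IOCAINE if AI_AGENTS.search(user_agent) else SITE


if __name__ == "__main__":
    samples = (
        "Mozilla/5.0 (compatible; GPTBot/1.0)",
        "ClaudeBot/1.0",
        "Mozilla/5.0 (X11; Linux x86_64; rv:134.0) Gecko/20100101 Firefox/134.0",
    )
    for ua in samples:
        print(f"{pick_upstream(ua):>16}  <-  {ua}")
```

Because the match is a plain substring search, any user agent that merely contains one of these words ends up in the tarpit as well, so it is worth running a few real-world user agent strings through the pattern first.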