From fe7cf44f26c720fdf6fdac7e2c2117ddce2b22f5 Mon Sep 17 00:00:00 2001
From: Gergely Nagy
Date: Fri, 17 Jan 2025 01:28:17 +0100
Subject: [PATCH] Another round of documentation updates

Moved the reverse proxy examples out of the README, into
`docs/deploying.md`. Also wrote the reverse proxy section for the
deployment docs.

Signed-off-by: Gergely Nagy
---
 README.md         |  47 +------------------
 docs/deploying.md | 114 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 116 insertions(+), 45 deletions(-)

diff --git a/README.md b/README.md
index 75a36dd..c5a1f69 100644
--- a/README.md
+++ b/README.md
@@ -18,23 +18,7 @@ This is a tarpit, modeled after [Nepenthes](https://zadzmo.org/code/nepenthes/),
 
 `iocaine` does not try to slow crawlers. It does not try to waste their time that way - that is left up to the reverse proxy. `iocaine` is *purely* about generating garbage.
 
-To give you an idea, here's a simple Caddy configuration:
-
-
-``` caddyfile
-(poison-ai) {
-    @ai {
-        header_regexp user-agent (?i:GPTBot|ChatGPT|CCBot|Claude)
-    }
-    reverse_proxy @ai 127.0.0.1:42069
-}
-blog.example.com {
-    import poison-ai
-    reverse_proxy [::1]:8080
-}
-```
-
-If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md), or visit the [demo][demo:url] to see it in action.
+To get an idea of how it works, check the [demo][demo:url], or peek into the [deployment documentation](docs/deploying.md#configuring-the-reverse-proxy). If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md).
 
 ## Warning
 
@@ -42,40 +26,13 @@ This is deliberately malicious software, intended to cause harm. Do not deploy i
 
 Entrapment is done by the reverse proxy. Anything that ends up being served by `iocaine` will be trapped there: there are no outgoing links. Be careful what you route towards it.
 
-## Usage
-
-Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. An example for Caddy has been [shown above](#caddy-example), for `nginx`, something similar might look like this:
-
-``` nginx
-map $http_user_agent $badagent {
-    default 0;
-    ~*gptbot 1;
-    ~*chatgpt 1;
-    ~*ccbot 1;
-    ~*claude 1;
-}
-server {
-    server_name blog.example.com;
-    if ($badagent) {
-        rewrite ^ /ai;
-    }
-    location /ai {
-        proxy_pass 127.0.0.1:42069;
-    }
-    location / {
-        proxy_set_header Host $host;
-        proxy_pass http://[::1]:8080;
-    }
-}
-```
-
 ## Installation
 
 `cargo install --path .`
 
 Or, if you prefer Docker, an [image][oci:url] is available. If you're on NixOS, this repository is a flake, and provides a NixOS module to help deploying it. See [here](https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) for how to use that.
 
-For more detailed instructions, see [docs/deploying.md](docs/deploying.md).
+Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them; see the [deployment documentation](docs/deploying.md).
 
 ## Configuration
 
diff --git a/docs/deploying.md b/docs/deploying.md
index c55907d..16f01ae 100644
--- a/docs/deploying.md
+++ b/docs/deploying.md
@@ -109,3 +109,117 @@ Deploying under NixOS is made simple by using the nixosModule provided by this r
 ```
+
+## Configuring the reverse proxy
+
+While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
+
+As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
+
+### nginx
+
+Let's assume that we have a site running at `[::1]:8080`, and we want to serve it with `nginx`. Normally, that would look something like this:
+
+```nginx
+server {
+    server_name blog.example.com;
+
+    location / {
+        proxy_set_header Host $host;
+        proxy_pass http://[::1]:8080;
+    }
+}
+```
+
+To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents will evaluate to a truthy value, while unmatched agents will default to a false-y one. We can do this with a `map` outside of the `server` block:
+
+``` nginx
+map $http_user_agent $badagent {
+    default 0;
+    ~*gptbot 1;
+    ~*chatgpt 1;
+    ~*ccbot 1;
+    ~*claude 1;
+}
+```
+
+Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:
+
+
+```nginx
+server {
+    server_name blog.example.com;
+    if ($badagent) {
+        rewrite ^ /ai;
+    }
+    location /ai {
+        proxy_set_header Host $host;
+        proxy_pass http://127.0.0.1:42069;
+    }
+    location / {
+        proxy_set_header Host $host;
+        proxy_pass http://[::1]:8080;
+    }
+}
+```
+
+### Caddy
+
+Let's assume that we have a site running at `[::1]:8080`, and we want to serve it with `Caddy`. Normally, that would look something like this:
+
+```caddyfile
+blog.example.com {
+    reverse_proxy [::1]:8080
+}
+```
+
+To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
+
+```caddyfile
+blog.example.com {
+    @ai {
+        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
+    }
+    reverse_proxy @ai 127.0.0.1:42069
+    reverse_proxy [::1]:8080
+}
+```
+
+We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:
+
+```caddyfile
+(ai-bots) {
+    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
+}
+```
+
+This is essentially the same thing as the `@ai` matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
+
+```caddyfile
+blog.example.com {
+    rate_limit {
+        zone ai-bots {
+            match {
+                import ai-bots
+            }
+            key {remote_host}
+            events 16
+            window 1m
+        }
+    }
+
+    @ai {
+        import ai-bots
+    }
+    @not-ai {
+        not {
+            import ai-bots
+        }
+    }
+
+    reverse_proxy @ai 127.0.0.1:42069
+    reverse_proxy @not-ai [::1]:8080
+}
+```
+
+This does two things: it routes AI user-agents to `iocaine`, and applies a 16 request / minute rate limit to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a `Retry-After` header, to encourage them to come back to our little maze.
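
To get a feel for which visitors the user-agent matcher catches before touching a live proxy, the routing decision can be sketched in a few lines of Python. This is only an illustration and not part of `iocaine` or of either proxy: the `pick_upstream` helper and the constant names are hypothetical, the pattern is copied from the Caddy matcher, and the two upstream addresses are the ones used in the examples. It assumes an unanchored, case-insensitive search, which is how the `~*` entries in the nginx `map` behave too.

```python
import re

# Same pattern as the Caddy `header_regexp` matcher; the scoped `(?i:...)`
# flag makes the alternatives case-insensitive.
AI_AGENT_PATTERN = re.compile(r"(?i:gptbot|chatgpt|ccbot|claude)")

# Upstream addresses taken from the configuration examples.
IOCAINE_UPSTREAM = "127.0.0.1:42069"
SITE_UPSTREAM = "[::1]:8080"


def pick_upstream(user_agent: str) -> str:
    """Return the upstream a request with this User-Agent would be proxied to.

    Hypothetical helper for illustration only: it mirrors the matcher logic,
    it does not talk to nginx, Caddy or iocaine.
    """
    # An unanchored search: the pattern may appear anywhere in the header.
    if AI_AGENT_PATTERN.search(user_agent):
        return IOCAINE_UPSTREAM
    return SITE_UPSTREAM


if __name__ == "__main__":
    samples = [
        "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)",
        "Mozilla/5.0 (X11; Linux x86_64; rv:134.0) Gecko/20100101 Firefox/134.0",
    ]
    for agent in samples:
        print(f"{pick_upstream(agent):>16}  <-  {agent}")
```

Because the match is a plain substring search, any user agent that merely contains one of the listed words ends up in the maze as well, which is worth keeping in mind when extending the list.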