Another round of documentation updates

Moved the reverse proxy examples out of the README, into
`docs/deploying.md`. Also wrote the reverse proxy section for the
deployment docs.

Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
Gergely Nagy 2025-01-17 01:28:17 +01:00
parent 7b2e472e35
commit fe7cf44f26
2 changed files with 116 additions and 45 deletions


@@ -18,23 +18,7 @@ This is a tarpit, modeled after [Nepenthes](https://zadzmo.org/code/nepenthes/),
`iocaine` does not try to slow crawlers. It does not try to waste their time that way - that is left up to the reverse proxy. `iocaine` is *purely* about generating garbage.
To give you an idea, here's a simple Caddy configuration: To give you an idea of how it works, check the [demo][demo:url], or peek into the [deployment documentation](docs/deploying.md#configuring-the-reverse-proxy). If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md).
<a id="caddy-example"></a>
``` caddyfile
(poison-ai) {
  @ai {
    header_regexp user-agent (?i:GPTBot|ChatGPT|CCBot|Claude)
  }
  reverse_proxy @ai 127.0.0.1:42069
}
blog.example.com {
  import poison-ai
  reverse_proxy [::1]:8080
}
```
If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md), or visit the [demo][demo:url] to see it in action.
## Warning
@@ -42,40 +26,13 @@ This is deliberately malicious software, intended to cause harm. Do not deploy i
Entrapment is done by the reverse proxy. Anything that ends up being served by `iocaine` will be trapped there: there are no outgoing links. Be careful what you route towards it.
## Usage
Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. An example for Caddy has been [shown above](#caddy-example), for `nginx`, something similar might look like this:
``` nginx
map $http_user_agent $badagent {
  default 0;
  ~*gptbot 1;
  ~*chatgpt 1;
  ~*ccbot 1;
  ~*claude 1;
}
server {
  server_name blog.example.com;
  if ($badagent) {
    rewrite ^ /ai;
  }
  location /ai {
    proxy_pass http://127.0.0.1:42069;
  }
  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```
## Installation
`cargo install --path .`
Or, if you prefer Docker, an [image][oci:url] is available. If you're on NixOS, this repository is a flake, and provides a NixOS module to help deploying it. See [here](https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) for how to use that.
For more detailed instructions, see [docs/deploying.md](docs/deploying.md). Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them; see the [deployment documentation](docs/deploying.md).
## Configuration


@@ -109,3 +109,117 @@ Deploying under NixOS is made simple by using the nixosModule provided by this r
```
</details>
## Configuring the reverse proxy
While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
### nginx
Let's assume that we have a site running at `[::1]:8080`, and we want to serve it through `nginx`. Normally, that would look something like this:
```nginx
server {
  server_name blog.example.com;
  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```
To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents evaluate to a truthy value, while unmatched agents default to a falsy one. We can do this with a `map` outside of the `server` block:
``` nginx
map $http_user_agent $badagent {
  default 0;
  ~*gptbot 1;
  ~*chatgpt 1;
  ~*ccbot 1;
  ~*claude 1;
}
```
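To make the semantics of that `map` a bit more tangible, here is a rough Python sketch of how `$badagent` ends up being computed. This is only an illustration, not how `nginx` implements it: the function name `bad_agent` and the list name `BAD_AGENT_PATTERNS` are made up for this example, and the patterns are the four from the block above, matched case-insensitively, like `~*` does.

```python
import re

# The same patterns as in the nginx `map`, in the same order.
# In nginx, a `~*` prefix means "case-insensitive regular expression".
BAD_AGENT_PATTERNS = ["gptbot", "chatgpt", "ccbot", "claude"]


def bad_agent(user_agent: str) -> int:
    """Return 1 if any pattern matches the user agent, else the default, 0."""
    for pattern in BAD_AGENT_PATTERNS:
        if re.search(pattern, user_agent, re.IGNORECASE):
            return 1
    return 0
```

Anything that returns `1` here is what the `server` block will treat as a bad agent; everything else falls through to the `default 0`.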
Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not allow `proxy_pass` directly within a `server`-level `if` block. In the end, our `server` block will look like this:
```nginx
server {
  server_name blog.example.com;
  if ($badagent) {
    rewrite ^ /ai;
  }
  location /ai {
    proxy_set_header Host $host;
    proxy_pass http://127.0.0.1:42069;
  }
  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```
### Caddy
Let's assume that we have a site running at `[::1]:8080`, and we want to serve it through `Caddy`. Normally, that would look something like this:
```caddyfile
blog.example.com {
  reverse_proxy [::1]:8080
}
```
To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
```caddyfile
blog.example.com {
  @ai {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
  }
  reverse_proxy @ai 127.0.0.1:42069
  reverse_proxy [::1]:8080
}
```
We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:
```caddyfile
(ai-bots) {
  header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}
```
This is essentially the same thing as the `@ai` matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
```caddyfile
blog.example.com {
  rate_limit {
    zone ai-bots {
      match {
        import ai-bots
      }
      key {remote_host}
      events 16
      window 1m
    }
  }
  @ai {
    import ai-bots
  }
  @not-ai {
    not {
      import ai-bots
    }
  }
  reverse_proxy @ai 127.0.0.1:42069
  reverse_proxy @not-ai [::1]:8080
}
```
This does two things: it routes AI user-agents to `iocaine`, and applies a rate limit of 16 requests per minute to the remote hosts they originate from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a `Retry-After` header, to encourage them to come back to our little maze.
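If the "16 events per 1m window, keyed by remote host" part feels abstract, the idea can be sketched in a few lines of Python. To be clear, this is not the code `caddy-ratelimit` runs, just a simple sliding-window log that behaves along the same lines. The class name `SlidingWindowLimiter`, its `check` method, and the way `Retry-After` is computed here are all assumptions made for the sake of the example.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `events` requests per `window` seconds for each key."""

    def __init__(self, events: int = 16, window: float = 60.0):
        self.events = events
        self.window = window
        # key (for example, the remote host) -> timestamps of allowed requests
        self._seen = defaultdict(deque)

    def check(self, key: str, now: float):
        """Return (status, retry_after_seconds) for a request at time `now`."""
        timestamps = self._seen[key]
        # Forget requests that have fallen out of the window.
        while timestamps and now - timestamps[0] >= self.window:
            timestamps.popleft()
        if len(timestamps) < self.events:
            timestamps.append(now)
            return 200, 0
        # Over the limit: tell the client when the oldest request expires.
        retry_after = math.ceil(timestamps[0] + self.window - now)
        return 429, retry_after
```

Each remote host gets its own window, so one greedy crawler hitting the limit does not affect requests coming from a different address.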