Another round of documentation updates
Moved the reverse proxy examples out of the README, into `docs/deploying.md`. Also wrote the reverse proxy section for the deployment docs.

Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
parent
7b2e472e35
commit
fe7cf44f26
2 changed files with 116 additions and 45 deletions
47
README.md
@ -18,23 +18,7 @@ This is a tarpit, modeled after [Nepenthes](https://zadzmo.org/code/nepenthes/),
|
|||
|
||||
`iocaine` does not try to slow crawlers. It does not try to waste their time that way - that is left up to the reverse proxy. `iocaine` is *purely* about generating garbage.
|
||||
|
||||
To give you an idea, here's a simple Caddy configuration:
|
||||
|
||||
<a id="caddy-example"></a>
|
||||
``` caddyfile
(poison-ai) {
    @ai {
        header_regexp user-agent (?i:GPTBot|ChatGPT|CCBot|Claude)
    }
    reverse_proxy @ai 127.0.0.1:42069
}
blog.example.com {
    import poison-ai
    reverse_proxy [::1]:8080
}
```
|
||||
|
||||
If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md), or visit the [demo][demo:url] to see it in action.
|
||||
To get an idea of how it works, check the [demo][demo:url], or peek into the [deployment documentation](docs/deploying.md#configuring-the-reverse-proxy). If you wish to know more, see [docs/how-it-works.md](docs/how-it-works.md).
|
||||
|
||||
## Warning
|
||||
|
||||
|
@ -42,40 +26,13 @@ This is deliberately malicious software, intended to cause harm. Do not deploy i
|
|||
|
||||
Entrapment is done by the reverse proxy. Anything that ends up being served by `iocaine` will be trapped there: there are no outgoing links. Be careful what you route towards it.
|
||||
|
||||
## Usage
|
||||
|
||||
Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. An example for Caddy has been [shown above](#caddy-example); for `nginx`, something similar might look like this:
|
||||
|
||||
``` nginx
map $http_user_agent $badagent {
  default 0;
  ~*gptbot 1;
  ~*chatgpt 1;
  ~*ccbot 1;
  ~*claude 1;
}
server {
  server_name blog.example.com;
  if ($badagent) {
    rewrite ^ /ai;
  }
  location /ai {
    # proxy_pass needs a scheme in front of the address
    proxy_pass http://127.0.0.1:42069;
  }
  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```
|
||||
|
||||
## Installation
|
||||
|
||||
`cargo install --path .`
|
||||
|
||||
Or, if you prefer Docker, an [image][oci:url] is available. If you're on NixOS, this repository is a flake, and provides a NixOS module to help deploy it. See [here](https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) for how to use that.
|
||||
|
||||
For more detailed instructions, see [docs/deploying.md](docs/deploying.md).
|
||||
Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them; see the [deployment documentation](docs/deploying.md).
|
||||
|
||||
## Configuration
|
||||
|
||||
|
|
|
@ -109,3 +109,117 @@ Deploying under NixOS is made simple by using the nixosModule provided by this r
|
|||
```
|
||||
|
||||
</details>
|
||||
|
||||
## Configuring the reverse proxy
|
||||
|
||||
While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
|
||||
|
||||
As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
|
||||
|
||||
### nginx
|
||||
|
||||
Let's assume that we have a site running at `[::1]:8080`, and we want to serve it with `nginx`. Normally, that would look something like this:
|
||||
|
||||
```nginx
server {
  server_name blog.example.com;

  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```
|
||||
|
||||
To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents evaluate to a truthy value, while unmatched agents default to a falsy one. We can do this with a `map` outside of the `server` block:
|
||||
|
||||
``` nginx
map $http_user_agent $badagent {
  default 0;
  ~*gptbot 1;
  ~*chatgpt 1;
  ~*ccbot 1;
  ~*claude 1;
}
```
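To make the matching rule concrete, here is a minimal Python sketch of what the `map` above computes. It is only an illustration, not part of `iocaine` or `nginx`: the function name `is_bad_agent` and the sample user-agent strings are made up for the example.

```python
import re

# Same four patterns as the nginx `map`. In nginx, `~*` means a
# case-insensitive regular expression matched anywhere in the value.
BAD_AGENT = re.compile(r"gptbot|chatgpt|ccbot|claude", re.IGNORECASE)


def is_bad_agent(user_agent: str) -> int:
    """Return 1 for a matching user agent, 0 otherwise (the map's default)."""
    return 1 if BAD_AGENT.search(user_agent) else 0


if __name__ == "__main__":
    # Hypothetical user-agent strings, for illustration only.
    for ua in ("Mozilla/5.0 (compatible; GPTBot/1.0)", "Mozilla/5.0 Firefox/128.0"):
        print(ua, "->", is_bad_agent(ua))
```

Because the match is a substring search, anything that merely contains one of these words in its user-agent is treated as a bad agent too.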
|
||||
|
||||
Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:
|
||||
|
||||
|
||||
```nginx
server {
  server_name blog.example.com;
  if ($badagent) {
    rewrite ^ /ai;
  }
  location /ai {
    proxy_set_header Host $host;
    # proxy_pass needs a scheme in front of the address
    proxy_pass http://127.0.0.1:42069;
  }
  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```
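The routing decision this `server` block makes can be sketched in a few lines of Python. This is a simplified model under my own assumptions, not how `nginx` is implemented: the `route` function and the constant names are hypothetical, and only the two prefix locations from the example are modelled.

```python
import re

BAD_AGENT = re.compile(r"gptbot|chatgpt|ccbot|claude", re.IGNORECASE)

IOCAINE = "http://127.0.0.1:42069"  # upstream behind `location /ai`
SITE = "http://[::1]:8080"          # upstream behind `location /`


def route(user_agent, path):
    """Return (upstream, uri) roughly as the server block above would pick them."""
    # The server-level `if ($badagent) { rewrite ^ /ai; }` replaces the whole path.
    if BAD_AGENT.search(user_agent):
        path = "/ai"
    # Prefix locations: the longest matching prefix wins, so `/ai` beats `/`.
    if path.startswith("/ai"):
        return IOCAINE, path
    return SITE, path
```

One consequence visible in the sketch: any request whose path starts with `/ai` reaches the tarpit regardless of its user-agent, so the prefix should not collide with real content on the site.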
|
||||
|
||||
### Caddy
|
||||
|
||||
Let's assume that we have a site running at `[::1]:8080`, and we want to serve it with `Caddy`. Normally, that would look something like this:
|
||||
|
||||
```caddyfile
blog.example.com {
    reverse_proxy [::1]:8080
}
```
|
||||
|
||||
To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
|
||||
|
||||
```caddyfile
blog.example.com {
    @ai {
        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
    }
    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
```
|
||||
|
||||
We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration involving a bit of repetition, but we can mitigate that with a snippet. Let's start with that:
|
||||
|
||||
```caddyfile
(ai-bots) {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}
```
|
||||
|
||||
This is essentially the same thing as the `@ai` matcher, lifted out. It had to be lifted out because the same matcher will be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
|
||||
|
||||
```caddyfile
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key {remote_host}
            events 16
            window 1m
        }
    }

    @ai {
        import ai-bots
    }
    @not-ai {
        not {
            import ai-bots
        }
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
```
|
||||
|
||||
This does two things: it routes AI user-agents to `iocaine`, and applies a rate limit of 16 requests per minute to the remote hosts they originate from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests") with a `Retry-After` header, to encourage them to come back to our little maze.
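For a feel of what "16 events per 1m window, keyed by remote host" means in practice, here is a small sliding-window limiter in Python. It sketches the general technique under my own assumptions and is not the plugin's actual code: the class name, the `check` method and the addresses used below are hypothetical.

```python
import math
from collections import defaultdict, deque

EVENTS = 16    # `events 16` in the zone above
WINDOW = 60.0  # `window 1m`, in seconds


class SlidingWindowLimiter:
    def __init__(self, events=EVENTS, window=WINDOW):
        self.events = events
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of allowed requests

    def check(self, key, now):
        """Return (allowed, retry_after_seconds) for a request at time `now`."""
        q = self.hits[key]
        # Drop timestamps that have left the window.
        while q and now - q[0] >= self.window:
            q.popleft()
        if len(q) < self.events:
            q.append(now)
            return True, 0
        # Limit hit: the oldest request leaves the window at q[0] + window.
        return False, math.ceil(q[0] + self.window - now)
```

A rejected request corresponds to the 429 response, and the second value of the tuple is what would go into `Retry-After`. Each key (here, each remote host) gets its own window, so one busy crawler does not use up the allowance of another.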
|
||||
|
|