Mirror of https://git.madhouse-project.org/algernon/iocaine.git, synced 2025-03-10 09:18:49 +01:00
Another round of documentation updates
Moved the reverse proxy examples out of the README, into `docs/deploying.md`. Also wrote the reverse proxy section for the deployment docs.

Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
parent 7b2e472e35
commit fe7cf44f26
2 changed files with 116 additions and 45 deletions
47
README.md
@@ -18,23 +18,7 @@ This is a tarpit, modeled after [Nepenthes](https://zadzmo.org/code/nepenthes/),

 `iocaine` does not try to slow crawlers. It does not try to waste their time that way - that is left up to the reverse proxy. `iocaine` is *purely* about generating garbage.

-To give you an idea, here's a simple Caddy configuration:
+To give you an idea how it works, check the [demo][demo:url], or peek into the [deployment documentation](docs/deployment.md#configuring-the-reverse-proxy). If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md).

-<a id="caddy-example"></a>
-``` caddyfile
-(poison-ai) {
-    @ai {
-        header_regexp user-agent (?i:GPTBot|ChatGPT|CCBot|Claude)
-    }
-    reverse_proxy @ai 127.0.0.1:42069
-}
-blog.example.com {
-    import poison-ai
-    reverse_proxy [::1]:8080
-}
-```
-
-If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md), or visit the [demo][demo:url] to see it in action.
-
 ## Warning

@@ -42,40 +26,13 @@ This is deliberately malicious software, intended to cause harm. Do not deploy i

 Entrapment is done by the reverse proxy. Anything that ends up being served by `iocaine` will be trapped there: there are no outgoing links. Be careful what you route towards it.

-## Usage
-
-Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. An example for Caddy has been [shown above](#caddy-example), for `nginx`, something similar might look like this:
-
-``` nginx
-map $http_user_agent $badagent {
-    default 0;
-    ~*gptbot 1;
-    ~*chatgpt 1;
-    ~*ccbot 1;
-    ~*claude 1;
-}
-server {
-    server_name blog.example.com;
-    if ($badagent) {
-        rewrite ^ /ai;
-    }
-    location /ai {
-        proxy_pass 127.0.0.1:42069;
-    }
-    location / {
-        proxy_set_header Host $host;
-        proxy_pass http://[::1]:8080;
-    }
-}
-```
-
 ## Installation

 `cargo install --path .`

 Or, if you prefer Docker, an [image][oci:url] is available. If you're on NixOS, this repository is a flake, and provides a NixOS module to help deploying it. See [here](https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) for how to use that.

-For more detailed instructions, see [docs/deploying.md](docs/deploying.md).
+Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them, see the [deployment documentation](docs/deployment.md).

 ## Configuration

@@ -109,3 +109,117 @@ Deploying under NixOS is made simple by using the nixosModule provided by this r
 ```

 </details>
+
+## Configuring the reverse proxy
+
+While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
+
+As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
+
+### nginx
+
+Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:
+
+```nginx
+server {
+    server_name blog.example.com;
+
+    location / {
+        proxy_set_header Host $host;
+        proxy_pass http://[::1]:8080;
+    }
+}
+```
+
+To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents will evaluate to a truthy value, while unmatched agents will default to a falsy one. We can do this with a `map` outside of the `server` block:
+
+``` nginx
+map $http_user_agent $badagent {
+    default 0;
+    ~*gptbot 1;
+    ~*chatgpt 1;
+    ~*ccbot 1;
+    ~*claude 1;
+}
+```
+
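To make the `map` semantics concrete, here is a rough Python model (not nginx code) of how `$badagent` is derived and then tested by `if`. It assumes nginx's documented behaviour: `~*` entries are case-insensitive regular expressions that may match anywhere in the value, the first matching entry wins, `default` applies otherwise, and `if ($var)` is false only for an empty string or `"0"`. The names `bad_agent` and `nginx_truthy` are made up for this sketch.

```python
import re

# Mirrors the `map` entries above; `~*` means "case-insensitive regex".
BAD_AGENT_PATTERNS = [
    re.compile(pattern, re.IGNORECASE)
    for pattern in ("gptbot", "chatgpt", "ccbot", "claude")
]


def bad_agent(user_agent: str) -> str:
    """Return the value nginx would store in $badagent ("1" or "0")."""
    for pattern in BAD_AGENT_PATTERNS:
        # search(), not match(): the regexes are not anchored.
        if pattern.search(user_agent):
            return "1"
    return "0"  # corresponds to the `default 0;` entry


def nginx_truthy(value: str) -> bool:
    """Model `if ($var)`: false only for an empty string or "0"."""
    return value not in ("", "0")
```

With this model, `nginx_truthy(bad_agent("Mozilla/5.0 (compatible; GPTBot/1.0)"))` is true, so such a request would be sent to `iocaine`, while an ordinary browser user-agent falls through to the `default` entry.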
+Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within a `server`-level `if` block. In the end, our `server` block will look like this:
+
+
+```nginx
+server {
+    server_name blog.example.com;
+    if ($badagent) {
+        rewrite ^ /ai;
+    }
+    location /ai {
+        proxy_set_header Host $host;
+        proxy_pass http://127.0.0.1:42069;
+    }
+    location / {
+        proxy_set_header Host $host;
+        proxy_pass http://[::1]:8080;
+    }
+}
+```
+
+### Caddy
+
+Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `Caddy`. Normally, that would look something like this:
+
+```caddyfile
+blog.example.com {
+    reverse_proxy [::1]:8080
+}
+```
+
+To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
+
+```caddyfile
+blog.example.com {
+    @ai {
+        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
+    }
+    reverse_proxy @ai 127.0.0.1:42069
+    reverse_proxy [::1]:8080
+}
+```
+
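The routing decision in the Caddyfile above can be modelled in a few lines of Python. This is only an illustration, under the assumption that the `reverse_proxy` with the `@ai` matcher is tried before the unmatched one and that the regular expression may match anywhere in the header value. `pick_upstream` is a hypothetical helper, not part of Caddy.

```python
import re

# Same expression as in the `@ai` matcher; (?i:...) makes the group case-insensitive.
AI_AGENTS = re.compile(r"(?i:gptbot|chatgpt|ccbot|claude)")

IOCAINE_UPSTREAM = "127.0.0.1:42069"  # addresses taken from the example above
SITE_UPSTREAM = "[::1]:8080"


def pick_upstream(headers: dict[str, str]) -> str:
    """Return the upstream a request with these headers would be proxied to."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    user_agent = lowered.get("user-agent", "")
    if AI_AGENTS.search(user_agent):
        return IOCAINE_UPSTREAM
    return SITE_UPSTREAM
```

A request without a `User-Agent` header matches nothing and ends up at the real site, which is the safe default given the warning about being careful what gets routed to `iocaine`.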
+We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:
+
+```caddyfile
+(ai-bots) {
+    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
+}
+```
+
+This is essentially the same thing as the `@ai` matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
+
+```caddyfile
+blog.example.com {
+    rate_limit {
+        zone ai-bots {
+            match {
+                import ai-bots
+            }
+            key {remote_host}
+            events 16
+            window 1m
+        }
+    }
+
+    @ai {
+        import ai-bots
+    }
+    @not-ai {
+        not {
+            import ai-bots
+        }
+    }
+
+    reverse_proxy @ai 127.0.0.1:42069
+    reverse_proxy @not-ai [::1]:8080
+}
+```
+
+This does two things: it routes AI user-agents to `iocaine`, and applies a 16 request / minute rate limit to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a `Retry-After` header, to encourage them to come back to our little maze.
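To illustrate what "16 events per 1m window, keyed by remote host" means in practice, here is a minimal sliding-window limiter in Python. It is a sketch of the general technique, not a description of how caddy-ratelimit is implemented internally. The class name, the injected `now` timestamp, and the way `Retry-After` is computed are assumptions made for the example.

```python
import math
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `events` requests per `window` seconds for each key."""

    def __init__(self, events: int = 16, window: float = 60.0):
        self.events = events
        self.window = window
        self._seen = defaultdict(deque)  # key -> timestamps of accepted requests

    def check(self, key: str, now: float) -> tuple[int, dict[str, str]]:
        """Return (status, headers) for a request from `key` at time `now`."""
        stamps = self._seen[key]
        # Forget requests that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) < self.events:
            stamps.append(now)
            return 200, {}
        # Over the limit: tell the client when the oldest request expires.
        retry_after = math.ceil(stamps[0] + self.window - now)
        return 429, {"Retry-After": str(retry_after)}
```

Keying on the remote host means one aggressive crawler cannot use up the budget of another. Rejected requests are not recorded, so a client that honours `Retry-After` is let back in as soon as its oldest request leaves the window.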