mirror of
https://git.madhouse-project.org/algernon/iocaine.git
synced 2025-03-10 09:18:49 +01:00
Moved the reverse proxy examples out of the README, into `docs/deploying.md`. Also wrote the reverse proxy section for the deployment docs. Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
225 lines
7.4 KiB
Markdown
`iocaine` is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it as its own dedicated user, and never expose it to the open Internet - always run it behind a reverse proxy.

Because half the work - the routing of AI crawlers towards `iocaine` - is left up to the reverse proxy, deploying `iocaine` is going to be a two-step process: the first step is to deploy `iocaine` itself, and the second is to properly configure the reverse proxy.

Let's start with the first!

## Deploying `iocaine`

How to deploy `iocaine` depends heavily on what kind of system you're using. Below, you will find examples for deploying with `systemd`, without it, with `docker`, and on NixOS, using the module this repository's flake provides. This section deals with deployment only; configuration is documented in the main [README.md](../README.md#configuration).

<details>
<summary>Deploying with <code>systemd</code></summary>

See [`data/iocaine.service`](../data/iocaine.service) for a systemd service template. To use it, install `iocaine` somewhere, copy the service file to `/etc/systemd/system/`, and edit it so that it references the binary you installed and the configuration file you prepared.

When done editing, run `systemctl daemon-reload` (as root, of course), followed by `systemctl start iocaine`. If everything went well, you're done.

The provided systemd service tries to restrict the tool as much as possible, and uses `DynamicUser=true`, meaning that no user needs to be created: systemd will take care of it.

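
The steps above can be sketched as a small shell function. This is only a sketch under a few assumptions: it is run as root, the unit is installed under the name `iocaine.service`, and the file passed in is a copy of the template that has already been edited to point at your binary and configuration. The name `deploy_unit` is made up for this example.

```shell
# Sketch only: install an already-edited copy of data/iocaine.service and start it.
# `deploy_unit` is a hypothetical helper name; run it as root.
deploy_unit() {
    unit_src="${1:-data/iocaine.service}"   # edited copy of the service template
    unit_dst="/etc/systemd/system/iocaine.service"

    cp "$unit_src" "$unit_dst" &&
        systemctl daemon-reload &&           # make systemd pick up the new unit
        systemctl start iocaine &&
        systemctl --no-pager status iocaine  # check that the service is running
}

# Usage: deploy_unit ./iocaine.service
```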
</details>

<details>
<summary>Deploying without <code>systemd</code></summary>

To deploy without systemd, the easiest path is to create a dedicated user:

```shell
useradd -m iocaine
```

Then, place the `iocaine` binary and the configuration you prepared into this user's `$HOME`:

```shell
# Run as root; /home/iocaine is the home directory created by `useradd -m` above
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
```

Then, you can run it like this:

```shell
# `su` takes the user name as an argument and the command via -c
su -l iocaine -c "/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml"
```

</details>

<details>
<summary>Deploying via Docker</summary>

There's an automatically built container image, for those who may wish to try - or deploy - `iocaine` via Docker. The best way to use it is likely via `docker compose`. An example of that is provided in [`data/compose.yaml`](../data/compose.yaml).

To use it, place the word list and the training text in `data/container-volume`, and then you can simply start things up like this:

```shell
docker compose up -d
```

Voilà!

</details>

<details>
<summary>Deploying on NixOS</summary>

Deploying under NixOS is made simple by the `nixosModule` provided by this repository's flake. It takes care of setting up the `systemd` service, sufficiently hardened, so all that is required of you is to enable the service and configure the sources.

```nix
{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    iocaine = {
      url = "git+https://git.madhouse-project.org/algernon/iocaine.git";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { ... }@inputs: {
    nixosConfigurations = {
      your-hostname = inputs.nixpkgs.lib.nixosSystem {
        # Pass the flake inputs on to the modules below.
        specialArgs = {
          inherit inputs;
        };
        modules = [
          inputs.iocaine.nixosModules.default
          (
            {
              inputs,
              lib,
              config,
              pkgs,
              ...
            }:
            {
              services.iocaine = {
                enable = true;
                config = {
                  sources = {
                    words = "${pkgs.scowl}/share/dict/wamerican.txt";
                    markov = [
                      "/some/path/to/a/training-document.txt"
                    ];
                  };
                };
              };
            }
          )
        ];
      };
    };
  };
}
```

</details>

## Configuring the reverse proxy

While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).

As I am a recent Caddy convert, the Caddy example will be more complete - sorry!

### nginx

Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:

```nginx
server {
    server_name blog.example.com;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```

To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents will evaluate to a truthy value, while unmatched agents will default to a false-y one. We can do this with a `map` outside of the `server` block:

```nginx
map $http_user_agent $badagent {
    default 0;
    ~*gptbot 1;
    ~*chatgpt 1;
    ~*ccbot 1;
    ~*claude 1;
}
```
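
Before reloading nginx, it can help to check which user-agent strings the list above would catch. The sketch below is not how nginx evaluates the `map`; it only approximates the same case-insensitive matching with `grep -Ei`. The function name `is_bad_agent` and the sample user-agent strings are made up for illustration.

```shell
# Approximate the `map` above: succeed (exit 0) when the user agent matches
# one of the patterns, ignoring case. `is_bad_agent` is a hypothetical helper.
is_bad_agent() {
    printf '%s\n' "$1" | grep -Eiq 'gptbot|chatgpt|ccbot|claude'
}

# Illustrative user-agent strings, not taken from real crawler traffic.
for ua in "ExampleCrawler/1.0 (compatible; GPTBot/1.1)" \
          "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"; do
    if is_bad_agent "$ua"; then
        echo "bad:  $ua"
    else
        echo "good: $ua"
    fi
done
```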

Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not allow `proxy_pass` directly within a server-level `if` block. In the end, our `server` block will look like this:

```nginx
server {
    server_name blog.example.com;
    if ($badagent) {
        rewrite ^ /ai;
    }
    location /ai {
        proxy_set_header Host $host;
        # proxy_pass needs a URL with a scheme
        proxy_pass http://127.0.0.1:42069;
    }
    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```
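
To check the routing, one option is to request the site twice with `curl`, once with a crawler-like user agent and once with an ordinary one, and compare what comes back. This is a sketch that assumes the configuration above is live on `blog.example.com` over plain HTTP; `fetch_as` is a hypothetical helper name, and the user-agent strings are only examples.

```shell
# Fetch a URL while presenting the given User-Agent; prints the response body.
# `fetch_as` is a hypothetical helper, not part of iocaine or nginx.
fetch_as() {
    curl --silent --user-agent "$1" "$2"
}

# Usage: the first request should return iocaine's generated text,
# the second should return the real site.
#   fetch_as "GPTBot/1.0"  http://blog.example.com/ | head -n 5
#   fetch_as "Mozilla/5.0" http://blog.example.com/ | head -n 5
```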

### Caddy

Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with Caddy. Normally, that would look something like this:

```caddyfile
blog.example.com {
    reverse_proxy [::1]:8080
}
```

To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:

```caddyfile
blog.example.com {
    @ai {
        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
    }
    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
```

We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:

```caddyfile
(ai-bots) {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}
```

This is essentially the same thing as the `@ai` matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:

```caddyfile
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key {remote_host}
            events 16
            window 1m
        }
    }

    @ai {
        import ai-bots
    }
    @not-ai {
        not {
            import ai-bots
        }
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
```

This does two things: it routes AI user-agents to `iocaine`, and applies a rate limit of 16 requests per minute to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a `Retry-After` header, to encourage them to come back to our little maze.

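
One way to see the rate limit in action is to send more than 16 requests within a minute and watch the status codes. The sketch below assumes the configuration above is deployed on `https://blog.example.com` with the `caddy-ratelimit` module built into Caddy; `probe_rate_limit` is a made-up helper name, and the user agent is only an example.

```shell
# Send N requests with a crawler-like User-Agent and print each HTTP status code.
# `probe_rate_limit` is a hypothetical helper, not part of iocaine or Caddy.
probe_rate_limit() {
    url="$1"
    count="${2:-20}"   # default to 20 requests, a few more than the limit
    i=1
    while [ "$i" -le "$count" ]; do
        curl --silent --output /dev/null --write-out '%{http_code}\n' \
            --user-agent "GPTBot/1.0" "$url"
        i=$((i + 1))
    done
}

# Usage: probe_rate_limit https://blog.example.com/ 20
# Once the limit is exceeded, the printed codes should switch to 429.
```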