`iocaine` is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it as its own dedicated user, and to never expose it to the open Internet - always run it behind a reverse proxy.
Because half the work - routing AI crawlers towards `iocaine` - is left up to the reverse proxy, deploying `iocaine` is a two-step process: first, deploy `iocaine` itself; then, configure the reverse proxy appropriately.
Let's start with the first!
## Deploying `iocaine`
How to deploy `iocaine` depends heavily on what kind of system you're using. Below, you will find examples for deploying with `systemd`, without it, with `docker`, and on NixOS, using the module this repository's flake provides. This section deals with deployment only; configuration is documented in the main [README.md](../README.md#configuration).
<details>
<summary>Deploying with <code>systemd</code></summary>
See [`data/iocaine.service`](../data/iocaine.service) for a systemd service template. To use it, install `iocaine` somewhere, copy the service file to `/etc/systemd/system/`, and edit it so that it references the binary you installed and the configuration file you prepared.
When done editing, you can `systemctl daemon-reload` (as root, of course), followed by `systemctl start iocaine`. If everything went well, you're done.
The provided systemd service tries to restrict the tool as much as possible, and uses `DynamicUser=true`, meaning that no user needs to be created - systemd takes care of that.
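For the curious, such a hardened unit is built from pieces like these. This is a sketch with placeholder paths, not a copy of the shipped file - [`data/iocaine.service`](../data/iocaine.service) is the authoritative version:

```ini
[Unit]
Description=iocaine
After=network.target

[Service]
; systemd creates (and cleans up) an ephemeral user for the service
DynamicUser=true
; Placeholder paths: point these at your binary and configuration
ExecStart=/usr/local/bin/iocaine --config-file /etc/iocaine/config.toml
; A few of the usual hardening knobs
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
```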
</details>
<details>
<summary>Deploying without <code>systemd</code></summary>
To deploy without systemd, the easiest path is to create a dedicated user:
```shell
useradd -m iocaine
```
Then, place the `iocaine` binary and the configuration you prepared into this user's home directory (note that `$HOME` in a root shell would point at root's home, so spell the path out):
```shell
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
chown -R iocaine:iocaine /home/iocaine/iocaine
```
Then, you can run it like this:
```shell
su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'
```
</details>
<details>
<summary>Deploying via Docker</summary>
There's an automatically built container image for those who may wish to try - or deploy - `iocaine` via Docker. The best way to use it is likely via `docker compose`. An example of that is provided in [`data/compose.yaml`](../data/compose.yaml).
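For orientation, such a compose file takes roughly this shape - the image name, volume target, and port mapping below are assumptions, the referenced [`data/compose.yaml`](../data/compose.yaml) is the authoritative version:

```yaml
# Sketch only: image name, volume target, and ports are assumptions,
# see data/compose.yaml for the real values.
services:
  iocaine:
    image: iocaine:latest
    volumes:
      - ./container-volume:/data
    ports:
      - "42069:42069"
```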
To use it, place the word list and the training text in `data/container-volume`, and then you can simply start things up like this:
```shell
docker compose up -d
```
Voila!
</details>
<details>
<summary>Deploying on NixOS</summary>
Deploying under NixOS is made simple by the nixosModule provided by this repository's flake. It takes care of setting up a sufficiently hardened `systemd` service, so all that is required of you is to enable the service and configure the sources.
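As a sketch - the option names below are assumptions, consult the flake's nixosModule for the authoritative ones - such a configuration might look like this:

```nix
# Sketch only: option names are assumptions, check the flake's nixosModule.
{
  services.iocaine = {
    enable = true;
    config = {
      sources = {
        # The word list and markov training sources iocaine should use
        words = "/path/to/wordlist";
        markov = [ "/path/to/training-text" ];
      };
    };
  };
}
```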
</details>

## Configuring the reverse proxy

While `iocaine` itself is good at generating garbage, it does so indiscriminately. That's not what we want: we want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the sections below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
### nginx
Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:
```nginx
server {
    server_name blog.example.com;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```
To serve something different to AI user agents, the idea is to create a mapping between user agent and badness, such that AI agents evaluate to a truthy value, while anything unmatched defaults to a falsy one. We can do this with a `map` outside of the `server` block:
```nginx
map $http_user_agent $badagent {
    default   0;
    ~*gptbot  1;
    ~*chatgpt 1;
    ~*ccbot   1;
    ~*claude  1;
}
```
Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:
```nginx
server {
    server_name blog.example.com;

    if ($badagent) {
        rewrite ^ /ai;
    }

    location /ai {
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:42069;
    }

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```
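Once `nginx` is reloaded, the routing can be sanity-checked by impersonating a crawler - requests with a matching user agent should receive `iocaine`'s garbage, while everything else gets the real site (the hostname here is the one from the example above):

```shell
# A normal browser user agent - served by the real site
curl -H 'User-Agent: Mozilla/5.0' https://blog.example.com/
# A user agent matching the map (case-insensitive) - served by iocaine
curl -H 'User-Agent: GPTBot/1.0' https://blog.example.com/
```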
### Caddy
Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `Caddy`. Normally, that would look something like this:
```caddyfile
blog.example.com {
    reverse_proxy [::1]:8080
}
```
To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
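A minimal version of that might look like this - the user-agent pattern mirrors the nginx example above, and is an illustration rather than an exhaustive list:

```caddyfile
blog.example.com {
    @ai header_regexp User-Agent (?i)(gptbot|chatgpt|ccbot|claude)

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
```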
We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:
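A snippet along these lines will do - again, the user-agent pattern is illustrative, extend it to match the crawlers you care about:

```caddyfile
(ai-bots) {
    header_regexp User-Agent (?i)(gptbot|chatgpt|ccbot|claude)
}
```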
This is essentially the same thing as the `@ai` matcher, lifted out. The reason it had to be lifted out is that the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
```caddyfile
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key    {remote_host}
            events 16
            window 1m
        }
    }

    @ai {
        import ai-bots
    }

    @not-ai {
        not {
            import ai-bots
        }
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
```
This does two things: it routes AI user agents to `iocaine`, and applies a rate limit of 16 requests per minute to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests") with a `Retry-After` header, to encourage them to come back to our little maze.
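Note that `rate_limit` is not part of stock Caddy: you'll need a build that includes the plugin, for example one produced with [xcaddy](https://github.com/caddyserver/xcaddy):

```shell
xcaddy build --with github.com/mholt/caddy-ratelimit
```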