`iocaine` is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it under its own dedicated user, and to never expose it to the open Internet - always run it behind a reverse proxy.

Because half the work - routing AI crawlers towards `iocaine` - is left up to the reverse proxy, deploying `iocaine` is a two-step process: the first step is to deploy `iocaine` itself, the second is to properly configure the reverse proxy. Let's start with the first!

## Deploying `iocaine`

How to deploy `iocaine` highly depends on what kind of system you're using. Below, you will find examples for deploying with `systemd`, without it, with `docker`, and on NixOS, using the module this repository's flake provides. This section deals with deployment; configuration is documented in the main [README.md](../README.md#configuration).
### Deploying with systemd

See [`data/iocaine.service`](../data/iocaine.service) for a systemd service template. To use it, install `iocaine` somewhere, copy the service file to `/etc/systemd/system/`, and edit it so it references the binary you installed and the configuration file you prepared. When done editing, run `systemctl daemon-reload` (as root, of course), followed by `systemctl start iocaine`. If everything went well, you're done.

The provided systemd service tries to restrict the tool as much as possible, and uses `DynamicUser=true`, meaning that no user needs to be created - systemd will take care of it.
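If you also want the service started at boot, and would like a quick sanity check, something along these lines should do. The `127.0.0.1:42069` address is an assumption, borrowed from the reverse proxy examples below - use whatever address your configuration binds to:

```shell
# Start now, and enable at boot
systemctl enable --now iocaine

# Verify the service is up
systemctl status iocaine

# iocaine answers every request with generated garbage, so any request
# aimed directly at it should return a page of nonsense
curl -s http://127.0.0.1:42069/ | head
```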
### Deploying without systemd

To deploy without systemd, the easiest path is to create a dedicated user:

```shell
useradd -m iocaine
```

Then, place the `iocaine` binary and the configuration you prepared into this user's home directory:

```shell
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
```

Then, you can run it like this:

```shell
su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'
```
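Note that the command above runs `iocaine` in the foreground, tied to your terminal. If you are not using any service manager, one quick - if crude - way to keep it running in the background is `nohup`; this is only a sketch, and the log path is just an example:

```shell
# Run iocaine in the background as the dedicated user, ignoring hangups,
# and append its output to a log file in the user's home directory
su -l iocaine -c 'nohup /home/iocaine/iocaine/iocaine \
    --config-file /home/iocaine/iocaine/config.toml \
    </dev/null >>/home/iocaine/iocaine.log 2>&1 &'
```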
### Deploying via Docker

There's an automatically built container image for those who may wish to try - or deploy - `iocaine` via Docker. The best way to use it is likely via `docker compose`; an example of that is provided in [`data/compose.yaml`](../data/compose.yaml). To use it, place the word list and the training text in `data/container-volume`, and then you can simply start things up like this:

```shell
docker compose up -d
```

Voila!
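To verify that the container came up fine, the usual `docker compose` tooling applies. The service name below is an assumption - use whatever name [`data/compose.yaml`](../data/compose.yaml) defines:

```shell
# List the services and their state
docker compose ps

# Follow the container's logs; replace "iocaine" with the service name
# from data/compose.yaml if it differs
docker compose logs -f iocaine
```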
### Deploying on NixOS

Deploying on NixOS is made simple by the nixosModule provided by this repository's flake. It takes care of setting up the `systemd` service, sufficiently hardened, so all that is required of you is to enable the service, and configure the sources.

```nix
{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    iocaine = {
      url = "git+https://git.madhouse-project.org/algernon/iocaine.git";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { ... }@inputs: {
    nixosConfigurations = {
      your-hostname = inputs.nixpkgs.lib.nixosSystem {
        inherit inputs;
        modules = [
          inputs.iocaine.nixosModules.default
          (
            { inputs, lib, config, pkgs, ... }:
            {
              services.iocaine = {
                enable = true;
                config = {
                  sources = {
                    words = "${pkgs.scowl}/share/dict/wamerican.txt";
                    markov = [
                      "/some/path/to/a/training-document.txt"
                    ];
                  };
                };
              };
            }
          )
        ];
      };
    };
  };
}
```
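With the flake in place, a rebuild activates the service. This assumes the flake lives in the current directory, and that `your-hostname` matches the attribute name used under `nixosConfigurations` above:

```shell
# Build and switch to the new configuration, including the iocaine service
nixos-rebuild switch --flake .#your-hostname
```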
## Configuring the reverse proxy

While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/). As I am a recent Caddy convert, the Caddy example will be more complete - sorry!

### nginx

Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:

```nginx
server {
    server_name blog.example.com;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```

To serve something different for AI user agents, the idea is to create a mapping between user agent and badness, such that AI agents evaluate to a truthy value, while everything unmatched defaults to a falsy one. We can do this with a `map` outside of the `server` block:

```nginx
map $http_user_agent $badagent {
    default     0;
    ~*gptbot    1;
    ~*chatgpt   1;
    ~*ccbot     1;
    ~*claude    1;
}
```

Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:

```nginx
server {
    server_name blog.example.com;

    if ($badagent) {
        rewrite ^ /ai;
    }

    location /ai {
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:42069;
    }

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```

### Caddy

Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `Caddy`. Normally, that would look something like this:

```caddyfile
blog.example.com {
    reverse_proxy [::1]:8080
}
```

To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:

```caddyfile
blog.example.com {
    @ai {
        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
```

We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:

```caddyfile
(ai-bots) {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}
```

This is essentially the same thing as the `@ai` matcher, lifted out. The reason it had to be lifted out is that the same matcher has to be reused in slightly differing contexts, including ones where a named matcher can't be used. It sounds more complicated than it is, really, so let me show the final result:

```caddyfile
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key    {remote_host}
            events 16
            window 1m
        }
    }

    @ai {
        import ai-bots
    }
    @not-ai {
        not {
            import ai-bots
        }
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
```

This does two things: it routes AI user agents to `iocaine`, and applies a 16 requests per minute rate limit to the remote hosts those requests originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests") with a `Retry-After` header, to encourage them to come back to our little maze.
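Whichever reverse proxy you use, once it has been reloaded, a quick way to check the routing is to request the site with and without a matched user agent. This is only a sketch, using the example hostname from above:

```shell
# A matched user-agent should receive iocaine's endless garbage...
curl -s -A "GPTBot" https://blog.example.com/ | head

# ...while a normal browser user-agent should get the real site.
curl -s -A "Mozilla/5.0" https://blog.example.com/ | head
```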