`iocaine` is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it under its own dedicated user, and to never expose it to the open Internet - always run it behind a reverse proxy.

Because half the work - routing AI crawlers towards `iocaine` - is left up to the reverse proxy, deploying `iocaine` is a two-step process: the first step is to deploy `iocaine` itself, the second is to properly configure the reverse proxy. Let's start with the first!

## Deploying `iocaine`

How to deploy `iocaine` highly depends on what kind of system you're using. Below, you will find examples for deploying with `systemd`, without it, with `docker`, and on NixOS, using the module this repository's flake provides. This section deals with deployment; configuration is documented in the main [README.md](../README.md#configuration).
### Deploying with systemd

See [`data/iocaine.service`](../data/iocaine.service) for a systemd service template. To use it, install `iocaine` somewhere, copy the service file to `/etc/systemd/system/`, and edit it so it references the binary you installed and the configuration file you prepared. When done editing, run `systemctl daemon-reload` (as root, of course), followed by `systemctl start iocaine`. If everything went well, you're done.

The provided systemd service tries to restrict the tool as much as possible, and uses `DynamicUser=true`, meaning that no user needs to be created - systemd will take care of it.
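If you also want the service started at boot, and would like a quick sanity check, something along these lines should do. The `127.0.0.1:42069` address is an assumption, borrowed from the reverse proxy examples below - use whatever address your configuration binds to:

```shell
# Start now, and enable at boot
systemctl enable --now iocaine

# Verify the service is up
systemctl status iocaine

# iocaine answers every request with generated garbage, so any request
# aimed directly at it should return a page of nonsense
curl -s http://127.0.0.1:42069/ | head
```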
### Deploying without systemd

To deploy without systemd, the easiest path is to create a dedicated user:

```shell
useradd -m iocaine
```

Then, place the `iocaine` binary and the configuration you prepared into this user's home directory:

```shell
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
```

Then, you can run it like this:

```shell
su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'
```
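Note that the command above runs `iocaine` in the foreground, tied to your terminal. If you are not using any service manager, one quick - if crude - way to keep it running in the background is `nohup`; this is only a sketch, and the log path is just an example:

```shell
# Run iocaine in the background as the dedicated user, ignoring hangups,
# and append its output to a log file in the user's home directory
su -l iocaine -c 'nohup /home/iocaine/iocaine/iocaine \
    --config-file /home/iocaine/iocaine/config.toml \
    </dev/null >>/home/iocaine/iocaine.log 2>&1 &'
```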
### Deploying via Docker

There's an automatically built container image for those who may wish to try - or deploy - `iocaine` via Docker. The best way to use it is likely via `docker compose`; an example of that is provided in [`data/compose.yaml`](../data/compose.yaml). To use it, place the word list and the training text in `data/container-volume`, and then you can simply start things up like this:

```shell
docker compose up -d
```

Voila!
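To verify that the container came up fine, the usual `docker compose` tooling applies. The service name below is an assumption - use whatever name [`data/compose.yaml`](../data/compose.yaml) defines:

```shell
# List the services and their state
docker compose ps

# Follow the container's logs; replace "iocaine" with the service name
# from data/compose.yaml if it differs
docker compose logs -f iocaine
```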
### Deploying on NixOS

Deploying on NixOS is made simple by the nixosModule provided by this repository's flake. It takes care of setting up the `systemd` service, sufficiently hardened, so all that is required of you is to enable the service, and configure the sources.

```nix
{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    iocaine = {
      url = "git+https://git.madhouse-project.org/algernon/iocaine.git";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { ... }@inputs: {
    nixosConfigurations = {
      your-hostname = inputs.nixpkgs.lib.nixosSystem {
        inherit inputs;
        modules = [
          inputs.iocaine.nixosModules.default
          (
            { inputs, lib, config, pkgs, ... }:
            {
              services.iocaine = {
                enable = true;
                config = {
                  sources = {
                    words = "${pkgs.scowl}/share/dict/wamerican.txt";
                    markov = [
                      "/some/path/to/a/training-document.txt"
                    ];
                  };
                };
              };
            }
          )
        ];
      };
    };
  };
}
```
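With the flake in place, a rebuild activates the service. This assumes the flake lives in the current directory, and that `your-hostname` matches the attribute name used under `nixosConfigurations` above:

```shell
# Build and switch to the new configuration, including the iocaine service
nixos-rebuild switch --flake .#your-hostname
```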
## Configuring the reverse proxy

While `iocaine` itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the paragraphs below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/). As I am a recent Caddy convert, the Caddy example will be more complete - sorry!

### nginx

Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:

```nginx
server {
    server_name blog.example.com;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```

To serve something different for AI user agents, the idea is to create a mapping between user agent and badness, such that AI agents evaluate to a truthy value, while everything unmatched defaults to a falsy one. We can do this with a `map` outside of the `server` block:

```nginx
map $http_user_agent $badagent {
    default     0;
    ~*gptbot    1;
    ~*chatgpt   1;
    ~*ccbot     1;
    ~*claude    1;
}
```

Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:

```nginx
server {
    server_name blog.example.com;

    if ($badagent) {
        rewrite ^ /ai;
    }

    location /ai {
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:42069;
    }

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```

### Caddy

Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `Caddy`. Normally, that would look something like this:

```caddyfile
blog.example.com {
    reverse_proxy [::1]:8080
}
```

To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:

```caddyfile
blog.example.com {
    @ai {
        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
```

We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:

```caddyfile
(ai-bots) {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}
```

This is essentially the same thing as the `@ai` matcher, lifted out. The reason it had to be lifted out is that the same matcher has to be reused in slightly differing contexts, including ones where a named matcher can't be used. It sounds more complicated than it is, really, so let me show the final result:

```caddyfile
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key    {remote_host}
            events 16
            window 1m
        }
    }

    @ai {
        import ai-bots
    }
    @not-ai {
        not {
            import ai-bots
        }
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
```

This does two things: it routes AI user agents to `iocaine`, and applies a 16 requests per minute rate limit to the remote hosts those requests originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests") with a `Retry-After` header, to encourage them to come back to our little maze.
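Whichever reverse proxy you use, once it has been reloaded, a quick way to check the routing is to request the site with and without a matched user agent. This is only a sketch, using the example hostname from above:

```shell
# A matched user-agent should receive iocaine's endless garbage...
curl -s -A "GPTBot" https://blog.example.com/ | head

# ...while a normal browser user-agent should get the real site.
curl -s -A "Mozilla/5.0" https://blog.example.com/ | head
```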