`iocaine` is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it as its own dedicated user, and to never expose it to the open Internet - always run it behind a reverse proxy.
Because half the work - routing AI crawlers towards `iocaine` - is left up to the reverse proxy, deploying `iocaine` is a two-step process: first, deploy `iocaine` itself; then, configure the reverse proxy appropriately.
Let's start with the first!
## Deploying `iocaine`
How to deploy `iocaine` depends heavily on what kind of system you're using. Below, you will find examples for deploying with `systemd`, without it, with `docker`, and on NixOS, using the module this repository's flake provides. This section deals with deployment only; configuration is documented in the main [README.md](../README.md#configuration).
<details>
<summary>Deploying with <code>systemd</code></summary>
See [`data/iocaine.service`](../data/iocaine.service) for a systemd service template. To use it, install `iocaine` somewhere, copy the service file to `/etc/systemd/system/`, and edit it so that it references the binary you installed and the configuration file you prepared.
When done editing, you can `systemctl daemon-reload` (as root, of course), followed by `systemctl start iocaine`. If everything went well, you're done.
The provided systemd service tries to restrict the tool as much as possible, and uses `DynamicUser=true`, meaning that no user needs to be created - systemd takes care of that.
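For the curious, such a hardened unit is built from pieces like these. This is a sketch with placeholder paths, not a copy of the shipped file - [`data/iocaine.service`](../data/iocaine.service) is the authoritative version:

```ini
[Unit]
Description=iocaine
After=network.target

[Service]
; systemd creates (and cleans up) an ephemeral user for the service
DynamicUser=true
; Placeholder paths: point these at your binary and configuration
ExecStart=/usr/local/bin/iocaine --config-file /etc/iocaine/config.toml
; A few of the usual hardening knobs
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true

[Install]
WantedBy=multi-user.target
```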
</details>
<details>
<summary>Deploying without <code>systemd</code></summary>
To deploy without systemd, the easiest path is to create a dedicated user:
```shell
useradd -m iocaine
```
Then, place the `iocaine` binary and the configuration you prepared into this user's home directory (note that `$HOME` in a root shell would point at root's home, so spell the path out):
```shell
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
chown -R iocaine:iocaine /home/iocaine/iocaine
```
Then, you can run it like this:
```shell
su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'
```
</details>
<details>
<summary>Deploying via Docker</summary>
There's an automatically built container image for those who may wish to try - or deploy - `iocaine` via Docker. The best way to use it is likely via `docker compose`. An example of that is provided in [`data/compose.yaml`](../data/compose.yaml).
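For orientation, such a compose file takes roughly this shape - the image name, volume target, and port mapping below are assumptions, the referenced [`data/compose.yaml`](../data/compose.yaml) is the authoritative version:

```yaml
# Sketch only: image name, volume target, and ports are assumptions,
# see data/compose.yaml for the real values.
services:
  iocaine:
    image: iocaine:latest
    volumes:
      - ./container-volume:/data
    ports:
      - "42069:42069"
```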
To use it, place the word list and the training text in `data/container-volume`, and then you can simply start things up like this:
```shell
docker compose up -d
```
Voila!
</details>
<details>
<summary>Deploying on NixOS</summary>
Deploying under NixOS is made simple by the nixosModule provided by this repository's flake. It takes care of setting up a sufficiently hardened `systemd` service, so all that is required of you is to enable the service and configure the sources.
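As a sketch - the option names below are assumptions, consult the flake's nixosModule for the authoritative ones - such a configuration might look like this:

```nix
# Sketch only: option names are assumptions, check the flake's nixosModule.
{
  services.iocaine = {
    enable = true;
    config = {
      sources = {
        # The word list and markov training sources iocaine should use
        words = "/path/to/wordlist";
        markov = [ "/path/to/training-text" ];
      };
    };
  };
}
```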
</details>

## Configuring the reverse proxy

While `iocaine` itself is good at generating garbage, it does so indiscriminately. That's not what we want: we want it to generate garbage only when facing unwanted crawlers, and that's a task `iocaine` delegates to the reverse proxy. In the sections below, I will show examples for [nginx](https://nginx.org) and [Caddy](https://caddyserver.com/).
As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
### nginx
Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `nginx`. Normally, that would look something like this:
```nginx
server {
    server_name blog.example.com;

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```
To serve something different to AI user agents, the idea is to create a mapping between user agent and badness, such that AI agents evaluate to a truthy value, while anything unmatched defaults to a falsy one. We can do this with a `map` outside of the `server` block:
```nginx
map $http_user_agent $badagent {
    default   0;
    ~*gptbot  1;
    ~*chatgpt 1;
    ~*ccbot   1;
    ~*claude  1;
}
```
Within the `server` block, we'll rewrite the URL if we find a match on `$badagent`, and then proxy *that* location through to `iocaine`. The reason we need the `rewrite` is that `nginx` does not support `proxy_pass` within an `if` block. In the end, our `server` block will look like this:
```nginx
server {
    server_name blog.example.com;

    if ($badagent) {
        rewrite ^ /ai;
    }

    location /ai {
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:42069;
    }

    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
```
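Once `nginx` is reloaded, the routing can be sanity-checked by impersonating a crawler - requests with a matching user agent should receive `iocaine`'s garbage, while everything else gets the real site (the hostname here is the one from the example above):

```shell
# A normal browser user agent - served by the real site
curl -H 'User-Agent: Mozilla/5.0' https://blog.example.com/
# A user agent matching the map (case-insensitive) - served by iocaine
curl -H 'User-Agent: GPTBot/1.0' https://blog.example.com/
```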
### Caddy
Let's assume that we have a site running at `[::1]:8080`, and we want to serve that with `Caddy`. Normally, that would look something like this:
```caddyfile
blog.example.com {
    reverse_proxy [::1]:8080
}
```
To serve `iocaine`'s garbage to AI visitors, what we need is a matcher, and a matched `reverse_proxy`:
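A minimal version of that might look like this - the user-agent pattern mirrors the nginx example above, and is an illustration rather than an exhaustive list:

```caddyfile
blog.example.com {
    @ai header_regexp User-Agent (?i)(gptbot|chatgpt|ccbot|claude)

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
```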
We can do even better than this, though! We can apply rate limits using [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit)! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:
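A snippet along these lines will do - again, the user-agent pattern is illustrative, extend it to match the crawlers you care about:

```caddyfile
(ai-bots) {
    header_regexp User-Agent (?i)(gptbot|chatgpt|ccbot|claude)
}
```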
This is essentially the same thing as the `@ai` matcher, lifted out. The reason it had to be lifted out is that the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
```caddyfile
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key    {remote_host}
            events 16
            window 1m
        }
    }

    @ai {
        import ai-bots
    }

    @not-ai {
        not {
            import ai-bots
        }
    }

    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
```
This does two things: it routes AI user agents to `iocaine`, and applies a rate limit of 16 requests per minute to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests") with a `Retry-After` header, to encourage them to come back to our little maze.
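Note that `rate_limit` is not part of stock Caddy: you'll need a build that includes the plugin, for example one produced with [xcaddy](https://github.com/caddyserver/xcaddy):

```shell
xcaddy build --with github.com/mholt/caddy-ratelimit
```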