iocaine/docs/deploying.md
Gergely Nagy fe7cf44f26
Another round of documentation updates
Moved the reverse proxy examples out of the README, into
`docs/deploying.md`. Also wrote the reverse proxy section for the
deployment docs.

Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
2025-01-17 01:28:17 +01:00


iocaine is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it as a dedicated user, and never expose it to the open Internet - always run it behind a reverse proxy.

Because half the work - the routing of AI crawlers towards iocaine - is left up to the reverse proxy, deploying iocaine is a two-step process: first deploy iocaine itself, then configure the reverse proxy to route crawlers to it.

Let's start with the first!

Deploying iocaine

How to deploy iocaine depends heavily on what kind of system you're using. Below, you will find examples for deploying with systemd, without it, with Docker, and on NixOS, using the module this repository's flake provides. This section deals with deployment only; configuration is documented in the main README.md.

Deploying with systemd

See data/iocaine.service for a systemd service template. To use it, install iocaine somewhere, copy the service file to /etc/systemd/system/, and edit it so that it references the binary you installed and the configuration file you prepared.

When done editing, you can systemctl daemon-reload (as root, of course), followed by systemctl start iocaine. If everything went well, you're done.

The provided systemd service tries to restrict the tool as much as possible, and uses DynamicUser=true, meaning that no user needs to be created by hand: systemd will take care of it.

Deploying without systemd

To deploy without systemd, the easiest path is to create a dedicated user:

useradd -m iocaine

Then, place the iocaine binary and the configuration you prepared into this user's $HOME:

mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/

Then, you can run it like this:

su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'

Deploying via Docker

There's an automatically built container image, for those who may wish to try - or deploy - iocaine via Docker. The best way to use it is likely via docker compose. An example of that is provided in data/compose.yaml.

To use it, place the word list and the training text in data/container-volume, and then you can simply start things up like this:

docker compose up -d

Voila!

Deploying on NixOS

Deploying under NixOS is made simple by using the nixosModule provided by this repository's flake. It takes care of setting up the systemd service, sufficiently hardened, so all that is required of you is to enable the service, and configure the sources.

{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    iocaine = {
      url = "git+https://git.madhouse-project.org/algernon/iocaine.git";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { ... }@inputs: {
    nixosConfigurations = {
      your-hostname = inputs.nixpkgs.lib.nixosSystem {
        # Make the flake inputs available to modules that ask for `inputs`.
        specialArgs = { inherit inputs; };
        modules = [
          inputs.iocaine.nixosModules.default
          (
            {
              inputs,
              lib,
              config,
              pkgs,
              ...
            }:
            {
              services.iocaine = {
                enable = true;
                config = {
                  sources = {
                    words = "${pkgs.scowl}/share/dict/wamerican.txt";
                    markov = [
                      "/some/path/to/a/training-document.txt"
                    ];
                  };
                };
              };
            }
          )
        ];
      };
    };
  };
}

Configuring the reverse proxy

While iocaine itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task iocaine delegates to the reverse proxy. In the paragraphs below, I will show examples for nginx and Caddy.

As I am a recent Caddy convert, the Caddy example will be more complete - sorry!

nginx

Let's assume that we have a site running at [::1]:8080, and we want to serve it through nginx. Normally, that would look something like this:

server {
  server_name blog.example.com;

  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}

To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents will evaluate to a truthy value, while unmatched agents will default to a falsy one. We can do this with a map outside of the server block:

map $http_user_agent $badagent {
  default   0;
  ~*gptbot  1;
  ~*chatgpt 1;
  ~*ccbot   1;
  ~*claude  1;
}
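
Before reloading nginx, it can help to sanity-check which user agents these patterns would catch. What follows is a rough shell sketch that mimics the map above. Note that bad_agent is a hypothetical helper name, not something nginx or iocaine provides, and it approximates nginx's case-insensitive regex matching with plain substring tests:

```shell
#!/bin/sh
# Hypothetical helper mirroring the nginx map: prints 1 for user agents
# that would be routed to iocaine, and 0 for everything else.
bad_agent() {
  # Lowercase the input so the match is case-insensitive, like nginx's ~*.
  ua=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$ua" in
    *gptbot*|*chatgpt*|*ccbot*|*claude*) echo 1 ;;
    *) echo 0 ;;
  esac
}

bad_agent "Mozilla/5.0 (compatible; GPTBot/1.0)"          # prints 1
bad_agent "Mozilla/5.0 (X11; Linux x86_64) Firefox/134.0" # prints 0
```

The user-agent strings here are only examples; the list of patterns is the same one used in the map.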

Within the server block, we'll rewrite the URL if we find a match on $badagent, and then proxy that location through to iocaine. The reason we need the rewrite is that nginx does not allow proxy_pass directly inside a server-level if block. In the end, our server block will look like this:

server {
  server_name blog.example.com;
  if ($badagent) {
    rewrite ^ /ai;
  }
  location /ai {
    proxy_set_header Host $host;
    proxy_pass http://127.0.0.1:42069;
  }
  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
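
Once the configuration is loaded, one way to check the routing is to request the same URL with a crawler-like user agent and with a browser-like one, and compare what comes back. Below is a minimal sketch, assuming curl is installed. The names fetch_as and routing_differs are hypothetical helpers, and the user-agent strings are just examples:

```shell
#!/bin/sh
# fetch_as UA URL: print the response body as seen by the given user agent.
fetch_as() {
  curl --silent --show-error --user-agent "$1" "$2"
}

# routing_differs URL: succeeds if a crawler-like user agent and a
# browser-like one receive different response bodies from the same URL.
# Returns 2 if either request fails.
routing_differs() {
  bot=$(fetch_as "GPTBot/1.0" "$1") || return 2
  human=$(fetch_as "Mozilla/5.0 (X11; Linux x86_64) Firefox/134.0" "$1") || return 2
  [ "$bot" != "$human" ]
}

# Example use, against the site from the configuration above:
#   routing_differs https://blog.example.com/ && echo "crawlers are diverted"
```

Keep in mind that this is a crude check: a page with dynamic content may differ between two requests even without any diversion, so it is worth looking at the crawler's response by hand as well.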

Caddy

Let's assume that we have a site running at [::1]:8080, and we want to serve it through Caddy. Normally, that would look something like this:

blog.example.com {
  reverse_proxy [::1]:8080
}

To serve iocaine's garbage to AI visitors, what we need is a matcher, and a matched reverse_proxy:

blog.example.com {
  @ai {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
  }
  reverse_proxy @ai 127.0.0.1:42069
  reverse_proxy [::1]:8080
}

We can do even better than this, though! We can apply rate limits using caddy-ratelimit! Unfortunately, that leads to a slightly more complex configuration, involving a bit of repetition, but one we can mitigate with a snippet. Let's start with that:

(ai-bots) {
  header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}

This is essentially the same thing as the @ai matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:

blog.example.com {
  rate_limit {
    zone ai-bots {
      match {
        import ai-bots
      }
      key {remote_host}
      events 16
      window 1m
    }
  }
  
  @ai {
    import ai-bots
  }
  @not-ai {
    not {
      import ai-bots
    }
  }
  
  reverse_proxy @ai 127.0.0.1:42069
  reverse_proxy @not-ai [::1]:8080
}

This does two things: it routes AI user-agents to iocaine, and applies a rate limit of 16 requests per minute to the remote hosts these originated from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a Retry-After header, to encourage them to come back to our little maze.
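
To see the rate limit in action, one option is to send a burst of requests with a crawler-like user agent and count how many come back as 429. This is a rough sketch, assuming curl is installed. The name count_limited is a hypothetical helper, and the user agent is just an example that matches the ai-bots snippet:

```shell
#!/bin/sh
# count_limited URL TOTAL: send TOTAL requests to URL with a crawler-like
# user agent, and print how many of them were answered with HTTP 429.
count_limited() {
  url=$1
  total=$2
  limited=0
  i=0
  while [ "$i" -lt "$total" ]; do
    # Discard the body, keep only the status code.
    code=$(curl --silent --output /dev/null --write-out '%{http_code}' \
      --user-agent "GPTBot/1.0" "$url")
    if [ "$code" = "429" ]; then
      limited=$((limited + 1))
    fi
    i=$((i + 1))
  done
  echo "$limited"
}

# Example use, against the site from the configuration above:
#   count_limited https://blog.example.com/ 20
```

With the limit of 16 events per minute shown above, a burst of 20 requests from a single host should see the last few rejected, provided no earlier requests from that host are still counted in the window.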