Moved the reverse proxy examples out of the README, into `docs/deploying.md`. Also wrote the reverse proxy section for the deployment docs. Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>
iocaine
is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it as its own dedicated user, and never expose it to the open Internet - always run it behind a reverse proxy.
Because half the work - the routing of AI crawlers towards iocaine
- is left up to the reverse proxy, deploying iocaine
is going to be a two-step process: the first step is to deploy iocaine
itself, and the second is to properly configure the reverse proxy.
Let's start with the first!
Deploying iocaine
How to deploy iocaine
highly depends on what kind of system you're using. Below, you will find examples for deploying with systemd
, without it, with docker
, and on NixOS, using the module this repository's flake provides. This section deals with deployment only; configuration is documented in the main README.md.
Deploying with systemd
See data/iocaine.service
for a systemd service template. To use it, install iocaine
somewhere, copy the service file to /etc/systemd/system/
, and edit it so that it references the binary you installed and the configuration file you prepared.
When done editing, you can systemctl daemon-reload
(as root, of course), followed by systemctl start iocaine
. If everything went well, you're done.
The provided systemd service tries to restrict the tool as much as possible, and uses DynamicUser=true
, meaning that no user needs to be created: systemd will take care of it.
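The steps above could be scripted roughly like this. It is only a sketch: the stage_iocaine_service helper and its optional root prefix are hypothetical, /usr/local/bin/iocaine and /etc/iocaine/config.toml are placeholder paths, and it assumes the template contains a single ExecStart= line, which may not match the shipped data/iocaine.service exactly.

```shell
#!/bin/sh
# Hypothetical helper: copy the service template under "$root/etc/systemd/system"
# and point its ExecStart= line at the chosen binary and configuration file.
# Paths containing "|" or "&" would confuse the sed expression below.
stage_iocaine_service() {
    template=$1 binary=$2 config=$3 root=${4:-}
    unit_dir="$root/etc/systemd/system"
    mkdir -p "$unit_dir"
    # Assumption: the template has exactly one ExecStart= line.
    sed "s|^ExecStart=.*|ExecStart=$binary --config-file $config|" \
        "$template" > "$unit_dir/iocaine.service"
}

# On a real system (as root), with no root prefix:
#   stage_iocaine_service data/iocaine.service \
#       /usr/local/bin/iocaine /etc/iocaine/config.toml
#   systemctl daemon-reload
#   systemctl start iocaine
#   systemctl status iocaine
```

Passing a scratch directory as the fourth argument lets you inspect the resulting unit file before touching /etc.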
Deploying without systemd
To deploy without systemd, the easiest path is to create a dedicated user:
useradd -m iocaine
Then, place the iocaine
binary and the configuration you prepared into this user's $HOME
:
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
Then, you can run it like this:
su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'
Deploying via Docker
There's an automatically built container image, for those who may wish to try - or deploy - iocaine
via Docker. The best way to use it is likely via docker compose
. An example of that is provided in data/compose.yaml
.
To use it, place the word list and the training text in data/container-volume
, and then you can simply start things up like this:
docker compose up -d
Voila!
Deploying on NixOS
Deploying under NixOS is made simple by using the nixosModule provided by this repository's flake. It takes care of setting up the systemd
service, sufficiently hardened, so all that is required of you is to enable the service, and configure the sources.
{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    iocaine = {
      url = "git+https://git.madhouse-project.org/algernon/iocaine.git";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };
  outputs = { ... }@inputs: {
    nixosConfigurations = {
      your-hostname = inputs.nixpkgs.lib.nixosSystem {
        # Make the flake inputs available to modules as the `inputs` argument.
        specialArgs = { inherit inputs; };
        modules = [
          inputs.iocaine.nixosModules.default
          (
            {
              inputs,
              lib,
              config,
              pkgs,
              ...
            }:
            {
              services.iocaine = {
                enable = true;
                config = {
                  sources = {
                    words = "${pkgs.scowl}/share/dict/wamerican.txt";
                    markov = [
                      "/some/path/to/a/training-document.txt"
                    ];
                  };
                };
              };
            }
          )
        ];
      };
    };
  };
}
Configuring the reverse proxy
While iocaine
itself is good at generating garbage, it will do so indiscriminately. That's not what we want. We want it to generate garbage only when facing unwanted crawlers, and that's a task iocaine
delegates to the reverse proxy. In the paragraphs below, I will show examples for nginx and Caddy.
As I am a recent Caddy convert, the Caddy example will be more complete - sorry!
nginx
Let's assume that we have a site running at [::1]:8080
, and we want to serve that with nginx
. Normally, that would look something like this:
server {
    server_name blog.example.com;
    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
To serve something different for AI user agents, the idea is to create a mapping between user-agent and badness, such that AI agents will evaluate to a truthy value, while unmatched agents will default to a falsy one. We can do this with a map
outside of the server
block:
map $http_user_agent $badagent {
    default 0;
    ~*gptbot 1;
    ~*chatgpt 1;
    ~*ccbot 1;
    ~*claude 1;
}
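Before reloading nginx, it can help to check which user agents such a list would catch. The sketch below approximates the case-insensitive matching of the map with grep; the route_for_agent helper and the AI_AGENTS variable are made up for illustration, and grep's extended regular expressions are not identical to the PCRE that nginx uses, although for plain substrings like these they should agree.

```shell
#!/bin/sh
# Same alternation as the map entries; the two lists need to be kept in sync by hand.
AI_AGENTS='gptbot|chatgpt|ccbot|claude'

# Print "iocaine" if the given User-Agent would be flagged, "site" otherwise.
route_for_agent() {
    if printf '%s\n' "$1" | grep -Eiq "$AI_AGENTS"; then
        echo iocaine
    else
        echo site
    fi
}

route_for_agent 'Mozilla/5.0 (compatible; GPTBot/1.1)'
route_for_agent 'Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0'
```

The first call should print iocaine and the second site, since only the first user agent contains one of the listed substrings.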
Within the server
block, we'll rewrite the URL if we find a match on $badagent
, and then proxy that location through to iocaine
. The reason we need the rewrite
is that nginx
does not support proxy_pass
within an if
block. In the end, our server
block will look like this:
server {
    server_name blog.example.com;
    if ($badagent) {
        rewrite ^ /ai;
    }
    location /ai {
        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:42069;
    }
    location / {
        proxy_set_header Host $host;
        proxy_pass http://[::1]:8080;
    }
}
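Once nginx is reloaded, a quick way to see whether the routing works is to request the same page with two different user agents and compare what comes back. This is a minimal sketch: peek_as is a made-up helper, and blog.example.com is the placeholder host used throughout this document.

```shell
#!/bin/sh
# Fetch a URL while presenting the given User-Agent, and print the first
# few lines of the response body (5 by default).
peek_as() {
    agent=$1 url=$2 lines=${3:-5}
    curl -s -A "$agent" "$url" | head -n "$lines"
}

# Example usage:
#   peek_as 'GPTBot/1.1' https://blog.example.com/     # should show generated garbage
#   peek_as 'Mozilla/5.0' https://blog.example.com/    # should show the real site
```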
Caddy
Let's assume that we have a site running at [::1]:8080
, and we want to serve that with Caddy
. Normally, that would look something like this:
blog.example.com {
    reverse_proxy [::1]:8080
}
To serve iocaine
's garbage to AI visitors, what we need is a matcher, and a matched reverse_proxy
:
blog.example.com {
    @ai {
        header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
    }
    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy [::1]:8080
}
We can do even better than this, though! We can apply rate limits using caddy-ratelimit! Unfortunately, that leads to a slightly more complex configuration with a bit of repetition, but we can mitigate that with a snippet. Let's start with that:
(ai-bots) {
    header_regexp user-agent (?i:gptbot|chatgpt|ccbot|claude)
}
This is essentially the same thing as the @ai
matcher, lifted out. It had to be lifted out because the same matcher will have to be reused in slightly differing contexts, including ones where I can't use a named matcher. It sounds more complicated than it is, really, so let me show the final result:
blog.example.com {
    rate_limit {
        zone ai-bots {
            match {
                import ai-bots
            }
            key {remote_host}
            events 16
            window 1m
        }
    }
    @ai {
        import ai-bots
    }
    @not-ai {
        not {
            import ai-bots
        }
    }
    reverse_proxy @ai 127.0.0.1:42069
    reverse_proxy @not-ai [::1]:8080
}
This does two things: it routes AI user-agents to iocaine
, and applies a rate limit of 16 requests per minute to the remote hosts these requests originate from. If the rate limit is exceeded, Caddy will return an HTTP 429 ("Too Many Requests"), with a Retry-After
header, to encourage them to come back to our little maze.
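To see the rate limit in action, one could fire a burst of requests with a bot user agent and tally the status codes that come back. The sketch below assumes nothing beyond curl and the usual shell tools; tally_statuses is a hypothetical helper, and GPTBot is only a default picked from the matcher above.

```shell
#!/bin/sh
# Send "count" requests to "url" as the given agent, then print how many
# times each HTTP status code was seen.
tally_statuses() {
    url=$1 count=$2 agent=${3:-GPTBot}
    i=0
    while [ "$i" -lt "$count" ]; do
        # -w prints only the status code; the body is discarded.
        curl -s -o /dev/null -w '%{http_code}\n' -A "$agent" "$url"
        i=$((i + 1))
    done | sort | uniq -c
}

# Example usage, sending 20 requests in quick succession:
#   tally_statuses https://blog.example.com/ 20
```

With the configuration above, requests beyond the sixteenth within a minute from the same host should be counted under 429.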