mirror of https://git.madhouse-project.org/algernon/iocaine.git
synced 2025-01-22 18:46:17 +01:00

Major documentation update

This adds plenty of documentation: describing what iocaine is, how it does what it does, and how to deploy it in various scenarios. Not complete yet, but reasonably detailed nevertheless.

Signed-off-by: Gergely Nagy <me@gergo.csillger.hu>

parent 65cc8365d6
commit 7b2e472e35

6 changed files with 309 additions and 1 deletions

133 README.md
@@ -1,4 +1,137 @@
iocaine
=======

[![Build status][ci:badge]][ci:url]
[![Container image][oci:badge]][oci:url]
[![Demo][demo:badge]][demo:url]

 [ci:badge]: https://git.madhouse-project.org/algernon/iocaine/actions/workflows/build.yaml/badge.svg?style=for-the-badge&label=CI
 [ci:url]: https://git.madhouse-project.org/algernon/iocaine/actions/workflows/build.yaml/runs/latest
 [oci:badge]: https://img.shields.io/badge/container-latest-blue?style=for-the-badge
 [oci:url]: https://git.madhouse-project.org/algernon/-/packages/container/iocaine/latest
 [demo:badge]: https://img.shields.io/badge/demo-iocaine-seagreen?style=for-the-badge
 [demo:url]: https://poison.madhouse-project.org/

> The deadliest poison known to AI.

This is a tarpit, modeled after [Nepenthes](https://zadzmo.org/code/nepenthes/), intended to catch unwelcome web crawlers, but with a slightly different, more aggressive intended usage scenario. The core idea is to configure a reverse proxy to serve content generated by `iocaine` to AI crawlers, but normal content to every other visitor. This differs from Nepenthes, where the idea is to link to it, and trap crawlers that way. With `iocaine`, the trap is laid by the reverse proxy instead.

`iocaine` does not try to slow crawlers down, nor does it try to waste their time that way: that is left up to the reverse proxy. `iocaine` is *purely* about generating garbage.

To give you an idea, here's a simple Caddy configuration:

<a id="caddy-example"></a>
``` caddyfile
(poison-ai) {
    @ai {
        header_regexp user-agent (?i:GPTBot|ChatGPT|CCBot|Claude)
    }
    reverse_proxy @ai 127.0.0.1:42069
}

blog.example.com {
    import poison-ai
    reverse_proxy [::1]:8080
}
```

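To make the routing decision in the Caddy snippet easier to reason about, here is a minimal Python sketch of the same check. It is an illustration only, not part of `iocaine` or Caddy: the function names `is_ai_crawler` and `pick_upstream` are made up for this example, and it assumes the regular expression is searched anywhere in the `User-Agent` header.

```python
import re

# Same alternation as the @ai matcher above; (?i:...) makes it case-insensitive.
AI_AGENTS = re.compile(r"(?i:GPTBot|ChatGPT|CCBot|Claude)")


def is_ai_crawler(user_agent: str) -> bool:
    """Return True if this User-Agent would be routed to the tarpit."""
    # search() rather than match(): the token may appear anywhere in the header.
    return AI_AGENTS.search(user_agent) is not None


def pick_upstream(user_agent: str) -> str:
    """Mirror the proxy's choice: tarpit for crawlers, the real site otherwise."""
    return "127.0.0.1:42069" if is_ai_crawler(user_agent) else "[::1]:8080"
```

A check like this can be handy for trying out a list of user agents before touching the reverse proxy configuration: anything for which `is_ai_crawler` returns `True` would never see the real site.
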
If you wish to know more about how this works, see [docs/how-it-works.md](docs/how-it-works.md), or visit the [demo][demo:url] to see it in action.

## Warning

This is deliberately malicious software, intended to cause harm. Do not deploy it if you aren't fully comfortable with what you are doing. LLM scrapers are relentless and brutal: they *will* place additional burden on your server, even if you only serve static content. With `iocaine`, even more computing power is going to be used. It's *highly* recommended to implement rate limits at the reverse proxy level, such as with the [caddy-ratelimit](https://github.com/mholt/caddy-ratelimit) plugin, if using Caddy.

Entrapment is done by the reverse proxy. Anything that ends up being served by `iocaine` will be trapped there: there are no outgoing links. Be careful what you route towards it.

## Usage

Expected usage is to hide the tarpit behind a reverse proxy like `nginx` or `Caddy`, and delegate the trapping to them. An example for Caddy has been [shown above](#caddy-example); for `nginx`, something similar might look like this:

``` nginx
# Flag requests whose User-Agent matches a known AI crawler (case-insensitive).
map $http_user_agent $badagent {
  default 0;
  ~*gptbot 1;
  ~*chatgpt 1;
  ~*ccbot 1;
  ~*claude 1;
}

server {
  server_name blog.example.com;

  # Send flagged requests to the tarpit location.
  if ($badagent) {
    rewrite ^ /ai;
  }

  location /ai {
    # proxy_pass needs a scheme in front of the address.
    proxy_pass http://127.0.0.1:42069;
  }

  location / {
    proxy_set_header Host $host;
    proxy_pass http://[::1]:8080;
  }
}
```

## Installation

`cargo install --path .`

Or, if you prefer Docker, an [image][oci:url] is available. If you're on NixOS, this repository is a flake, and provides a NixOS module to help deploy it. See [here](https://pages.madhouse-project.org/algernon/infrastructure.org/eru_services_iocaine) for how to use that.

For more detailed instructions, see [docs/deploying.md](docs/deploying.md).

## Configuration

`iocaine` can be configured via a TOML-format configuration file, or via the environment. Almost everything has sane defaults, but providing a wordlist and at least one source for the markov generator is **required**.

The configuration file is split into three main sections: [`[server]`](#server), [`[sources]`](#sources), and [`[generator]`](#generator).

### `[server]`

The `[server]` section is used to configure the address and port the server will listen on, via the `bind` property. The default is shown below:

``` toml
[server]
bind = "127.0.0.1:42069"
```

This parameter is available as `IOCAINE_SERVER__BIND` when configuring via environment variables.

### `[sources]`

The `[sources]` section is the only section without defaults: specifying both options here is mandatory.

``` toml
[sources]
words = "/usr/share/dict/wamerican.txt"
markov = ["/var/lib/iocaine/markov/bee-movie.txt", "/var/lib/iocaine/markov/moby-dick.txt"]
```

The first option, `words`, refers to a word list file, with one word per line. When generating links, the *path* of the link will be a word chosen from this word list.

The second option, `markov`, is a list of files used to train the markov chain generator. These will be used to generate the main content.

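For readers unfamiliar with markov chain text generation, the following Python sketch shows the general technique of training on a text and then walking the chain. It is a simplified, first-order illustration under my own assumptions, not `iocaine`'s actual generator; the `train` and `generate` names are hypothetical.

```python
import random
from collections import defaultdict


def train(text: str) -> dict:
    """Build a first-order chain: each word maps to the words seen right after it."""
    words = text.split()
    chain = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return dict(chain)


def generate(chain: dict, rng: random.Random, length: int) -> str:
    """Walk the chain for `length` words, restarting when a word has no successor."""
    if not chain or length <= 0:
        return ""
    starts = sorted(chain)  # sorted, so the candidate order is deterministic
    word = rng.choice(starts)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)
```

Because every word in the output comes from the training text, the result tends to look superficially plausible while carrying no meaning. Passing in a seeded `random.Random` makes the output repeatable.
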
These parameters are available as `IOCAINE_SOURCES__WORDS` and `IOCAINE_SOURCES__MARKOV`, respectively, when configuring via environment variables. Note that when configuring `iocaine` this way, the `IOCAINE_SOURCES__MARKOV` environment variable *must* be a TOML list: `IOCAINE_SOURCES__MARKOV='["/var/lib/iocaine/markov/bee-movie.txt"]'`.

### `[generator]`

The `[generator]` section describes how garbage is generated: how many paragraphs are produced per page, how many words they may have, how many links to place, and whether to add a "Back" link at the top. It looks like this, with defaults shown:

``` toml
[generator.markov.paragraphs]
min = 1
max = 1

[generator.markov.words]
min = 10
max = 420

[generator.links]
min = 2
max = 5
backlink = true
```

When configuring through environment variables, these settings are available via `IOCAINE_GENERATOR__MARKOV__PARAGRAPHS__MIN`, `IOCAINE_GENERATOR__MARKOV__PARAGRAPHS__MAX`, `IOCAINE_GENERATOR__MARKOV__WORDS__MIN`, `IOCAINE_GENERATOR__MARKOV__WORDS__MAX`, `IOCAINE_GENERATOR__LINKS__MIN`, `IOCAINE_GENERATOR__LINKS__MAX`, and `IOCAINE_GENERATOR__LINKS__BACKLINK`, respectively.

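The environment variable names in this section follow a visible pattern: an `IOCAINE_` prefix, then the TOML path with a double underscore between levels. The Python sketch below illustrates that naming convention only. It is not how `iocaine` loads its configuration, the `env_to_config` helper is hypothetical, and values are left as raw strings.

```python
def env_to_config(environ: dict, prefix: str = "IOCAINE_") -> dict:
    """Turn PREFIX_A__B__C=value pairs into {"a": {"b": {"c": value}}}."""
    config: dict = {}
    for name, value in environ.items():
        if not name.startswith(prefix):
            continue  # unrelated environment variable
        # Double underscores separate nesting levels; keys are lower-cased.
        *parents, leaf = name[len(prefix):].lower().split("__")
        node = config
        for key in parents:
            node = node.setdefault(key, {})
        node[leaf] = value  # kept as a raw string in this sketch
    return config
```

Under this scheme a single underscore does not start a new level, so a name like `IOCAINE_GENERATOR__LINKS_MAX` would end up as a `links_max` key rather than `links.max`.
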
## License & copyright

`iocaine` is © 2025 Gergely Nagy, with code adapted from [lipsum](https://github.com/mgeisler/lipsum) by [Martin Geisler](https://github.com/mgeisler), and is released under the [MIT](LICENSES/MIT.txt) license. A lot of `iocaine` has been inspired by [Nepenthes](https://zadzmo.org/code/nepenthes/), but shares no code with it, just ideas.

@@ -15,7 +15,7 @@ SPDX-PackageDownloadLocation = "https://git.madhouse-project.org/algernon/iocain
 SPDX-License-Identifier = "MIT"

 [[annotations]]
-path = ["README.md"]
+path = ["README.md", "docs/*.md"]
 precedence = "aggregate"
 SPDX-FileCopyrightText = "2025 Gergely Nagy"
 SPDX-License-Identifier = "MIT"

16 data/compose.yaml Normal file
@@ -0,0 +1,16 @@
# SPDX-FileCopyrightText: 2025 Gergely Nagy
# SPDX-FileContributor: Gergely Nagy
#
# SPDX-License-Identifier: MIT
services:
  iocaine:
    image: git.madhouse-project.org/algernon/iocaine:latest
    restart: unless-stopped
    ports:
      - '127.0.0.1:42069:42069'
    volumes:
      - ./container-volume:/data
    environment:
      - IOCAINE_SERVER__BIND="0.0.0.0:42069"
      - IOCAINE_SOURCES__WORDS="/data/wordlist.txt"
      - IOCAINE_SOURCES__MARKOV=["/data/training-text.txt"]

35 data/iocaine.service Normal file
@@ -0,0 +1,35 @@
# SPDX-FileCopyrightText: 2025 Gergely Nagy
# SPDX-FileContributor: Gergely Nagy
#
# SPDX-License-Identifier: MIT
[Unit]
After=network.target
Description=iocaine, the deadliest poison known to AI

[Service]
Type=simple
ExecStart=/usr/local/bin/iocaine --config-file /usr/local/etc/iocaine/config.toml

DevicePolicy=closed
DynamicUser=true
LockPersonality=true
MemoryDenyWriteExecute=true
NoNewPrivileges=true
PrivateDevices=true
PrivateTmp=true
ProtectClock=true
ProtectControlGroups=true
ProtectHostname=true
ProtectKernelModules=true
ProtectKernelTunables=true
ProtectProc=invisible
ProtectSystem=strict
Restart=on-failure
RestrictAddressFamilies=AF_INET
RestrictAddressFamilies=AF_INET6
RestrictNamespaces=true
RestrictRealtime=true
SystemCallArchitectures=native

[Install]
WantedBy=multi-user.target

111 docs/deploying.md Normal file
@@ -0,0 +1,111 @@
`iocaine` is a single binary, and apart from an optional configuration file, a wordlist, and some sources for its markov generator, there's nothing else it needs. It has no persistent state, no database, and writes nothing to disk. Nevertheless, it is a good idea to run it as its own dedicated user, and never expose it to the open Internet: always run it behind a reverse proxy.

Because half the work (the routing of AI crawlers towards `iocaine`) is left up to the reverse proxy, deploying `iocaine` is a two-step process: the first step is to deploy `iocaine` itself, the second is to properly configure the reverse proxy.

Let's start with the first!

## Deploying `iocaine`

How to deploy `iocaine` depends heavily on what kind of system you're using. Below, you will find examples for deploying with `systemd`, without it, with `docker`, and on NixOS, using the module this repository's flake provides. This section deals with deployment only; configuration is documented in the main [README.md](../README.md#configuration).

<details>
<summary>Deploying with <code>systemd</code></summary>

See [`data/iocaine.service`](../data/iocaine.service) for a systemd service template. To use it, install `iocaine` somewhere, copy the service file to `/etc/systemd/system/`, and edit it so it references the binary you installed and the configuration file you prepared.

When done editing, run `systemctl daemon-reload` (as root, of course), followed by `systemctl start iocaine`. If everything went well, you're done.

The provided systemd service tries to restrict the tool as much as possible, and uses `DynamicUser=true`, meaning that no user needs to be created: systemd will take care of it.

</details>

<details>
<summary>Deploying without <code>systemd</code></summary>

To deploy without systemd, the easiest path is to create a dedicated user:

```shell
useradd -m iocaine
```

Then, place the `iocaine` binary and the configuration you prepared into this user's home directory:

```shell
# Use the new user's home explicitly; $HOME would point at the invoking user's.
mkdir -p /home/iocaine/iocaine
cp iocaine config.toml /home/iocaine/iocaine/
```

Then, you can run it like this:

```shell
# su takes the user name as an argument and the command via -c.
su -l iocaine -c '/home/iocaine/iocaine/iocaine --config-file /home/iocaine/iocaine/config.toml'
```

</details>

<details>
<summary>Deploying via Docker</summary>

There's an automatically built container image, for those who may wish to try - or deploy - `iocaine` via Docker. The best way to use it is likely via `docker compose`. An example of that is provided in [`data/compose.yaml`](../data/compose.yaml).

To use it, place the word list and the training text in `data/container-volume`, and then you can simply start things up like this:

```shell
docker compose up -d
```

Voila!

</details>

<details>
<summary>Deploying on NixOS</summary>

Deploying under NixOS is made simple by the nixosModule provided by this repository's flake. It takes care of setting up the `systemd` service, sufficiently hardened, so all that is required of you is to enable the service and configure the sources.

```nix
{
  inputs = {
    nixpkgs.url = "github:nixos/nixpkgs/nixos-unstable";
    iocaine = {
      url = "git+https://git.madhouse-project.org/algernon/iocaine.git";
      inputs.nixpkgs.follows = "nixpkgs";
    };
  };

  outputs = { ... }@inputs: {
    nixosConfigurations = {
      your-hostname = inputs.nixpkgs.lib.nixosSystem {
        # Pass the flake inputs on to the modules below.
        specialArgs = {
          inherit inputs;
        };
        modules = [
          inputs.iocaine.nixosModules.default
          (
            {
              inputs,
              lib,
              config,
              pkgs,
              ...
            }:
            {
              services.iocaine = {
                enable = true;
                config = {
                  sources = {
                    words = "${pkgs.scowl}/share/dict/wamerican.txt";
                    markov = [
                      "/some/path/to/a/training-document.txt"
                    ];
                  };
                };
              };
            }
          )
        ];
      };
    };
  };
}
```

</details>
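
Whichever route you took, a quick smoke test can confirm that the service answers. The Python sketch below is one hedged way to do that, assuming `iocaine` listens on the default `127.0.0.1:42069` and relying on the documented property that a given page always renders the same. The helper names `build_request` and `renders_stably` are made up for this example.

```python
import urllib.request


def build_request(base_url: str, path: str, host: str, user_agent: str):
    """Prepare a GET request that looks like a crawler visiting `host`."""
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        headers={"Host": host, "User-Agent": user_agent},
    )


def renders_stably(base_url: str, path: str, host: str, user_agent: str = "GPTBot") -> bool:
    """Fetch the same page twice and report whether both bodies are identical."""
    bodies = []
    for _ in range(2):
        request = build_request(base_url, path, host, user_agent)
        with urllib.request.urlopen(request, timeout=5) as response:
            bodies.append(response.read())
    return bodies[0] == bodies[1]
```

For example, `renders_stably("http://127.0.0.1:42069", "/some/path/", "blog.example.com")` should return `True` against a running instance if pages are stable as described in the documentation.
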
13 docs/how-it-works.md Normal file
@@ -0,0 +1,13 @@
# How does `iocaine` work?

The goal of `iocaine` is to generate a stable, infinite maze of garbage. Each page will be randomly generated, but as long as the configuration (and the training data) remains the same, each individual page will always render the same. Because `iocaine` is expected to work behind a reverse proxy, to shadow the real content when facing unwanted crawlers, it will generate different pages for different hosts, even if the path is the same.

This means that if `iocaine` is set up to shadow both `site1.example.com` and `site2.example.com`, then `https://site1.example.com/some/path/` and `https://site2.example.com/some/path/` will render *different* content, but each visit to either will render the same.

This is accomplished by seeding the random number generator with a number derived from the SHA256 digest of the original request URL. This also means that you can deploy multiple `iocaine` nodes and load-balance between them, if need be, and the output will be stable as long as the configuration and training data are the same between nodes.

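The seeding idea can be sketched in a few lines of Python. This is an illustration of the technique, not `iocaine`'s code: how the digest is turned into a number is not specified here, so taking its first 8 bytes is an assumption, as are the `seed_for` and `page_rng` names.

```python
import hashlib
import random


def seed_for(host: str, path: str) -> int:
    """Derive a 64-bit seed from the SHA-256 digest of the request URL."""
    digest = hashlib.sha256(f"{host}{path}".encode("utf-8")).digest()
    # Assumption: the first 8 bytes of the digest serve as the seed.
    return int.from_bytes(digest[:8], "big")


def page_rng(host: str, path: str) -> random.Random:
    """Return a random number generator that is fully determined by host and path."""
    return random.Random(seed_for(host, path))
```

Two nodes that call `page_rng("site1.example.com", "/some/path/")` get identical random streams without sharing any state, which is what makes load-balancing possible; changing only the host yields a different stream.
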
For a number of reasons, `iocaine` uses a different seed for the markov-chain generated text, for the generated link URLs, and the generated link texts. All three are based on the original URL, though.

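One way to get three distinct seeds from a single URL is to mix a purpose label into the hashed input, as in the Python sketch below. This is a guess at a workable approach, not a description of what `iocaine` does: the labels, the `purpose_rng` helper, and the way the label is combined with the URL are all assumptions.

```python
import hashlib
import random


def purpose_rng(url: str, purpose: str) -> random.Random:
    """Return an RNG tied to both the request URL and what it is used for."""
    # Assumption: the purpose label is prepended to the URL before hashing.
    digest = hashlib.sha256(f"{purpose}:{url}".encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))


url = "https://site1.example.com/some/path/"
text_rng = purpose_rng(url, "text")            # markov-chain generated text
link_url_rng = purpose_rng(url, "link-url")    # generated link URLs
link_text_rng = purpose_rng(url, "link-text")  # generated link texts
```

A practical benefit of separate streams, in this sketch at least, is that drawing more or fewer words for the text does not shift which links are generated for the same page.
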
Such seeding is, of course, not secure. But we do not need security here, we need each page to render in a stable way. If there's a collision, it's no big deal: we might end up with a mostly identical page, but we'll remain in the infinite maze nevertheless.

Every page has the same structure: an optional "back" link (which just points to `../`), followed by a number of markov-chain generated paragraphs of varying length, and an unordered list of links at the bottom. Each link is relative to the current page, has a random URI, and random text, too.

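The page layout described above can be sketched as follows, in Python. The exact markup `iocaine` emits is not shown in this document, so the HTML here is an assumption, as are the `render_page` name and the choice of one to three words per link text. The default link range of 2 to 5 matches the documented `[generator.links]` defaults.

```python
import html
import random
from urllib.parse import quote


def render_page(rng, words, paragraphs, link_range=(2, 5), backlink=True):
    """Assemble a page: optional back link, paragraphs, then a list of links."""
    parts = []
    if backlink:
        parts.append('<a href="../">Back</a>')
    for text in paragraphs:
        parts.append(f"<p>{html.escape(text)}</p>")
    items = []
    for _ in range(rng.randint(*link_range)):
        target = quote(rng.choice(words))  # relative path taken from the word list
        # Assumption: link text is one to three random words.
        label = " ".join(rng.choice(words) for _ in range(rng.randint(1, 3)))
        items.append(f'<li><a href="{target}/">{html.escape(label)}</a></li>')
    parts.append("<ul>\n" + "\n".join(items) + "\n</ul>")
    return "\n".join(parts)
```

Since every link is relative and points at another generated page, a crawler that follows them only goes deeper into the maze; the back link leads one level up, which is itself a generated page.
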