Commit 284af56e68 (2025-02-10 08:18:54 -08:00): Do less allocating and copying when generating text
Before, on a low-capacity system (such as an inexpensive cloud host),
doing Markov-chain text generation was _extraordinarily_ slow, taking
half a second or more to produce a page, and if multiple requests came
in simultaneously they could easily swamp the capacity of such a system.

Most of the time was spent in the Words iterator, which did a bunch of
cloning of Strings in what was the hot path.

This changes the Markov generator's internal representation - now, instead
of storing Strings, it stores index-pairs into a single shared String,
normalized so that all references to particular words are collapsed into
a single pair.  This also means that the hash map is working with
fixed-size values, which can't hurt.
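
As a minimal sketch of that representation (the names and layout here are
illustrative, not iocaine's actual code), each word becomes a fixed-size
pair of byte offsets into one shared `String`, and an interning map
collapses every occurrence of a word into a single canonical pair:

```rust
use std::collections::HashMap;

/// A half-open byte range into the shared corpus: fixed-size, `Copy`,
/// and cheap to hash, unlike a heap-allocated `String`.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Substr {
    start: u32,
    end: u32,
}

#[derive(Default)]
struct Interner {
    corpus: String,                // the single shared backing String
    seen: HashMap<String, Substr>, // word -> its canonical index-pair
}

impl Interner {
    /// Returns the canonical `Substr` for `word`, appending the word to
    /// the corpus only the first time it is encountered.
    fn intern(&mut self, word: &str) -> Substr {
        if let Some(&s) = self.seen.get(word) {
            return s; // normalized: every occurrence shares one pair
        }
        let start = self.corpus.len() as u32;
        self.corpus.push_str(word);
        let s = Substr { start, end: self.corpus.len() as u32 };
        self.seen.insert(word.to_owned(), s);
        s
    }

    /// Resolve an index-pair back to the text it denotes.
    fn resolve(&self, s: Substr) -> &str {
        &self.corpus[s.start as usize..s.end as usize]
    }
}
```

Because such a pair is just two `u32`s, it is `Copy` and fixed-size, so the
hash map never has to clone or hash heap-allocated `String`s on the hot path.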

In addition, it does only one hash-map lookup per generated word on the
happy path of not reaching the end of the chain.
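
In isolation, that step might look like the following sketch, reusing the
illustrative `Substr` from above; the chain's key and value types here are
assumptions, not iocaine's real ones:

```rust
use std::collections::HashMap;

/// Assumed chain shape: a second-order chain keyed by a pair of
/// interned words, each value listing the possible follower words.
type State = (Substr, Substr);

/// One generation step: a single `get` covers both the "followers found"
/// case and the end-of-chain case, instead of a contains-then-get pair.
fn next_word<'a>(
    chain: &'a HashMap<State, Vec<Substr>>,
    state: State,
    pick: impl FnOnce(&'a [Substr]) -> Substr,
) -> Option<Substr> {
    chain.get(&state).map(|followers| pick(followers))
}
```

A `contains_key` followed by a `get` would probe the table twice; a single
`get` returning `Option` folds the end-of-chain check into the same lookup.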

The upshot of all this is that where it was taking a half-second or more
to generate a page, it now takes about 0.001 seconds.

On the downside, the initialization of WurstsalatGeneratorPro has become
rather less flexible.  Before, you created one and then taught it various
strings, or gave it a list of paths to read and teach itself from.  Now,
the _only_ way to create one is directly with a list of paths.  Changing
this is possible, but it means `Substr` would have to learn to distinguish
which source data it came from, which would mean a likely 50% increase in
its size.  It didn't seem worth it to preserve that capability, which
wasn't even being used.
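
A constructor with the shape this paragraph describes might look like the
following sketch (`MarkovCorpus` and its internals are hypothetical
stand-ins for `WurstsalatGeneratorPro`, not its real code):

```rust
use std::{fs, io, path::Path};

struct MarkovCorpus {
    corpus: String, // every source file, concatenated into one String
}

impl MarkovCorpus {
    /// The only way in: build directly from a list of paths. Teaching an
    /// existing instance additional text is no longer supported.
    fn from_paths<P: AsRef<Path>>(paths: &[P]) -> io::Result<Self> {
        let mut corpus = String::new();
        for p in paths {
            corpus.push_str(&fs::read_to_string(p)?);
            corpus.push('\n'); // keep adjacent sources from fusing words
        }
        // Tokenizing `corpus` into index-pairs and building the chain
        // would follow here.
        Ok(MarkovCorpus { corpus })
    }
}
```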
| Path | Last commit | Date |
| --- | --- | --- |
| .forgejo/workflows | ci: Update to use cellar | 2025-02-10 14:20:37 +01:00 |
| data | data: Update the grafana dashboard | 2025-02-09 09:48:28 +01:00 |
| docs | metrics: No need to set instance | 2025-02-06 22:01:54 +01:00 |
| LICENSES | Initial import | 2025-01-16 10:44:56 +01:00 |
| nix | Add a number of tests | 2025-01-30 09:18:35 +01:00 |
| src | Do less allocating and copying when generating text | 2025-02-10 08:18:54 -08:00 |
| templates | Make templating actually useful | 2025-01-29 00:20:21 +01:00 |
| tests | Make the metrics always available | 2025-02-07 15:07:40 +01:00 |
| .envrc | Initial import | 2025-01-16 10:44:56 +01:00 |
| .gitattributes | .gitattributes: Try to mark markdown files detectable | 2025-01-25 12:40:00 +01:00 |
| .gitignore | Move documentation to a dedicated site | 2025-01-25 01:31:38 +01:00 |
| .gitmodules | Move documentation to a dedicated site | 2025-01-25 01:31:38 +01:00 |
| Cargo.lock | Implement Prometheus-compatible, optional metrics | 2025-02-05 02:36:13 +01:00 |
| Cargo.toml | Make the metrics always available | 2025-02-07 15:07:40 +01:00 |
| flake.lock | flake.lock: Update | 2025-02-09 11:17:25 +01:00 |
| flake.nix | nix: Add zstd to the devshell | 2025-01-29 08:23:40 +01:00 |
| README.md | Move documentation to a dedicated site | 2025-01-25 01:31:38 +01:00 |
| REUSE.toml | docs: Add a HOWTO about monitoring | 2025-02-05 11:26:49 +01:00 |

iocaine

[Badges: Build status · Container image · Demo · Documentation]

The deadliest poison known to AI.

This is a tarpit, modeled after Nepenthes, intended to catch unwelcome web crawlers, but with a slightly different, more aggressive intended usage scenario. The core idea is to configure a reverse proxy to serve iocaine-generated content to AI crawlers, but normal content to every other visitor. This differs from Nepenthes, where the idea is to link to the tarpit and trap crawlers that way; with iocaine, the trap is laid by the reverse proxy itself.

iocaine does not try to slow crawlers down or waste their time that way; that job is left to the reverse proxy. iocaine is purely about generating garbage.
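
For illustration only, here is a generic nginx sketch of that split. It is not taken from iocaine's documentation, and the user-agent patterns, ports, and addresses are all assumptions:

```nginx
# Send suspected AI crawlers to iocaine, everyone else to the real backend.
map $http_user_agent $ai_crawler {
    default                                 0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider)" 1;
}

server {
    listen 80;
    server_name example.com;

    location / {
        # Hypothetical addresses: iocaine on 42069, the real site on 8080.
        if ($ai_crawler) {
            proxy_pass http://127.0.0.1:42069;
        }
        proxy_pass http://127.0.0.1:8080;
    }
}
```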

For more information about what this is, how it works, and how to deploy it, have a look at the dedicated website.

Let's make AI poisoning the norm. If we all do it, they won't have anything to crawl.