From 4604224c4d4d286d4f4e860e3c7fe7c61a4a2452 Mon Sep 17 00:00:00 2001
From: Daenney
Date: Sun, 23 Jun 2024 15:34:21 +0200
Subject: [PATCH] [chore] Update our robots.txt (#3033)

This syncs our copy with the current state of the ai.robots.txt
repository. Upstream has tightened their scope to be AI-only, whereas
before it included a bunch of SEO and "web intelligence" marketing
stuff. I've kept those but moved them into their own section.
---
 internal/web/robots.go | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/internal/web/robots.go b/internal/web/robots.go
index 58b541413..9ecf58182 100644
--- a/internal/web/robots.go
+++ b/internal/web/robots.go
@@ -34,34 +34,40 @@
 User-agent: AdsBot-Google
 User-agent: Amazonbot
 User-agent: anthropic-ai
-User-agent: Applebot
-User-agent: AwarioRssBot
-User-agent: AwarioSmartBot
+User-agent: Applebot-Extended
 User-agent: Bytespider
 User-agent: CCBot
 User-agent: ChatGPT-User
 User-agent: ClaudeBot
 User-agent: Claude-Web
 User-agent: cohere-ai
-User-agent: DataForSeoBot
+User-agent: Diffbot
 User-agent: FacebookBot
 User-agent: FriendlyCrawler
 User-agent: Google-Extended
 User-agent: GoogleOther
 User-agent: GPTBot
-User-agent: ImagesiftBot
-User-agent: magpie-crawler
-User-agent: Meltwater
+User-agent: img2dataset
 User-agent: omgili
 User-agent: omgilibot
 User-agent: peer39_crawler
 User-agent: peer39_crawler/1.0
 User-agent: PerplexityBot
-User-agent: PiplBot
-User-agent: Seekr
 User-agent: YouBot
 Disallow: /
 
+# Marketing/SEO "intelligence" data scrapers
+User-agent: AwarioRssBot
+User-agent: AwarioSmartBot
+User-agent: DataForSeoBot
+User-agent: ImagesiftBot
+User-agent: magpie-crawler
+User-agent: Meltwater
+User-agent: PiplBot
+User-agent: scoop.it
+User-agent: Seekr
+Disallow: /
+
 # Well-known.dev crawler. Indexes stuff under /.well-known.
 # https://well-known.dev/about/
 User-agent: WellKnownBot