[feature] Block a bunch of "AI" crawlers (#2239)

* [feature] Block Google Bard/AI crawlers
* [feature] Block the other OpenAI crawler
* [feature] Block Common Crawl crawler
  This is used in research, but also gleefully advertises itself as the training source used in all LLMs and GPT-3.
  Fixes: #2240
* [feature] Block Omgilibot
  Used by some shady big web data engine company.
* [feature] Block Meta's language model crawler
* [feature] Block well-known.dev crawler
2024-11-22 11:46:40 +00:00 · 2023-09-30 21:44:57 +02:00 · 2023-09-30 21:44:57 +02:00 · 0cce2c0838
parent 2b6b9cdf83
commit 0cce2c0838
1 changed file with 30 additions and 0 deletions
--- a/internal/web/robots.go
+++ b/internal/web/robots.go
@@ -34,6 +34,36 @@
 User-agent: GPTBot
 Disallow: /

+# As of September 2023, GPTBot and ChatGPT-User are equivalent. But there's no telling
+# when OpenAI might decide to change that, so block this one too.
+User-agent: ChatGPT-User
+Disallow: /
+
+# And a giant fuck you to Google Bard and their other generative AI ventures too.
+# https://developers.google.com/search/docs/crawling-indexing/overview-google-crawlers
+User-agent: Google-Extended
+Disallow: /
+
+# Block CommonCrawl. Used in training LLMs and specifically GPT-3.
+# https://commoncrawl.org/faq
+User-agent: CCBot
+Disallow: /
+
+# Block Omgili/Webz.io, a "Big Web Data" engine.
+# https://webz.io/blog/web-data/what-is-the-omgili-bot-and-why-is-it-crawling-your-website/
+User-agent: Omgilibot
+Disallow: /
+
+# Block FacebookBot, because Meta.
+# https://developers.facebook.com/docs/sharing/bot
+User-agent: FacebookBot
+Disallow: /
+
+# Well-known.dev crawler. Indexes stuff under /.well-known.
+# https://well-known.dev/about/
+User-agent: WellKnownBot
+Disallow: /
+
 # Rules for everything else.
 User-agent: *
 Crawl-delay: 500