gotosocial/internal/web/robots.go

// GoToSocial
// Copyright (C) GoToSocial Authors admin@gotosocial.org
// SPDX-License-Identifier: AGPL-3.0-or-later
//
// This program is free software: you can redistribute it and/or modify
// it under the terms of the GNU Affero General Public License as published by
// the Free Software Foundation, either version 3 of the License, or
// (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU Affero General Public License for more details.
//
// You should have received a copy of the GNU Affero General Public License
// along with this program.  If not, see <http://www.gnu.org/licenses/>.

package web

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

const (
	robotsPath          = "/robots.txt"
	robotsMetaAllowSome = "nofollow, noarchive, nositelinkssearchbox, max-image-preview:standard" // https://developers.google.com/search/docs/crawling-indexing/robots-meta-tag#robotsmeta
	robotsTxt           = `# GoToSocial robots.txt -- to edit, see internal/web/robots.go
# More info @ https://developers.google.com/search/docs/crawling-indexing/robots/intro

# AI scrapers and the like.
# https://github.com/ai-robots-txt/ai.robots.txt/
User-agent: AdsBot-Google
User-agent: Amazonbot
User-agent: anthropic-ai
User-agent: Applebot-Extended
User-agent: Bytespider
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: ClaudeBot
User-agent: Claude-Web
User-agent: cohere-ai
User-agent: Diffbot
User-agent: FacebookBot
User-agent: FriendlyCrawler
User-agent: Google-Extended
User-agent: GoogleOther
User-agent: GPTBot
User-agent: img2dataset
User-agent: omgili
User-agent: omgilibot
User-agent: peer39_crawler
User-agent: peer39_crawler/1.0
User-agent: PerplexityBot
User-agent: YouBot
Disallow: /

# Marketing/SEO "intelligence" data scrapers
User-agent: AwarioRssBot
User-agent: AwarioSmartBot
User-agent: DataForSeoBot
User-agent: ImagesiftBot
User-agent: magpie-crawler
User-agent: Meltwater
User-agent: PiplBot
User-agent: scoop.it
User-agent: Seekr
Disallow: /

# Well-known.dev crawler. Indexes stuff under /.well-known.
# https://well-known.dev/about/
User-agent: WellKnownBot
Disallow: /

# Rules for everything else.
User-agent: *
Crawl-delay: 500

# API endpoints.
Disallow: /api/

# Auth/Sign in endpoints.
Disallow: /auth/
Disallow: /oauth/
Disallow: /check_your_email
Disallow: /wait_for_approval
Disallow: /account_disabled
Disallow: /signup

# Well-known endpoints.
Disallow: /.well-known/

# Fileserver/media.
Disallow: /fileserver/

# Fedi S2S API endpoints.
Disallow: /users/
Disallow: /emoji/

# Settings panels.
Disallow: /admin
Disallow: /user
Disallow: /settings/

# Domain blocklist.
Disallow: /about/suspended`
)

// robotsGETHandler serves a static robots.txt that disallows crawling
// of the API, auth pages, settings pages, etc.
//
// More granular robots meta tags are then applied to web pages
// depending on user preferences (see internal/web).
func (m *Module) robotsGETHandler(c *gin.Context) {
	c.String(http.StatusOK, robotsTxt)
}