#robotstxt — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #robotstxt, aggregated by home.social.
-
CW: google AI vs. your personal webpage
Now that Google have announced their intention to discontinue Web Search, and replace it with LLM Summaries Only, I advise everyone who cares to update their robots.txt to disallow Googlebot (their original search index robot).
Because after summer, there will be no web search to index your website, the data gathered by the "good" index robot will at best be discarded, or at worst, be fed to LLMs, leading to plagiarism of your content. In either scenario, no human visitors will be guided to your website.
(If you don't want to be indexed by any search engine at all, you can disallow all robots. You can also add the noindex meta tag to all your pages.) -
CW: google AI vs. your personal webpage
Now that Google have announced their intention to discontinue Web Search, and replace it with LLM Summaries Only, I suggest updating your robots.txt to disallow Googlebot (their original search index robot).
Because after summer, there will be no web search to index your website, the data gathered by the "good" index robot will at best be discarded, or at worst, be fed to LLMs, leading to plagiarism of your content. In either scenario, no human visitors will be guided to your website.
(If you don't want to be indexed by any search engine at all, you can disallow all robots. You can also add the noindex meta tag to all your pages.)
Edit: robots.txt is not a catch-all, since it is only a suggestion for "nice" crawlers, and its instructions will be ignored by "nasty" crawlers. I reworded the post from "advise" to "suggest", since this action might not actually change anything for you.
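For anyone wondering what that would look like in practice, here is a minimal sketch using Python's standard urllib.robotparser to check two robots.txt variants: one that disallows only Googlebot, and one that disallows every robot. The example.org URL and the SomeOtherBot name are placeholders, not anything from the post, and the check only shows what a compliant ("nice") crawler would conclude.

```python
from urllib.robotparser import RobotFileParser

# Variant 1: disallow only Googlebot, leave every other robot alone.
BLOCK_GOOGLEBOT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""

# Variant 2: disallow all robots.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# The page-level alternative mentioned in the post, placed in each page's <head>.
NOINDEX_META = '<meta name="robots" content="noindex">'


def allowed(robots_txt, agent, url="https://example.org/blog/post"):
    """Return True if a compliant crawler named `agent` may fetch `url`.

    The default URL is a placeholder for illustration.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


print(allowed(BLOCK_GOOGLEBOT, "Googlebot"))     # False
print(allowed(BLOCK_GOOGLEBOT, "SomeOtherBot"))  # True
print(allowed(BLOCK_ALL, "SomeOtherBot"))        # False
```

None of this stops a crawler that chooses not to read the file.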
-
Robots.txt remains the basic signal for well-behaved crawlers, but it can no longer describe the main problem: the same public content can serve classic search, AI answers, model training, and retrieval at a user's request. Site operators therefore need to distinguish the purpose of access, verify bot identity, measure the impact on infrastructure, and, for valuable content, enforce their rules by means beyond robots.txt itself.
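One way to approach the "verify bot identity" step is the reverse-then-forward DNS check that Google describes for Googlebot. The following is a rough sketch rather than the article's own method: the function name, the injectable lookup parameters, and the suffix list are assumptions for illustration and should be checked against the crawler operator's current documentation.

```python
import socket

# Hostname suffixes assumed for this sketch; confirm against Google's docs.
GOOGLE_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_googlebot(ip, reverse_lookup=None, forward_lookup=None):
    """Check a claimed Googlebot IP with a reverse and then a forward DNS lookup.

    The two lookup parameters are hypothetical hooks so the logic can be
    exercised without network access; by default the socket module is used
    (gethostbyname_ex resolves IPv4 addresses only).
    """
    if reverse_lookup is None:
        reverse_lookup = lambda addr: socket.gethostbyaddr(addr)[0]
    if forward_lookup is None:
        forward_lookup = lambda host: socket.gethostbyname_ex(host)[2]

    try:
        host = reverse_lookup(ip)
    except OSError:
        return False  # no PTR record, so the claim cannot be confirmed

    # Step 1: the reverse name must sit under an expected domain.
    if not host.lower().rstrip(".").endswith(GOOGLE_SUFFIXES):
        return False

    # Step 2: the name must resolve back to the same IP, otherwise the
    # PTR record could simply have been forged by whoever owns the address.
    try:
        return ip in forward_lookup(host)
    except OSError:
        return False
```

A User-Agent header alone proves nothing, which is why the check starts from the connecting IP address.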
https://zdrojak.cz/clanky/robots-txt-nestaci-ai-crawleri-meni-jak-weby-chrani-obsah/ -
So, with Google announcing "Search is going full-AI, we won't be sending traffic to the original sites any more", someone else pointed out that this eradication of the traditional search-engine compact - we let you crawl our sites to create your index, and you send visitors to our sites when relevant - means that we can, and should, block all of Google's crawlers now. If they're going to just take, take, take and give nothing back, why let them access your content at all?
But this is cute. Besides the fact that Google documents that some of their crawlers ignore robots.txt, there's this bit of fun. On this page (https://developers.google.com/crawling/docs/robots-txt/create-robots-txt), they link to "the Google list of user agents" (https://developers.google.com/crawling/docs/crawlers-fetchers/overview-google-crawlers).
However, that links to 3 separate pages of them, and *each of those pages explicitly states that it is not comprehensive, but only lists the ones they commonly get questions about*. And of course, none of the "User-triggered fetchers" obey robots.txt, along with some others.
So Google isn't even reporting the full list of user-agents that can be used to stop their crawling.
That is some bullshit.
#Google #crawler #RobotsTxt #UserAgent #bullshit #antisocial #web #search #WebSearch #LLM #AI
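To make the problem concrete, here is a small sketch that builds a single robots.txt group from a list of Google user-agent tokens and checks it with Python's standard urllib.robotparser. The token list is an assumption for illustration: these are names Google has published, but as noted above the published lists are not comprehensive, so a file like this can only ever cover the crawlers you know about. The helper name is hypothetical too.

```python
from urllib.robotparser import RobotFileParser

# Tokens assumed for illustration; Google's own lists say they are not exhaustive.
GOOGLE_TOKENS = [
    "Googlebot",
    "Google-Extended",
    "GoogleOther",
    "AdsBot-Google",
    "Storebot-Google",
]


def build_block_group(tokens):
    """Build one robots.txt group that disallows everything for each token."""
    lines = [f"User-agent: {token}" for token in tokens]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


robots_txt = build_block_group(GOOGLE_TOKENS)

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Every listed token is blocked; an unlisted crawler is not affected.
blocked = {token: not parser.can_fetch(token, "/") for token in GOOGLE_TOKENS}
print(robots_txt)
print(blocked)
```

Any crawler whose token is missing from the list, or that does not honour robots.txt at all, is untouched by this file.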
-
Scrapers vs Wikis: Person who runs a bunch of custom Wiki websites writes about abuse from scrapers
https://weirdgloop.org/blog/clankers
#via:lobsters #robotstxt #scraping #scaling #wiki #web #ai #+ -
just over here modifying #robotstxt to block everything #google like a normal person in 2026
-
Five non-obvious things I learned while launching a film social network: from a robots.txt trap to the 24-dimensional math of taste
For the past six months I have been working on VibeMuvik, a film social network with reviews, debates, and synchronized movie watching. It is one of those things that seem "not that hard" until you start digging. This article is about unexpected discoveries. Not about "how I chose my stack" (boring) and not a "WebRTC tutorial" (there are plenty without me). These are five situations where I stumbled, found something interesting, and thought "this is worth sharing, others will find it useful". Let's go.
https://habr.com/ru/articles/1027876/
#robotstxt #SEO #WebRTC #Nextjs #IndexNow #sitemap #Googlebot #Cinema_DNA #синхронный_просмотр #рекомендательные_системы
-
What does your robots.txt look like?
-
The Pope’s Warnings About AI Were AI-Generated, a Detection Tool Claims
https://fed.brid.gy/r/https://www.wired.com/story/pope-tweets-ai-generated-pangram-chrome-extension/
-
#Development #Launches
Is Your Site Agent-Ready? · Scan your website for agent-friendly standards https://ilo.im/16c93a_____
#Website #AI #Agents #MCP #Commerce #Content #RobotsTxt #Sitemap #WebDev #Frontend -
FYI: Only 7.4% of Fortune 500 have an llms.txt file, study finds: ProGEO.ai research reveals just 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD for AI visibility. https://ppc.land/only-7-4-of-fortune-500-have-an-llms-txt-file-study-finds/ #LLMSTXT #Fortune500 #AIVisibility #RobotsTxt #JSONLD
-
Only 7.4% of Fortune 500 have an llms.txt file, study finds: ProGEO.ai research reveals just 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD for AI visibility. https://ppc.land/only-7-4-of-fortune-500-have-an-llms-txt-file-study-finds/ #Fortune500 #AI #llms #robotsTxt #JSONLD
-
#Development #Explainers
Inside Googlebot · How Google’s crawl system decides which content gets indexed https://ilo.im/16btho_____
#Business #Google #SearchEngine #SEO #Crawlers #Content #RobotsTxt #Development #WebDev #Frontend