home.social

#robotstxt — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #robotstxt, aggregated by home.social.

  1. CW: google AI vs. your personal webpage

    Now that Google have announced their intention to discontinue Web Search and replace it with LLM Summaries Only, I advise everyone who cares to update their robots.txt to disallow Googlebot (their original search index robot).

    After summer, there will be no web search to index your website for, so the data gathered by the "good" index robot will at best be discarded or, at worst, be fed to LLMs, leading to plagiarism of your content. In either scenario, no human visitors will be guided to your website.

    (If you don't want to be indexed by any search engine at all, you can disallow all robots. You can also add the noindex meta tag to all your pages.)

    #robotstxt #webSearch #noAI
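
The robots.txt change suggested in the post above could look roughly like the sketch below. It uses Python's standard `urllib.robotparser` to check what a rule set would mean for a crawler that honours it. The rule set, the `example.org` URL, and the `SomeOtherBot` name are illustrative assumptions, not taken from the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Googlebot everywhere, leave other crawlers alone.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Disallow:
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# What a compliant crawler would conclude for a sample URL.
googlebot_allowed = parser.can_fetch("Googlebot", "https://example.org/about.html")
other_allowed = parser.can_fetch("SomeOtherBot", "https://example.org/about.html")
print("Googlebot allowed:", googlebot_allowed)
print("Other bots allowed:", other_allowed)
```

This only models how a well-behaved crawler reads the file; it does not stop a crawler that chooses to ignore robots.txt.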

  3. CW: google AI vs. your personal webpage

    Now that Google have announced their intention to discontinue Web Search and replace it with LLM Summaries Only, I suggest updating your robots.txt to disallow Googlebot (their original search index robot).

    After summer, there will be no web search to index your website for, so the data gathered by the "good" index robot will at best be discarded or, at worst, be fed to LLMs, leading to plagiarism of your content. In either scenario, no human visitors will be guided to your website.

    (If you don't want to be indexed by any search engine at all, you can disallow all robots. You can also add the noindex meta tag to all your pages.)

    Edit: robots.txt is not a catch-all, since it is only a suggestion for "nice" crawlers, and its instructions will be ignored by "nasty" crawlers. I reworded the post from "advise" to "suggest", since this action might not actually change anything for you.

    #robotstxt #webSearch #noAI
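
For the noindex meta tag mentioned in the post, here is a minimal sketch of how one might check whether a page already carries it, using only Python's standard `html.parser`. The class and function names (`RobotsMetaParser`, `has_noindex`) and the sample markup are hypothetical, and the check covers only `<meta name="robots">` tags, not HTTP headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute names arrive lowercased; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").strip().lower() != "robots":
            return
        for directive in attr.get("content", "").split(","):
            directive = directive.strip().lower()
            if directive:
                self.directives.add(directive)


def has_noindex(html: str) -> bool:
    """Return True if the markup contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


# Hypothetical page head for illustration.
SAMPLE_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
print(has_noindex(SAMPLE_PAGE))
```

A crawler has to be able to fetch a page to see its meta tag, so a page that is disallowed in robots.txt will not have its noindex tag read by a crawler that respects the disallow rule.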

  6. Robots.txt remains the basic signal for well-behaved crawlers, but it can no longer describe the main problem: the same public content can serve classic search, AI answers, model training, and fetching at a user's request. A site operator therefore has to separate the purposes of access, verify bot identity, measure the impact on infrastructure, and, for valuable content, also enforce the rules outside robots.txt itself.

    https://zdrojak.cz/clanky/robots-txt-nestaci-ai-crawleri-meni-jak-weby-chrani-obsah/
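
One way to approach the "verify bot identity" step from the post above is a reverse DNS lookup followed by a forward confirmation. The sketch below is an assumption about how that could be done in Python, not the linked article's method. The hostname suffixes in `TRUSTED_SUFFIXES` are assumed values for Google's crawlers and should be checked against the operator's current documentation. The lookups are injectable so the logic can be exercised without network access.

```python
import socket
from typing import Callable, Sequence

# Assumed hostname suffixes for Google's crawlers; verify against current docs.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com")


def reverse_lookup(ip: str) -> str:
    """Return the primary hostname for an IP address (PTR lookup)."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> Sequence[str]:
    """Return the IPv4 addresses a hostname resolves to."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    suffixes: Sequence[str] = TRUSTED_SUFFIXES,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], Sequence[str]] = forward_lookup,
) -> bool:
    """Check that the IP's hostname has a trusted suffix and resolves back to the IP."""
    try:
        host = reverse(ip).rstrip(".").lower()
        if not host.endswith(tuple(suffixes)):
            return False
        # Forward-confirm: the claimed hostname must map back to the same IP.
        return ip in forward(host)
    except OSError:
        # Covers socket.herror and socket.gaierror (lookup failures).
        return False
```

With the default resolvers this only handles IPv4, since `gethostbyname_ex` does not return IPv6 addresses. A call such as `is_verified_crawler(request_ip)` would then sit in front of any rule that treats a request as coming from a known crawler.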
  11. So, with Google announcing "Search is going full-AI, we won't be sending traffic to the original sites any more", someone else pointed out that this eradication of the traditional search-engine compact - we let you crawl our sites to create your index, and you send visitors to our sites when relevant - means that we can, and should, block all of Google's crawlers now. If they're going to just take, take, take and give nothing back, why let them access your content at all?

    But this is cute. Besides the fact that Google documents that some of their crawlers ignore robots.txt, there's this bit of fun. On this page (developers.google.com/crawling), they link to "the Google list of user agents" (developers.google.com/crawling).

    However, that list is split across 3 separate pages, and *each of those pages explicitly states that it is not comprehensive and only covers the user agents they commonly get questions about*. And of course, none of the "User-triggered fetchers" obey robots.txt, and neither do some others.

    So Google isn't even reporting the full list of user-agents that can be used to stop their crawling.

    That is some bullshit.

    #Google #crawler #RobotsTxt #UserAgent #bullshit #antisocial #web #search #WebSearch #LLM #AI
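
Since the post points out that some fetchers do not obey robots.txt, one possible complement is to refuse requests on the server side. The sketch below is a hypothetical WSGI middleware that returns 403 when the User-Agent header contains a blocked substring. The names `block_user_agents`, `demo_app`, and `BLOCKED_UA_SUBSTRINGS` are made up for illustration, and the substring list is deliberately short and incomplete, in line with the post's observation that no comprehensive list is published.

```python
# Illustrative, incomplete list of lowercase User-Agent substrings to refuse.
BLOCKED_UA_SUBSTRINGS = ("googlebot", "googleother")


def block_user_agents(app, blocked=BLOCKED_UA_SUBSTRINGS):
    """Wrap a WSGI app and answer 403 for requests with a blocked User-Agent."""

    def middleware(environ, start_response):
        user_agent = environ.get("HTTP_USER_AGENT", "").lower()
        if any(substring in user_agent for substring in blocked):
            start_response(
                "403 Forbidden",
                [("Content-Type", "text/plain; charset=utf-8")],
            )
            return [b"Forbidden\n"]
        return app(environ, start_response)

    return middleware


def demo_app(environ, start_response):
    """Stand-in application used only for this example."""
    start_response("200 OK", [("Content-Type", "text/plain; charset=utf-8")])
    return [b"Hello\n"]


guarded_app = block_user_agents(demo_app)
```

The User-Agent header is self-reported and easy to spoof, so a check like this only affects clients that identify themselves honestly.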

  16. Scrapers vs Wikis: Person who runs a bunch of custom Wiki websites writes about abuse from scrapers
    weirdgloop.org/blog/clankers
    #via:lobsters #robotstxt #scraping #scaling #wiki #web #ai #+

  20. just over here modifying #robotstxt to block everything #google like a normal person in 2026

  25. Five non-obvious things I learned while launching a film social network: from a robots.txt trap to the 24-dimensional mathematics of taste

    For the past six months I have been working on VibeMuvik, a film social network with reviews, debates, and synchronized movie watching. It is one of those things that seem "not that hard, really" until you start digging. This article is about unexpected discoveries. It is not about "how I chose my stack" (boring) and not a "WebRTC tutorial" (there are plenty without me). These are five situations where I stumbled, found something interesting, and thought "this is worth sharing, others will find it useful". Let's go.

    habr.com/ru/articles/1027876/

    #robotstxt #SEO #WebRTC #Nextjs #IndexNow #sitemap #Googlebot #Cinema_DNA #синхронный_просмотр #рекомендательные_системы

  29. FYI: Only 7.4% of Fortune 500 have an llms.txt file, study finds: ProGEO.ai research reveals just 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD for AI visibility. ppc.land/only-7-4-of-fortune-5 #LLMSTXT #Fortune500 #AIVisibility #RobotsTxt #JSONLD

  32. Only 7.4% of Fortune 500 have an llms.txt file, study finds: ProGEO.ai research reveals just 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD for AI visibility. ppc.land/only-7-4-of-fortune-5 #Fortune500 #AI #llms #robotsTxt #JSONLD
