#robots-txt — Public Fediverse posts on home.social

Habr @[email protected] · 2026-04-25 · 14:42 UTC

Five non-obvious things I learned while launching a movie social network: from a robots.txt trap to the 24-dimensional math of taste

For the past six months I have been working on VibeMuvik, a movie social network with reviews, debates, and synchronized film watching. It is one of those things that seem "not that hard" until you start digging. This article is about unexpected findings. It is not about "how I picked my stack" (boring) or "a WebRTC tutorial" (there are plenty already). It covers five situations where I stumbled, found something interesting, and thought "this is worth sharing, others will find it useful". Let's go.

https://habr.com/ru/articles/1027876/

#robotstxt #SEO #WebRTC #Nextjs #IndexNow #sitemap #Googlebot #Cinema_DNA #синхронный_просмотр #рекомендательные_системы

proedie @[email protected] · 2026-04-24 · 17:20 UTC

What does your robots.txt look like?

#question #fedipower #website #robotstxt

WIRED - The Latest in Technology, Science, Culture and Business [Unofficial] @[email protected] · 2026-04-22 · 09:30 UTC

The Pope’s Warnings About AI Were AI-Generated, a Detection Tool Claims

https://web.brid.gy/r/https://www.wired.com/story/pope-tweets-ai-generated-pangram-chrome-extension/

#culture #culturedigitalculture #artificialintelligence #socialmedia #twitter #reddit

Inautilo @[email protected] · 2026-04-21 · 19:59 UTC

#Development #Launches
Is Your Site Agent-Ready? · Scan your website for agent-friendly standards https://ilo.im/16c93a

_____
#Website #AI #Agents #MCP #Commerce #Content #RobotsTxt #Sitemap #WebDev #Frontend

PPC Land @[email protected] · 2026-04-05 · 14:36 UTC

Only 7.4% of Fortune 500 have an llms.txt file, study finds: ProGEO.ai research reveals just 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD for AI visibility. https://ppc.land/only-7-4-of-fortune-500-have-an-llms-txt-file-study-finds/ #Fortune500 #AI #llms #robotsTxt #JSONLD

Inautilo @[email protected] · 2026-04-01 · 13:29 UTC

#Development #Explainers
Inside Googlebot · How Google’s crawl system decides which content gets indexed https://ilo.im/16btho

_____
#Business #Google #SearchEngine #SEO #Crawlers #Content #RobotsTxt #Development #WebDev #Frontend

C. @[email protected] · 2026-03-28 · 07:01 UTC

Oh, this is #fun.

#Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.

I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.
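
A minimal sketch of what such a setup asks of a compliant crawler, using Python's standard `urllib.robotparser`. The file contents and the `example.com` URL are illustrative assumptions, not the poster's actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block both Apple tokens, allow everyone else.
ROBOTS_TXT = """\
User-agent: Applebot
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Allow: /
"""

def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

if __name__ == "__main__":
    for agent in ("Applebot", "Applebot-Extended", "SomeOtherBot"):
        print(agent, is_allowed(ROBOTS_TXT, agent, "https://example.com/page"))
```

This only shows what a well-behaved crawler should conclude from the file; it cannot show what a given crawler actually does.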

And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.

Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
e.g. https://support.apple.com/en-ca/119829
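
The DNS check mentioned above can be sketched as a reverse lookup followed by a forward confirmation. This is a sketch under assumptions: the `.applebot.apple.com` suffix is my understanding of Apple's published guidance and should be checked against the linked page, and `is_verified_crawler` is a hypothetical helper name. The resolvers are passed in as parameters so the logic can be exercised without network access.

```python
import socket

def is_verified_crawler(
    ip: str,
    allowed_suffixes: tuple = (".applebot.apple.com",),  # assumed suffix
    reverse=socket.gethostbyaddr,
    forward=socket.gethostbyname_ex,
) -> bool:
    """Reverse-resolve ip, check the hostname suffix, then forward-confirm."""
    try:
        hostname = reverse(ip)[0]
    except OSError:  # covers socket.herror and socket.gaierror
        return False
    if not hostname.lower().rstrip(".").endswith(allowed_suffixes):
        return False
    try:
        addresses = forward(hostname)[2]
    except OSError:
        return False
    # The claimed hostname must resolve back to the connecting address.
    return ip in addresses
```

The forward step matters because anyone who controls reverse DNS for their own address space can make it return an arbitrary hostname.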

So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.

Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.

#Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer

PPC Land @[email protected] · 2026-03-25 · 14:21 UTC

FYI: Czech publishers get new robots.txt shield against AI scrapers: SPIR on March 19 updated its standard for Czech online publishers to opt out of AI text and data mining, adding real-time response crawlers to the scope of the robots.txt framework. https://ppc.land/czech-publishers-get-new-robots-txt-shield-against-ai-scrapers/ #AI #robotstxt #datautajení #česképublikace #ochranadat

PPC Land @[email protected] · 2026-03-23 · 14:20 UTC

ICYMI: Czech publishers get new robots.txt shield against AI scrapers: SPIR on March 19 updated its standard for Czech online publishers to opt out of AI text and data mining, adding real-time response crawlers to the scope of the robots.txt framework. https://ppc.land/czech-publishers-get-new-robots-txt-shield-against-ai-scrapers/ #technologie #publikace #AI #robotstxt #czechpublishing

Frontend Dogma @[email protected] · 2026-03-13 · 13:00 UTC

The Dark Side of AI No One Talks About, by @jammer_volts (@mozseo.bsky.social):

https://moz.com/blog/dark-side-of-ai

#ai #seo #robotstxt

Inautilo @[email protected] · 2026-03-10 · 14:45 UTC

#Development #Findings
Markdown, llms.txt, and AI crawlers · Do Markdown and llms.txt matter for your website? https://ilo.im/16b5qb

_____
#Business #SEO #SearchEngines #AI #Crawlers #Content #Website #Markdown #LlmsTxt #RobotsTxt

Habr @[email protected] · 2026-03-03 · 06:02 UTC

AI is already reading your site, but by what rules? LLMs.txt, robots.txt, and agent control

Just a couple of years ago the web lived by a simple, understandable model: there are sites, there are search robots, and there are users. Robots arrive, crawl pages, and put them into an index; then the familiar fight for search rankings begins. For decades this logic determined how we build sites, configure SEO, and write robots.txt. With the arrival of LLM agents, this model has started to come apart at the seams.

https://habr.com/ru/articles/1004924/

#robotstxt #llmstxt #llms #llmsfulltxt #yandex #google

Inautilo @[email protected] · 2026-03-02 · 05:05 UTC

#Business #Reports
Anthropic details how Claude crawls sites · How to block the three separate user agents https://ilo.im/16ax7y

_____
#AI #Claude #Crawlers #UserAgents #RobotsTxt #Content #Website #WebDev #Frontend #Backend
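
A small sketch of generating a robots.txt group that blocks several user agents at once. The three tokens below are an assumption based on my understanding of Anthropic's published crawler names; the post itself does not list them, so verify against the linked report before relying on them. `disallow_all` is a hypothetical helper.

```python
# Assumed user-agent tokens; not taken from the post above.
ASSUMED_CLAUDE_AGENTS = ("ClaudeBot", "Claude-User", "Claude-SearchBot")

def disallow_all(agents) -> str:
    """Build one robots.txt group that disallows every path for the agents."""
    lines = [f"User-agent: {agent}" for agent in agents]
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    print(disallow_all(ASSUMED_CLAUDE_AGENTS), end="")
```

Listing several `User-agent` lines before a shared rule set keeps the file short; separate groups per agent work as well if the agents should be treated differently.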

Hacker News @[email protected] · 2026-02-23 · 13:10 UTC

Facebook's Fascination with My Robots.txt

https://blog.nytsoi.net/2026/02/23/facebook-robots-txt

#HackerNews #Facebook #RobotsTxt #SocialMedia #TechNews #WebCrawlers

Nathaniel Daught @[email protected] · 2026-02-20 · 19:37 UTC

Wow 28 new AI crawlers added to ai.robots.txt since I last updated in August.

https://github.com/ai-robots-txt/ai.robots.txt

#AI #webdev #robotstxt

Inautilo @[email protected] · 2026-02-10 · 03:57 UTC

#Development #Challenges
Webspace invaders · Let’s level up our anti-AI scraping game! https://ilo.im/16ahl8

_____
#AI #Crawlers #RobotsTxt #RateLimiting #WAFs #Cloudflare #IndieWeb #WebDev #Frontend #Backend

Habr @[email protected] · 2026-01-26 · 13:12 UTC

[Translation] The quiet death of robots.txt

For decades, robots.txt governed the behavior of web crawlers. But today, as unscrupulous AI companies chase ever larger volumes of data, the basic social contract of the web is starting to fall apart. For three decades, a tiny text file kept the internet from descending into chaos. This file had no particular legal or technical weight, and it was not even especially complicated. It is a handshake deal among the pioneers of the internet to respect each other's wishes and build the internet in a way that benefits everyone. It is a mini constitution for the internet, written in code.

The file is called robots.txt; it usually lives at yourwebsite.com/robots.txt. It lets anyone who runs a site, whether a small cooking blog or a multinational corporation, tell the web what is allowed on it and what is not. Which search engines may index your site? Which archival projects may download and preserve versions of a page? May competitors monitor your pages? You decide and declare it to the web. The system is not perfect, but it works. Or at least it used to.

For decades, the main target of robots.txt was search engines: the owner allowed scraping, and in return they promised to bring users to the site. Today AI has changed that equation: companies around the world use sites and their data to assemble huge training datasets in order to build models and products that may not acknowledge the original sources at all. The robots.txt file works on a give-and-take basis, but a great many people have the impression that AI companies only like to take. There is now so much money poured into AI, and technological progress is moving so fast, that many site owners cannot keep up. And the fundamental agreement underlying robots.txt, and the web as a whole, may be losing its force too.

https://habr.com/ru/companies/ruvds/articles/987416/

#robotstxt #вебкраулер #crawling #openai #ruvds_перевод

teufelswerk @[email protected] · 2026-01-24 · 10:46 UTC

Robots.txt Generator - Retro Terminal Edition - more than 200 bots in the free version. Pure HTML, JavaScript, and a bit of CSS. No third parties, no framework, no CDN, no cookies, no tracking, no ads, no Big Tech nonsense, no AI, very privacy-friendly. Simple and effective in retro style. Online soon.

#teufelswerk #HTML #javascript #app #entwicklung #code #retro #css #robotstxt #generator #stopbots #bots #crawler #scraper #keineKI #cookieless #datenschutz

Frontend Dogma @[email protected] · 2026-01-22 · 19:45 UTC

Generative AI, by @christianliebel and @yash-vekaria.bsky.social and others (@httparchive.org):

https://almanac.httparchive.org/en/2025/generative-ai

#webalmanac #studies #research #metrics #ai #robotstxt #llmstxt

Maciek @[email protected] · 2026-01-18 · 23:55 UTC

https://contentsignals.org is a nice idea, but if my reading of RFC 9309 is correct, it might lead to agent-specific blocks being ineffective for bots that don't recognise content signals, because in case of multiple sections of robots.txt matching, the "allow" rules take precedence over the "disallow" rules.
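
The precedence question can be made concrete with a small sketch of rule selection as I read RFC 9309: groups that match the crawler's product token are merged, the `*` group is used only when no specific group matches, the longest matching path wins, and `allow` wins a tie. This is a simplified model under those assumptions (plain prefix matching, no `*` or `$` wildcards), and `ExampleBot` is a made-up token.

```python
def rules_for(groups, product_token):
    """Select the rules for a crawler: merged specific groups, else '*'."""
    token = product_token.lower()
    matching = [rules for agents, rules in groups
                if token in {a.lower() for a in agents}]
    if not matching:
        # Fall back to the wildcard group only when nothing specific matched.
        matching = [rules for agents, rules in groups if "*" in agents]
    return [rule for rules in matching for rule in rules]

def is_allowed(rules, path):
    """Longest matching prefix wins; on equal length, 'allow' wins."""
    best_len, allowed = -1, True  # no matching rule means allowed
    for kind, prefix in rules:
        if prefix and path.startswith(prefix):
            n = len(prefix)
            if n > best_len or (n == best_len and kind == "allow"):
                best_len, allowed = n, kind == "allow"
    return allowed

# Wildcard group allows everything; a specific group blocks ExampleBot.
GROUPS = [
    (["*"], [("allow", "/")]),
    (["ExampleBot"], [("disallow", "/")]),
]
# Two groups naming the same agent get merged, so the tie-break applies.
SPLIT = [
    (["ExampleBot"], [("allow", "/")]),
    (["ExampleBot"], [("disallow", "/")]),
]
```

In this model the allow-over-disallow tie-break only comes into play when both rules end up in the same merged group; whether that happens for a given crawler depends on which `User-agent` lines it treats as matching.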

#robotsTxt

Leonardo Di Ottio @[email protected] · 2026-01-12 · 19:30 UTC

@piccalilli My (admittedly cynical) assumption is that they will still hoover up anything they can find on your site, they’re just no longer showing it to anyone outside Google.

#Google #RobotsTxt #SEO

Mikel - Covivienda rural Bioketa @[email protected] · 2026-01-01 · 09:53 UTC

🌐🌿 Sustainable web practices:

Disallowing web crawlers? Only allowing the 2-3 most sustainable web crawlers? Only getting visitors from direct recommendations? Is editing robots.txt enough?

What do you think?

#noBot #noBigTech #searchEngine #AICrawler #robotsTxt #sustainability #lowTech #solarPunk #slowWeb #smallWeb

Techdirt [Unofficial] @[email protected] · 2025-12-24 · 19:05 UTC

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

https://fed.brid.gy/r/https://www.techdirt.com/2025/12/24/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/

#google #reddit #serpapi #anticircumvention #circumvention #copyright

BlablaLinux @[email protected] · 2025-12-23 · 22:48 UTC

👉 Find the configurations for my 15 services (WordPress, Mastodon, Gitea...) here: 🔗 https://wiki.blablalinux.be/fr/gestion-centralisee-robots-txt-nginx-proxy-manager

It's a gift, it's sharing, and it's on the Wiki! 🐧🚀

#BlablaLinux #SysAdmin #SelfHosted #NPM #RobotsTxt #OpenSource #LogicielLibre

Marcel SIneM(S)US @[email protected] · 2025-12-22 · 16:07 UTC

#RSL 1.0 instead of robots.txt: New standard for internet content | heise online https://www.heise.de/news/RSL-1-0-Standard-soll-Verwendung-von-Inhalten-regeln-11111422.html #searchengines #searchengine #ArtificialIntelligence #crawler #ReallySimpleLicensing #robotsTXT

Jonathan Bailey @[email protected] · 2025-12-17 · 21:29 UTC

Robots.txt has had a good 30+ year run, but it's time to realize that it's not just losing relevance, it's dying. AI companies ultimately are what killed it.

https://www.plagiarismtoday.com/2025/12/17/the-death-of-robots-txt/

#Copyright #DMCA #AI #RobotsTXT #Scraping
