#robotstxt — Public Fediverse posts on home.social

Habr @[email protected] · 2026-04-25 · 14:42 UTC

Five non-obvious things I learned while launching a movie social network: from a robots.txt trap to the 24-dimensional mathematics of taste

For the past six months I have been working on VibeMuvik, a movie social network with reviews, debates, and synchronized film watching. It is one of those things that seem "not that hard" until you start digging. This article is about unexpected discoveries. It is not about "how I chose the stack" (boring) and it is not a "WebRTC tutorial" (there are plenty without mine). These are five situations where I stumbled, found something interesting, and thought: "this is worth telling, others will find it useful." Let's go.

https://habr.com/ru/articles/1027876/

#robotstxt #SEO #WebRTC #Nextjs #IndexNow #sitemap #Googlebot #Cinema_DNA #синхронный_просмотр #рекомендательные_системы

#рекомендательные_системы #синхронный_просмотр #cinema_dna #googlebot #sitemap #indexnow

WIRED - The Latest in Technology, Science, Culture and Business [Unofficial] @[email protected] · 2026-04-22 · 09:30 UTC

The Pope’s Warnings About AI Were AI-Generated, a Detection Tool Claims

https://fed.brid.gy/r/https://www.wired.com/story/pope-tweets-ai-generated-pangram-chrome-extension/

#culture #culturedigitalculture #artificialintelligence #socialmedia #twitter #reddit

Inautilo @[email protected] · 2026-04-21 · 19:59 UTC

#Development #Launches
Is Your Site Agent-Ready? · Scan your website for agent-friendly standards https://ilo.im/16c93a

_____
#Website #AI #Agents #MCP #Commerce #Content #RobotsTxt #Sitemap #WebDev #Frontend

#development #launches #website #ai #agents #mcp

PPC Land @[email protected] · 2026-04-05 · 14:36 UTC

Only 7.4% of Fortune 500 have an llms.txt file, study finds: ProGEO.ai research reveals just 7.4% of Fortune 500 companies have implemented llms.txt, while 92.8% use robots.txt and 53.8% use JSON-LD for AI visibility. https://ppc.land/only-7-4-of-fortune-500-have-an-llms-txt-file-study-finds/ #Fortune500 #AI #llms #robotsTxt #JSONLD

#fortune500 #ai #llms #robotstxt #jsonld

Inautilo @[email protected] · 2026-04-01 · 13:29 UTC

#Development #Explainers
Inside Googlebot · How Google’s crawl system decides which content gets indexed https://ilo.im/16btho

_____
#Business #Google #SearchEngine #SEO #Crawlers #Content #RobotsTxt #Development #WebDev #Frontend

#content #robotstxt #webdev #frontend #development #explainers

C. @[email protected] · 2026-03-28 · 07:01 UTC

Oh, this is #fun.

#Applebot - Apple's web crawler, used for various things - is ignoring robots.txt rules governing crawling of websites.

I have Applebot (and Applebot-Extended, which isn't really a crawler) in my robots.txt files, set to disallow all access. Has been that way for #yonks.

And Applebot is consistently the highest-traffic crawler to my sites - at least of ones that actually bother to fetch robots.txt. Yesterday, for example, Applebot fetched robots.txt from one of my websites almost 800 times.

Yes, it's really Apple, not someone faking the user-agent identifier. It's coming from the networks that Apple says can be used to identify Applebot access. DNS matches, everything.
e.g. https://support.apple.com/en-ca/119829

So: legendary Apple software quality. Documented to do the right thing, but actually doing the wrong thing. And completely failing to cache content, fetching the same file 800 times a day when it hasn't changed in years.

Hey, Apple! Need a software engineer who's actually, you know, good at it? I'm available.

#Apple #AppleInc #TimApple #WebCrawler #RobotsTxt #quality #WeveHeardOfIt #qwality #AppleQwality #legendary #TwoHardThings #caching #fail #engineer #software #SoftwareEngineer

#fun #applebot #yonks #apple #appleinc #timapple
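The setup C. describes above (Applebot and Applebot-Extended both set to disallow all access) can be sketched with Python's standard `urllib.robotparser`. The file contents, the `is_allowed` helper, and the example.com URL are assumptions made for illustration; they are not C.'s actual robots.txt. The sketch only shows what a compliant crawler should conclude from such rules, not what Applebot does in practice.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the spirit of the post: both Apple agents are
# denied everything, while every other crawler is allowed.
ROBOTS_TXT = """\
User-agent: Applebot
User-agent: Applebot-Extended
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if the rules in robots_txt permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Applebot", "Applebot-Extended", "Googlebot"):
        print(agent, is_allowed(agent, "https://example.com/page"))
```

A well-behaved crawler would also cache the fetched file rather than requesting it hundreds of times a day, which is the second complaint in the post.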

Inautilo @[email protected] · 2026-03-10 · 14:45 UTC

#Development #Findings
Markdown, llms.txt, and AI crawlers · Do Markdown and llms.txt matter for your website? https://ilo.im/16b5qb

_____
#Business #SEO #SearchEngines #AI #Crawlers #Content #Website #Markdown #LlmsTxt #RobotsTxt

#development #findings #business #seo #searchengines #ai

Habr @[email protected] · 2026-03-03 · 06:02 UTC

AI is already reading your site, but by what rules? LLMs.txt, robots.txt, and controlling agents

Just a couple of years ago the web lived by a simple, clear model: there are sites, there are search robots, there are users. Robots arrive, crawl the pages, and put them in the index; then the familiar fight for search rankings begins. For decades this logic determined how we build sites, configure SEO, and write robots.txt. With the arrival of LLM agents, this model has started to come apart at the seams.

https://habr.com/ru/articles/1004924/

#robotstxt #llmstxt #llms #llmsfulltxt #yandex #google

#google #yandex #llmsfulltxt #llms #llmstxt #robotstxt

Habr @[email protected] · 2026-01-26 · 13:12 UTC

[Translation] The quiet death of robots.txt

For decades, robots.txt governed the behavior of web crawlers. But today, as unscrupulous AI companies chase ever larger volumes of data, the basic social contract of the web is starting to fall apart. For three decades, a tiny text file kept the internet from descending into chaos. The file carried no particular legal or technical weight, and it was not even especially complicated. It is a handshake agreement among the pioneers of the internet to respect each other's wishes and build the internet so that everyone benefits. It is a mini-constitution for the internet, written in code. The file is called robots.txt; it usually lives at yourwebsite.com/robots.txt. It lets anyone who owns a site, whether a small cooking blog or a multinational corporation, tell the web what is allowed on it and what is not. Which search engines may index your site? Which archival projects may download and save versions of a page? May competitors monitor your pages? You decide and announce it to the web. The system is imperfect, but it works. Or at least it used to. For decades the main target of robots.txt was search engines: the owner let them scrape, and in return they promised to bring users to the site. Today AI has changed that equation: companies around the world use sites and their data to assemble enormous training datasets, in order to build models and products that may not acknowledge the original sources at all. The robots.txt file works on a give-and-take basis, but a great many people have the impression that AI companies only like to take. So much money has been poured into AI, and the technology is advancing so fast, that many site owners cannot keep up. And the fundamental agreement underlying robots.txt and the web as a whole may be losing its force too.

https://habr.com/ru/companies/ruvds/articles/987416/

#robotstxt #вебкраулер #crawling #openai #ruvds_перевод

#ruvds_перевод #openai #crawling #вебкраулер #robotstxt

Frontend Dogma @[email protected] · 2026-01-22 · 19:45 UTC

Generative AI, by @christianliebel and @yash-vekaria.bsky.social and others (@httparchive.org):

https://almanac.httparchive.org/en/2025/generative-ai

#webalmanac #studies #research #metrics #ai #robotstxt #llmstxt

#llmstxt #robotstxt #ai #metrics #research #studies

Techdirt [Unofficial] @[email protected] · 2025-12-24 · 19:05 UTC

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

https://fed.brid.gy/r/https://www.techdirt.com/2025/12/24/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/

#google #reddit #serpapi #anticircumvention #circumvention #copyright

Techdirt [Unofficial] @[email protected] · 2025-12-24 · 19:05 UTC

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

https://fed.brid.gy/r/https://www.techdirt.com/2025/12/24/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/

#webcrawling #robotstxt #openweb #licensing #dmca1201 #copyright

Marcel SIneM(S)US @[email protected] · 2025-12-22 · 16:07 UTC

#RSL 1.0 instead of robots.txt: New standard for internet content | heise online https://www.heise.de/news/RSL-1-0-Standard-soll-Verwendung-von-Inhalten-regeln-11111422.html #searchengines #searchengine #ArtificialIntelligence #crawler #ReallySimpleLicensing #robotsTXT

#rsl #searchengines #searchengine #artificialintelligence #crawler #reallysimplelicensing

Le site de Korben [Unofficial] @[email protected] · 2025-12-15 · 12:11 UTC

How do you block the AI crawlers that plunder your site without asking your permission?

https://fed.brid.gy/r/https://korben.info/bloquer-crawlers-ia-robots-txt-htaccess-nginx.html

#linuxopensourceadministrationserveur #linuxopensourcelogicielslibres #apache #claudebot #crawlersia #gptbot

DrWeb @[email protected] · 2025-12-07 · 20:25 UTC

The New York Times sues Perplexity for producing ‘verbatim’ copies of its work – The Verge

The NYT alleges Perplexity ‘unlawfully crawls, scrapes, copies, and distributes’ work from its website.

By Emma Roth, Dec 5, 2025, 7:42 AM PS. Emma Roth is a news writer who covers the streaming wars, consumer tech, crypto, social media, and much more. Previously, she was a writer and editor at MUO.

The New York Times has escalated its legal battle against the AI startup Perplexity, as it’s now suing the AI “answer engine” for allegedly producing and profiting from responses that are “verbatim or substantially similar copies” of the publication’s work.

The lawsuit, filed in a New York federal court on Friday, claims Perplexity “unlawfully crawls, scrapes, copies, and distributes” content from the NYT. It comes after the outlet’s repeated demands for Perplexity to stop using content from its website, as the NYT sent cease-and-desist notices to the AI startup last year and most recently in July, according to the lawsuit. The Chicago Tribune also filed a copyright lawsuit against Perplexity on Thursday.

The New York Times sued OpenAI for copyright infringement in December 2023, and later inked a deal with Amazon, bringing its content to products like Alexa.

Perplexity became the subject of several lawsuits after reporting from Forbes and Wired revealed that the startup had been skirting websites’ paywalls to provide AI-generated summaries, and in some cases copies, of their work. The NYT makes similar accusations in its lawsuit, stating that Perplexity’s crawlers “have intentionally ignored or evaded technical content protection measures,” such as the robots.txt file, which indicates the parts of a website crawlers can access.

Perplexity attempted to smooth things over by launching a program to share ad revenue with publishers last year, which it later expanded to include its Comet web browser in August.

“By copying The Times’s copyrighted content and creating substitutive output derived from its works, obviating the need for users to visit The Times’s website or purchase its newspaper, Perplexity is misappropriating substantial subscription, advertising, licensing, and affiliate revenue opportunities that belong rightfully and exclusively to The Times,” the lawsuit states.

Continue/Read Original Article Here: The New York Times sues Perplexity for producing ‘verbatim’ copies of its work | The Verge

Tags: AI, artificial intelligence, Copyright, Crawlers, Distribution, Lawsuit, NYT Work, OpenAI, Perplexity, Robots.txt, Scrapping, Sues, The New York Times, The Verge, Verbatim Copies

#AI #artificialIntelligence #Copyright #Crawlers #Distribution #Lawsuit #NYTWork #OpenAI #Perplexity #RobotsTxt #Scrapping #Sues #TheNewYorkTimes #TheVerge #VerbatimCopies

#ai #artificialintelligence #copyright #crawlers #distribution #lawsuit

Inautilo @[email protected] · 2025-11-14 · 11:05 UTC

#Development #Approaches
Rate-limiting requests with Nginx · An alternative approach to counter AI crawlers https://ilo.im/168axr

_____
#RateLimiting #Nginx #WebServer #AI #Scrapers #RobotsTxt #DevOps #WebDev #Backend

#development #approaches #ratelimiting #nginx #webserver #ai

Winbuzzer @[email protected] · 2025-10-06 · 13:08 UTC

Cloudflare Overhauls Web’s AI Rulebook with New Robots.txt ‘Content Signals’

#AI #Cloudflare #RobotsTxt #DataScraping #Publishing #GenerativeAI

https://winbuzzer.com/2025/10/06/cloudflare-overhauls-webs-ai-rulebook-with-new-robots-txt-content-signals-xcxwbn

#ai #cloudflare #robotstxt #datascraping #publishing #generativeai

NERDS.xyz – Real Tech News for Real Nerds [Unofficial] @[email protected] · 2025-09-24 · 13:29 UTC

Cloudflare launches Content Signals Policy to fight AI crawlers and scrapers

https://web.brid.gy/r/https://nerds.xyz/2025/09/cloudflare-content-signals-policy-ai-crawlers/

#artificialintelligence #aicrawlers #aitraining #cloudflare #contentsignalspolicy #datascraping

Inautilo @[email protected] · 2025-09-13 · 18:49 UTC

#Business #Initiatives
AI’s free web scraping days may be over · Say hello to RSS’s younger, tougher brother https://ilo.im/166s9q

_____
#Web #Publishing #Website #Blog #Content #AI #Crawlers #Payments #RSL #RSS #RobotsTxt

#business #initiatives #web #publishing #website #blog

tech news ᳇ eicker.news @[email protected] · 2025-09-11 · 09:29 UTC

A new #licensingstandard, #ReallySimpleLicensing (#RSL), aims to allow #webpublishers to set terms for #AI companies using their content. Supported by major brands like Reddit and Yahoo, RSL builds upon the existing #robotstxt protocol, enabling #publishers to specify #licensing and #royaltyterms for #AItraining data. https://www.theverge.com/news/775072/rsl-standard-licensing-ai-publishing-reddit-yahoo-medium?eicker.news #tech #media #news

#licensingstandard #reallysimplelicensing #rsl #webpublishers #ai #robotstxt

Hostvix @[email protected] · 2025-09-11 · 09:27 UTC

RSL is the missing layer for the AI era: set terms, get attribution, and get paid (per crawl or per inference). Open standard, collective leverage. If AI uses your work, it should respect your license. Time to take control.

https://hostvix.com/rsl-a-new-standard-to-make-ai-pay-for-the-content-it-consumes/

#RSL #ReallySimpleLicensing #AI #AIethics #AIsafety #AIdata #ContentRights #Licensing #OpenWeb #RobotsTxt #Publishers #Creators #Attribution #PayPerCrawl #PayPerInference #RSS #WebStandards #DigitalRights #CollectiveLicensing #Fastly

#rsl #reallysimplelicensing #ai #aiethics #aisafety #aidata

teufelswerk @[email protected] · 2025-08-06 · 11:04 UTC

Semrush is one of the best-known SEO analysis tools on the market. It regularly crawls websites with its bot (SemrushBot) to collect and analyze data such as keywords, backlinks, rankings, and much more from your website. Here are 5 effective, quick-to-implement methods for locking Semrush out of your website. 👇

https://teufelswerk.net/semrushbot-blockieren-so-schuetzt-du-jede-website-egal-ob-wordpress-joomla-typo3-oder-statisch/

#SEO #semrush #botblocker #bots #website #websecurity #cybersecurity #wordpress #joomla #typo3 #nginx #robotstxt #htaccess

#seo #semrush #botblocker #bots #website #websecurity

apfeltalk :verified: @[email protected] · 2025-08-05 · 13:00 UTC

Perplexity ignores robots.txt: controversy over data scraping for AI training
Training large language models relies on a wide variety of web data. Die Einhaltu
https://www.apfeltalk.de/magazin/news/perplexity-ignoriert-robots-txt-kontroverse-um-daten-scraping-fuer-ki-training/
#News #Apple #Applebot #Cloudflare #Cybersecurity #Datenanalyse #Datensicherheit #EthikInDerKI #KITraining #KnstlicheIntelligenz #OpenWeb #Perplexity #robotstxt #Sprachmodell #WebScraping #WebseitenBetreiber

#news #apple #applebot #cloudflare #cybersecurity #datenanalyse

Inautilo @[email protected] · 2025-07-18 · 13:05 UTC

#Business #Explainers
LLMS.txt isn’t robots.txt · What it is, why it matters, and how to use it https://ilo.im/165du0

_____
#SEO #AI #LlmsTxt #RobotsTxt #SitemapXML #Content #Website #Development #WebDev #Frontend

#business #explainers #seo #ai #llmstxt #robotstxt

Inautilo @[email protected] · 2025-03-31 · 14:05 UTC

#Business #Introductions
Meet LLMs.txt · A proposed standard for AI website content crawling https://ilo.im/16318s

_____
#SEO #GEO #AI #Bots #Crawlers #LlmsTxt #RobotsTxt #Development #WebDev #Backend

#business #introductions #seo #geo #ai #bots

ResearchBuzz: Firehose @[email protected] · 2025-03-29 · 14:53 UTC

Search Engine Land: Meet LLMs.txt, a proposed standard for AI website content crawling. “While many content creators are interested in the proposal’s potential merits, it also has detractors. But given the rapidly changing landscape for content produced in a world of artificial intelligence, llms.txt is certainly worth discussing.”

https://rbfirehose.com/2025/03/29/search-engine-land-meet-llms-txt-a-proposed-standard-for-ai-website-content-crawling/

#ai #aitraining #aiassisted #llmstxt #robotstxt #trainingai

Winbuzzer @[email protected] · 2025-03-26 · 10:29 UTC

AI Crawlers Overwhelm Open-Source Projects, Forcing Developers to Block Entire Countries

#AI #Web #Robotstxt #AIScraping #OpenSource #Cybersecurity #DataScraping #Scraping #WebScraping

https://winbuzzer.com/2025/03/26/ai-crawlers-overwhelm-open-source-projects-forcing-developers-to-block-entire-countries-xcxwbn/

#ai #web #robotstxt #aiscraping #opensource #cybersecurity

Dr Pen @[email protected] · 2025-01-27 · 20:33 UTC

Protecting your blog from the dead-eyed #AI crawlers. You can experiment with specific robots.txt rules, and I also run a script in htaccess. I think there are metadata properties you can declare. None of this stops your pages from being crawled, but it may afford some legal protection. (See the recent German LAION case.) I'm doing a short blogpost on this, soon.

#robotstxt #aicrawlers #htaccess

#ai #robotstxt #aicrawlers #htaccess
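The server-side half of Dr Pen's approach (refusing requests by user agent, which the post does in htaccess) can be sketched in Python as a plain decision function. This is not Dr Pen's script: the `BLOCKED_TOKENS` list and both helper names are hypothetical, and the tokens are taken only from crawler names mentioned elsewhere in this feed. Like any user-agent check, it only stops crawlers that identify themselves honestly.

```python
# Illustrative user-agent tokens only; each is named elsewhere in this feed.
BLOCKED_TOKENS = ("gptbot", "claudebot", "meta-externalagent")


def should_block(user_agent: str) -> bool:
    """Return True when the User-Agent header contains a blocked token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


def status_for(user_agent: str) -> int:
    """Map a request's User-Agent to an HTTP status: 403 if blocked, else 200."""
    return 403 if should_block(user_agent) else 200


if __name__ == "__main__":
    print(status_for("Mozilla/5.0 (compatible; GPTBot/1.0)"))
    print(status_for("Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"))
```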

C. @[email protected] · 2024-10-21 · 14:53 UTC

Hey, #webmasters ... just so you know.

#Facebook's new-ish "meta-externalagent" #webcrawler, which they document is for stealing data for their Grand Theft Autocomplete (cough #AI cough), is ignoring robots.txt on my websites.

https://developers.facebook.com/docs/sharing/webmasters/web-crawlers

Is anyone surprised?

#Meta #LLM #scrape #web #copyright #RobotsTXT