#webcrawlers — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #webcrawlers, aggregated by home.social.
-
FYI: Google rewrites Googlebot's rulebook: 2MB limits, IP moves, and what crawlers really are: Google today published two blog posts revealing Googlebot's true architecture as a shared SaaS platform, a 2MB fetch limit, and a new IP ranges directory path. https://ppc.land/google-rewrites-googlebots-rulebook-2mb-limits-ip-moves-and-what-crawlers-really-are/ #Googlebot #SEO #WebCrawlers #DigitalMarketing #SaaS
-
Facebook's Fascination with My Robots.txt
https://blog.nytsoi.net/2026/02/23/facebook-robots-txt
#HackerNews #Facebook #RobotsTxt #SocialMedia #TechNews #WebCrawlers
-
NiemanLab: News publishers limit Internet Archive access due to AI scraping concerns. “When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance […]
https://rbfirehose.com/2026/01/30/niemanlab-news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/
-
How I protect my Forgejo instance from AI web crawlers
https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html
#HackerNews #AIProtection #Forgejo #WebCrawlers #Cybersecurity #TechTips
-
Picnic on the Data Highway
This week I was surprised by an unusual wave of requests hitting my server. At first I thought I might have misconfigured something, but after a conversation with Uberspace support it was clear that my WordPress multisite setup, which hosts this site, Gefährliches Halbwissen, and Um' Pudding, was being bombarded and thereby overloaded. As an ordinary user of a shared hosting service there is little you can do about it, other than try to work out what exactly is happening and watch the site get taken apart. Funnily enough, it made me think of a book I read back in 2017.
https://niklasbarning.de/2025/12/02/picknick-an-der-datenautobahn/
-
Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”
-
Old man yells at clouds 🤖🌩️: A crusty diatribe against #AI web crawlers that somehow morphs into a shameless plug for server hosting services. Who knew bots had better things to do than read a geriatric infomercial? 🚫📡
https://www.mythic-beasts.com/blog/2025/04/01/abusive-ai-web-crawlers-get-off-my-lawn/ #webcrawlers #infomercial #serverhosting #HackerNews #ngated
-
ICYMI: Google updates crawler verification processes with daily IP range refreshes: Enhanced security measures help website owners identify legitimate Google web crawlers and protect against potential imposters. https://ppc.land/google-updates-crawler-verification-processes-with-daily-ip-range-refreshes/ #GoogleUpdates #WebCrawlers #SEO #CyberSecurity #DigitalMarketing
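With published IP ranges, verifying a claimed Googlebot visit reduces to a prefix check. A minimal Python sketch, assuming you have already fetched Google's crawler IP list (the JSON file contains CIDR prefixes; the sample ranges below are illustrative placeholders, not the authoritative list):

```python
import ipaddress

def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    # Version mismatches (IPv4 address vs IPv6 network) simply compare False.
    return any(addr in ipaddress.ip_network(c) for c in cidrs)

# Illustrative prefixes only -- in practice, load the daily-refreshed list
# that Google publishes instead of hardcoding values.
sample_ranges = ["66.249.64.0/27", "2001:4860:4801:10::/64"]
print(ip_in_ranges("66.249.64.5", sample_ranges))   # True
print(ip_in_ranges("203.0.113.9", sample_ranges))   # False
```

For full verification, Google also recommends a reverse-then-forward DNS lookup; the IP-range check is the cheaper first pass.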
-
Perishable Press: Ultimate Block List to Stop AI Bots. “The focus of this post is aimed at website owners who want to stop AI bots from crawling their web pages, as much as possible. To help people with this, I’ve been collecting data and researching AI bots for many months now, and have put together a ‘Mega Block List’ to help stop AI bots from devouring your content.”
https://rbfirehose.com/2025/02/12/perishable-press-ultimate-block-list-to-stop-ai-bots/
-
Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]
-
A GitHub-hosted project offers a curated robots.txt file designed to block known AI crawlers from accessing website content #AI #WebCrawlers #GitHub #airobots #DataPrivacy #LLMs #Devs #AITraining #RobotsTxt #DigitalRights #Copyright
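Curated lists like this typically take the form of robots.txt groups naming each AI crawler's user-agent token and disallowing everything. A hypothetical excerpt for illustration (GPTBot, CCBot, and ClaudeBot are real AI crawler tokens, but the project's actual list is far longer, and robots.txt compliance is voluntary on the crawler's side):

```txt
# Illustrative excerpt only -- the curated project covers many more bots
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /
```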
-
😤 #Scraperbots are automating data theft, extracting your website's content without permission! 🌐
💣 Learn about the impact of scraper bots and how to prevent them: https://bit.ly/3RiXgya
#contentscraping #bots #webscrapers #webcrawlers #scraping #waf #botmanagement #waap #scrapingbots #apptrana #indusface
-
What are your favorite / the best #WebCrawlers for broad / #WebScale #crawling?
I've built a list but am looking for anything I missed: https://github.com/davidshq/awesome-search-engines/blob/main/WebCrawlers.md
Main options I've found include #Apache #Nutch, #StormCrawler, #Scrapy, #Norconex, #PulsarR, #Heritrix, and #sparkler
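All of the crawlers listed share the same core loop: fetch a page, extract and normalize its links, enqueue them for the frontier. As a toy illustration of the link-extraction step only (Python stdlib; not representative of what Nutch or StormCrawler do at web scale):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags on a single page."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("https://example.com/docs/")
parser.feed('<a href="/about">About</a> <a href="page2.html">Next</a>')
print(parser.links)
# ['https://example.com/about', 'https://example.com/docs/page2.html']
```

Real broad crawlers layer politeness (robots.txt, per-host rate limits), URL deduplication, and distributed frontiers on top of this loop, which is exactly what the frameworks above provide.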