#webcrawlers — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #webcrawlers, aggregated by home.social.
-
FYI: Google rewrites Googlebot's rulebook: 2MB limits, IP moves, and what crawlers really are: Google today published two blog posts revealing Googlebot's true architecture as a shared SaaS platform, a 2MB fetch limit, and a new IP ranges directory path. https://ppc.land/google-rewrites-googlebots-rulebook-2mb-limits-ip-moves-and-what-crawlers-really-are/ #Googlebot #SEO #WebCrawlers #DigitalMarketing #SaaS
-
Facebook's Fascination with My Robots.txt
https://blog.nytsoi.net/2026/02/23/facebook-robots-txt
#HackerNews #Facebook #RobotsTxt #SocialMedia #TechNews #WebCrawlers
-
NiemanLab: News publishers limit Internet Archive access due to AI scraping concerns. “When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance […]
https://rbfirehose.com/2026/01/30/niemanlab-news-publishers-limit-internet-archive-access-due-to-ai-scraping-concerns/
-
How I protect my Forgejo instance from AI web crawlers
https://her.esy.fun/posts/0031-how-i-protect-my-forgejo-instance-from-ai-web-crawlers/index.html
#HackerNews #AIProtection #Forgejo #WebCrawlers #Cybersecurity #TechTips
-
Picnic on the Data Highway
This week I was surprised by an unusual wave of requests hitting my server. At first I thought I might have misconfigured something, but after a conversation with Uberspace support it was clear that my WordPress multisite setup, which hosts this site, Gefährliches Halbwissen, and Um' Pudding, was being bombarded and thereby overloaded. As an ordinary user of a shared hosting service there is little you can do about it, other than try to work out what exactly is happening and watch the site get taken apart. Funnily enough, it made me think of a book I read back in 2017.
https://niklasbarning.de/2025/12/02/picknick-an-der-datenautobahn/
-
Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”
-
Old man yells at clouds 🤖🌩️: A crusty diatribe against #AI web crawlers that somehow morphs into a shameless plug for server hosting services. Who knew bots had better things to do than read a geriatric infomercial? 🚫📡
https://www.mythic-beasts.com/blog/2025/04/01/abusive-ai-web-crawlers-get-off-my-lawn/ #webcrawlers #infomercial #serverhosting #HackerNews #ngated
-
ICYMI: Google updates crawler verification processes with daily IP range refreshes: Enhanced security measures help website owners identify legitimate Google web crawlers and protect against potential imposters. https://ppc.land/google-updates-crawler-verification-processes-with-daily-ip-range-refreshes/ #GoogleUpdates #WebCrawlers #SEO #CyberSecurity #DigitalMarketing
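With published IP ranges, verifying a claimed Googlebot visit reduces to a prefix check. A minimal Python sketch, assuming you have already fetched Google's crawler IP list (the JSON file contains CIDR prefixes; the sample ranges below are illustrative placeholders, not the authoritative list):

```python
import ipaddress

def ip_in_ranges(ip: str, cidrs: list[str]) -> bool:
    """Return True if `ip` falls inside any of the given CIDR prefixes."""
    addr = ipaddress.ip_address(ip)
    # Version mismatches (IPv4 address vs IPv6 network) simply compare False.
    return any(addr in ipaddress.ip_network(c) for c in cidrs)

# Illustrative prefixes only -- in practice, load the daily-refreshed list
# that Google publishes instead of hardcoding values.
sample_ranges = ["66.249.64.0/27", "2001:4860:4801:10::/64"]
print(ip_in_ranges("66.249.64.5", sample_ranges))   # True
print(ip_in_ranges("203.0.113.9", sample_ranges))   # False
```

For full verification, Google also recommends a reverse-then-forward DNS lookup; the IP-range check is the cheaper first pass.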
-
Perishable Press: Ultimate Block List to Stop AI Bots. “The focus of this post is aimed at website owners who want to stop AI bots from crawling their web pages, as much as possible. To help people with this, I’ve been collecting data and researching AI bots for many months now, and have put together a ‘Mega Block List’ to help stop AI bots from devouring your content.”
https://rbfirehose.com/2025/02/12/perishable-press-ultimate-block-list-to-stop-ai-bots/
-
Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]
-
A GitHub-hosted project offers a curated robots.txt file designed to block known AI crawlers from accessing website content #AI #WebCrawlers #GitHub #airobots #DataPrivacy #LLMs #Devs #AITraining #RobotsTxt #DigitalRights #Copyright
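Curated lists like this typically take the form of robots.txt groups naming each AI crawler's user-agent token and disallowing everything. A hypothetical excerpt for illustration (GPTBot, CCBot, and ClaudeBot are real AI crawler tokens, but the project's actual list is far longer, and robots.txt compliance is voluntary on the crawler's side):

```txt
# Illustrative excerpt only -- the curated project covers many more bots
User-agent: GPTBot
User-agent: CCBot
User-agent: ClaudeBot
Disallow: /
```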
-
😤 #Scraperbots are automating data theft, extracting your website's content without permission! 🌐
💣 Learn about the impact of scraper bots and how to prevent them: https://bit.ly/3RiXgya
#contentscraping #bots #webscrapers #webcrawlers #scraping #waf #botmanagement #waap #scrapingbots #apptrana #indusface
-
What are your favorite / the best #WebCrawlers for broad / #WebScale #crawling?
I've built a list but am looking for anything I missed: https://github.com/davidshq/awesome-search-engines/blob/main/WebCrawlers.md
Main options I've found include #Apache #Nutch, #StormCrawler, #Scrapy, #Norconex, #PulsarR, #Heritrix, and #sparkler
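All of the crawlers listed share the same core loop: fetch a page, extract and normalize its links, enqueue them for the frontier. As a toy illustration of the link-extraction step only (Python stdlib; not representative of what Nutch or StormCrawler do at web scale):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute URLs from <a href=...> tags on a single page."""
    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's base URL.
                    self.links.append(urljoin(self.base_url, value))

parser = LinkExtractor("https://example.com/docs/")
parser.feed('<a href="/about">About</a> <a href="page2.html">Next</a>')
print(parser.links)
# ['https://example.com/about', 'https://example.com/docs/page2.html']
```

Real broad crawlers layer politeness (robots.txt, per-host rate limits), URL deduplication, and distributed frontiers on top of this loop, which is exactly what the frameworks above provide.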