home.social

#webcrawling — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #webcrawling, aggregated by home.social.

  1. FYI: OpenAI tripled its web crawl after GPT-5 - but ChatGPT users may be declining: New log file analysis of 7 billion OpenAI bot events reveals a 3.5x surge in OAI-SearchBot activity after GPT-5, while ChatGPT user-driven events dropped 28%. ppc.land/openai-tripled-its-we #OpenAI #GPT5 #ChatGPT #AItrends #webcrawling

  4. FYI: Google-Agent joins the crawler list as AI browsing gets an official identity: Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users. ppc.land/google-agent-joins-th #GoogleAgent #AIBrowsing #UserAgent #WebCrawling #ProjectMariner

  7. FYI: Googlebot is not a program - Google engineers finally explain what it really is: Google engineers reveal Googlebot is a misnomer for a central SaaS crawling platform serving dozens of products, with a 15 MB default file size limit and geo-crawling constraints. ppc.land/googlebot-is-not-a-pr #Googlebot #SEO #WebCrawling #DigitalMarketing #SaaS

  9. FYI: Google's secret crawl logic, finally explained in one page: Google published a new web crawling overview on March 3, 2026, detailing how Googlebot discovers, renders, and manages site access across 30+ years of web indexing. ppc.land/googles-secret-crawl- #Google #SEO #WebCrawling #Googlebot #DigitalMarketing

  12. Smart TVs are now running Bright SDK to silently crawl the web for AI training, using residential proxies to bypass Google policy. The move sparks a compliance backlash and raises privacy concerns for home devices. How will regulators respond, and what does this mean for open‑source AI? Dive into the details. #SmartTV #BrightSDK #WebCrawling #DeviceCompliance

    🔗 aidailypost.com/news/smart-tvs

  15. Ah, #wxpath, because using #XPath was just too easy before 🙄. Now with extra layers of #complexity, just in case you weren't already confused enough by web crawling! 🕸️😵‍💫
    github.com/rodricios/wxpath #webcrawling #technews #developerhumor #HackerNews #ngated

  16. Cloudflare's 2025 data reveals Google's structural advantage in AI training: Googlebot crawled 11.6% of web pages vs OpenAI's 3.6%. Publishers face an impossible choice - they can't block Google's AI crawling without losing search visibility entirely, since the same bot handles both functions. #AI #WebCrawling #DigitalRights

    implicator.ai/googles-quiet-co

  17. Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

    https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

  18. Released scrapy-contrib-bigexporter 1.0.0 (codeberg.org/ZuInnoTe/scrapy-c) - additional export formats for the webscraping framework Scrapy.

    Migrated parquet export from fastparquet to pyarrow as fastparquet is deprecated (docs.dask.org/en/stable/change)

    Migrated orc export from pyorc to pyarrow to reduce the number of dependencies

    #scrapy #crawling #python #parquet #orc #pyarrow #webcrawling #scraping

  19. Remember when 'robots.txt' was supposed to solve all our crawling problems? Online media brands are trying a new protocol to deter 'unwanted' AI crawlers. Because clearly, we need more digital fences. What's your bet on how long it takes for a savvy AI to find a workaround?

    Read more: cnet.com/tech/services-and-sof

    #AI #TechNews #WebCrawling #DigitalRights #Privacy

  20. Search Engine Land: Google fixes reduced crawling issue impacting some websites. “Google has confirmed it fixed an issue with its crawlers impacting ‘some sites.’ The issue was ‘reduced / fluctuating crawling’ from Google’s end with Googlebot. It is now resolved and Google said the crawling should pick back up in the near future.”

    https://rbfirehose.com/2025/08/31/search-engine-land-google-fixes-reduced-crawling-issue-impacting-some-websites/

  21. It looks like LLM-producing companies that are massively #crawling the #web require website owners to take action to opt out. Although I am not intrinsically against #generativeai and the acquisition of #opendata, reading about hobby projects facing hundreds of dollars in rising #cloud costs is quite concerning. How is it acceptable that hypergiants drive up the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? These are projects that run on a shoestring budget and are operated by volunteers who have dedicated hundreds of hours with no reward other than belief in their mission.

    I am mostly concerned about opt-out being the default. Are the owners of those projects really required to take action? As an #operator, is it my responsibility to methodically work through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for configuring a unique crawling specification in my robots.txt because hypergiants make it inherently hard to have generic #opt-out configurations that target LLM projects specifically?

    I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to prevent these crawlers from driving up their own operational costs.

    We require a new #opt-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable and potentially many-fold financial burden of serving that data to crawlers that arrive without notice. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?

    To counter the most common argument already: Yes, you can deny-by-default in your robots.txt, but that excludes any non-mainstream browser, too.

    Some concerning #news articles on the topic:

    #webcrawling #crawler #web #opensource
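
The deny-by-default trade-off raised in post 21 can be checked locally. The following is a minimal sketch, assuming Python's standard `urllib.robotparser`; the policy text, the choice of `Googlebot` as the single allow-listed crawler, the `example.org` URL, and the crawler name `SmallIndieCrawler` are illustrative assumptions and do not come from the post.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical deny-by-default policy: one allow-listed crawler, everyone else blocked.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""


def allowed(user_agent: str, url: str = "https://example.org/data/") -> bool:
    """Return True if ROBOTS_TXT permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # SmallIndieCrawler is a made-up name standing in for any crawler not on the allow list.
    for agent in ("Googlebot", "GPTBot", "SmallIndieCrawler"):
        print(agent, allowed(agent))
```

Under this policy only the allow-listed agent passes. The blanket `Disallow` also shuts out small, well-behaved crawlers that were never listed, and it only binds crawlers that choose to honor robots.txt in the first place.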

  22. Sites scramble to block ChatGPT web crawler after instructions emerge

    Without announcement, OpenAI re... - arstechnica.com/?p=1960108 #machinelearning #webscraming #webcrawling #aiethics #chatgpt #chatgtp #biz #gptbot #openai #tech #ai

  23. "Perplexity’s accusations aren’t exactly fair, either. One argument that Prince and Cloudflare used for calling out Perplexity’s methods was that OpenAI doesn’t behave in the same way.

    “OpenAI is an example of a leading AI company that follows these best practices,” Cloudflare wrote. “They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.”

    Web Bot Auth is a Cloudflare-supported standard being developed by the Internet Engineering Task Force that hopes to create a cryptographic method for identifying AI agent web requests.

    The debate comes as bot activity reshapes the internet. As TechCrunch has previously reported, bots seeking to scrape massive amounts of content to train AI models have become a menace, especially to smaller sites.

    For the first time in the internet’s history, bot activity is currently outstripping human activity online, with AI traffic accounting for over 50%, according to Imperva’s Bad Bot report released last month. Most of that activity is coming from LLMs. But the report also found that malicious bots now make up 37% of all internet traffic. That’s activity that includes everything from persistent scraping to unauthorized login attempts."

    techcrunch.com/2025/08/05/some

    #AI #GenerativeAI #AITraining #Perplexity #Cloudflare #AIAgents #WebCrawling #Chatbots #LLMs

  24. AI abuse

    Cloudflare accuses the AI provider #Perplexity of using undeclared crawlers to gain access to blocked websites.

    Despite robots.txt disallow rules and IP blocks, Perplexity is said to covertly read content using rotating user agents and IPs.

    That would be a violation of established web standards and a disregard for website preferences.

    blog.cloudflare.com/perplexity

    #WebCrawling #BotTraffic #Cloudflare #WebSecurity #PerplexityAI #Chatbots

  25. Ars Technica: Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries. “Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known […]

    https://rbfirehose.com/2025/03/26/ars-technica-open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

  26. Perishable Press: Ultimate Block List to Stop AI Bots. “The focus of this post is aimed at website owners who want to stop AI bots from crawling their web pages, as much as possible. To help people with this, I’ve been collecting data and researching AI bots for many months now, and have put together a ‘Mega Block List’ to help stop AI bots from devouring your content.”

    https://rbfirehose.com/2025/02/12/perishable-press-ultimate-block-list-to-stop-ai-bots/

  27. - 📊 Optional: Markov-generated nonsense content to distort data
    - 💻 Developed by programmer Aaron B. out of frustration with #Webcrawling practices
    - ⚠️ Challenges: Server load, scalability, effectiveness questioned
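
One way to picture "Markov-generated nonsense" is a first-order, word-level Markov chain. The sketch below is only an illustration under that assumption: the function names `build_chain` and `babble` and the tiny seed corpus are hypothetical, and this is not the generator Nepenthes actually ships.

```python
import random
from collections import defaultdict


def build_chain(text: str) -> dict[str, list[str]]:
    """Map each word to the list of words that follow it in the corpus."""
    words = text.split()
    chain: dict[str, list[str]] = defaultdict(list)
    for current, following in zip(words, words[1:]):
        chain[current].append(following)
    return chain


def babble(chain: dict[str, list[str]], length: int, seed: int = 0) -> str:
    """Produce `length` words (minimum one) by walking the chain.

    On a dead end (a word with no recorded follower), jump to a random known word.
    """
    rng = random.Random(seed)
    starts = sorted(chain)  # sorted so the same seed always gives the same text
    word = rng.choice(starts)
    out = [word]
    while len(out) < length:
        followers = chain.get(word)
        word = rng.choice(followers) if followers else rng.choice(starts)
        out.append(word)
    return " ".join(out)


if __name__ == "__main__":
    corpus = "the crawler follows the links and the links follow the crawler"
    print(babble(build_chain(corpus), length=12))
```

The output reuses only words from the corpus, in locally plausible order, which is cheap to generate but of no value as training text.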

  28. #Nepenthes: #Tool against #AI webcrawlers 🕷️

    Generates self-referencing links, extends load times. Goal: Trap crawlers in endless loop. Developer warns against casual use. #AI #Webcrawling #DataPrivacy

    🧵 ↓

    heise.de/en/news/Nepenthes-a-t

  29. Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]

    https://rbfirehose.com/2025/01/24/hackaday-trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/

  30. "A pseudonymous coder has created and released an open source “tar pit” to indefinitely trap AI training web crawlers in an infinitely, randomly-generating series of pages to waste their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources.

    “It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself,” Aaron B, the creator of Nepenthes, told 404 Media.

    “Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time,” they added. “But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop.”"

    404media.co/developer-creates-

    #AI #GenerativeAI #AITraining #WebCrawling #CyberSecurity
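
The loop Aaron B describes in post 30 can be reproduced with a toy model. This is a sketch under stated assumptions, not Nepenthes itself (which the Hackaday post describes as Lua-based): `tarpit_page`, `naive_crawl`, the `/maze/` path prefix, and the page budget are all hypothetical names and values chosen for illustration.

```python
import random
from collections import deque


def tarpit_page(path: str, links_per_page: int = 5) -> list[str]:
    """Return the links served for `path`; every link points back into the tarpit."""
    rng = random.Random(path)  # seed on the path so a given page is stable across visits
    return [f"/maze/{rng.getrandbits(64):016x}" for _ in range(links_per_page)]


def naive_crawl(start: str, page_budget: int) -> tuple[int, int]:
    """Breadth-first crawl with no trap detection; stops only when the budget runs out.

    Returns (pages fetched, links still waiting in the queue).
    """
    queue = deque([start])
    seen = {start}
    fetched = 0
    while queue and fetched < page_budget:
        path = queue.popleft()
        fetched += 1
        for link in tarpit_page(path):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched, len(queue)


if __name__ == "__main__":
    fetched, pending = naive_crawl("/maze/start", page_budget=1000)
    print(f"fetched={fetched} pending={pending}")
```

Each fetched page adds more unseen links than it consumes, so the queue never empties and the crawl ends only when the budget does. A crawler would need something like a per-host page cap to escape, which matches the quote's caveat about detecting the loop.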

  31. Wow! #TIL about #ArchiveBox, your #selfhosted #alternativeTo @internetarchive!

    Runs on #Python (OS-packaged or #docker-ed) and saves either single pages or whole website crawls in every format you could wish for:

    ✅ self-contained single-page HTML
    ✅ PDF
    ✅ PNG screenshot
    ✅ plaintext
    ✅ DOM-dump
    ✅ priv./publ. #archive
    ✅ media audio/video included (+yt-dlp)
    #WARC compat.

    🌐 archivebox.io
    📜 github.com/ArchiveBox/ArchiveB
    demo.archivebox.io

    #WebArchiving #WebCrawling #DigitalPreservation

  32. This week in #DigitalHistoryOFK, together with Annabel Walz (Friedrich-Ebert-Stiftung), we turn to the complex topic of web archiving. From the perspective of a memory institution, she will discuss the characteristics of #borndigital & #reborndigital sources as well as best practices for archiving them, which rely on #WebCrawling as a practice & #WARC as a storage format.

    🔜 Wed, Nov 29, 4-6 pm - via Zoom

    ℹ️ Info: dhistory.hypotheses.org/6411

    ___
    #DigitalHistory #WebArchive @histodons

  33. When they steal your data, even if you don't agree... 🚨 Perplexity and data theft?

    @eticadigitale

    New research by Robb Knight accuses the #Perplexity bot of failing to respect the directives in robots.txt files.
    What does this mean for #privacy and copyright on the web?

    👉 Forbes calls it a "cynical theft", and today we discuss it in depth with Guido Scorza in the new episode of #garantismi

    #webcrawling #AI

    youtu.be/Lkke7g3MQJg
