#webcrawling — Public Fediverse posts on home.social

PPC Land @[email protected] · 2026-04-27 · 21:18 UTC

FYI: OpenAI tripled its web crawl after GPT-5 - but ChatGPT users may be declining: New log file analysis of 7 billion OpenAI bot events reveals a 3.5x surge in OAI-SearchBot activity after GPT-5, while ChatGPT user-driven events dropped 28%. https://ppc.land/openai-tripled-its-web-crawl-after-gpt-5-but-chatgpt-users-may-be-declining/ #OpenAI #GPT5 #ChatGPT #AItrends #webcrawling

#openai #gpt5 #chatgpt #aitrends #webcrawling

PPC Land @[email protected] · 2026-03-26 · 15:03 UTC

FYI: Google-Agent joins the crawler list as AI browsing gets an official identity: Google on March 20 added Google-Agent to its user-triggered fetchers list, formalizing a new user agent for AI systems like Project Mariner that navigate the web on behalf of users. https://ppc.land/google-agent-joins-the-crawler-list-as-ai-browsing-gets-an-official-identity/ #GoogleAgent #AIBrowsing #UserAgent #WebCrawling #ProjectMariner

#googleagent #aibrowsing #useragent #webcrawling #projectmariner

PPC Land @[email protected] · 2026-03-15 · 21:36 UTC

FYI: Googlebot is not a program - Google engineers finally explain what it really is: Google engineers reveal Googlebot is a misnomer for a central SaaS crawling platform serving dozens of products, with a 15 MB default file size limit and geo-crawling constraints. https://ppc.land/googlebot-is-not-a-program-google-engineers-finally-explain-what-it-really-is/ #Googlebot #SEO #WebCrawling #DigitalMarketing #SaaS

#googlebot #seo #webcrawling #digitalmarketing #saas

PPC Land @[email protected] · 2026-03-12 · 21:34 UTC

Googlebot is not a program - Google engineers finally explain what it really is: Google engineers reveal Googlebot is a misnomer for a central SaaS crawling platform serving dozens of products, with a 15 MB default file size limit and geo-crawling constraints. https://ppc.land/googlebot-is-not-a-program-google-engineers-finally-explain-what-it-really-is/ #Googlebot #SEO #WebCrawling #SaaS #DigitalMarketing

#googlebot #seo #webcrawling #saas #digitalmarketing

PPC Land @[email protected] · 2026-03-11 · 06:26 UTC

FYI: Google's secret crawl logic, finally explained in one page: Google published a new web crawling overview on March 3, 2026, detailing how Googlebot discovers, renders, and manages site access across 30+ years of web indexing. https://ppc.land/googles-secret-crawl-logic-finally-explained-in-one-page/ #Google #SEO #WebCrawling #Googlebot #DigitalMarketing

#google #seo #webcrawling #googlebot #digitalmarketing

PPC Land @[email protected] · 2026-03-09 · 06:25 UTC

ICYMI: Google's secret crawl logic, finally explained in one page: Google published a new web crawling overview on March 3, 2026, detailing how Googlebot discovers, renders, and manages site access across 30+ years of web indexing. https://ppc.land/googles-secret-crawl-logic-finally-explained-in-one-page/ #Google #SEO #WebCrawling #Googlebot #DigitalMarketing

#google #seo #webcrawling #googlebot #digitalmarketing

PPC Land @[email protected] · 2026-03-08 · 06:24 UTC

Google's secret crawl logic, finally explained in one page: Google published a new web crawling overview on March 3, 2026, detailing how Googlebot discovers, renders, and manages site access across 30+ years of web indexing. https://ppc.land/googles-secret-crawl-logic-finally-explained-in-one-page/ #Google #SEO #WebCrawling #Googlebot #DigitalMarketing

#google #seo #webcrawling #googlebot #digitalmarketing

AI Daily Post @[email protected] · 2026-02-26 · 16:43 UTC

Smart TVs are now running Bright SDK to silently crawl the web for AI training, using residential proxies to bypass Google policy. The move sparks a compliance backlash and raises privacy concerns for home devices. How will regulators respond, and what does this mean for open‑source AI? Dive into the details. #SmartTV #BrightSDK #WebCrawling #DeviceCompliance

🔗 https://aidailypost.com/news/smart-tvs-using-bright-sdk-crawl-web-ai-amid-compliance-backlash

#smarttv #brightsdk #webcrawling #devicecompliance

N-gated Hacker News @[email protected] · 2026-01-20 · 18:25 UTC

Ah, #wxpath, because using #XPath was just too easy before 🙄. Now with extra layers of #complexity, just in case you weren't already confused enough by web crawling! 🕸️😵‍💫
https://github.com/rodricios/wxpath #webcrawling #technews #developerhumor #HackerNews #ngated

#wxpath #xpath #complexity #webcrawling #technews #developerhumor

Hacker News @[email protected] · 2026-01-20 · 18:25 UTC

wxpath – Declarative web crawling in XPath

https://github.com/rodricios/wxpath

#HackerNews #wxpath #webcrawling #XPath #technology #open-source #GitHub

#hackernews #wxpath #webcrawling #xpath #technology #open

Techdirt [Unofficial] @[email protected] · 2025-12-24 · 19:05 UTC

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

https://fed.brid.gy/r/https://www.techdirt.com/2025/12/24/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/

#google #reddit #serpapi #anticircumvention #circumvention #copyright

Techdirt [Unofficial] @[email protected] · 2025-12-24 · 19:05 UTC

Google Built Its Empire Scraping The Web. Now It’s Suing To Stop Others From Scraping Google

https://fed.brid.gy/r/https://www.techdirt.com/2025/12/24/google-built-its-empire-scraping-the-web-now-its-suing-to-stop-others-from-scraping-google/

#webcrawling #robotstxt #openweb #licensing #dmca1201 #copyright

Marcus Schuler @[email protected] · 2025-12-15 · 22:00 UTC

Cloudflare's 2025 data reveals Google's structural advantage in AI training: Googlebot crawled 11.6% of web pages vs OpenAI's 3.6%. Publishers face an impossible choice - they can't block Google's AI crawling without losing search visibility entirely, since the same bot handles both functions. #AI #WebCrawling #DigitalRights

https://www.implicator.ai/googles-quiet-conquest-what-cloudflares-data-actually-reveals-about-ais-power-grab/

#ai #webcrawling #digitalrights

ResearchBuzz: Firehose @[email protected] · 2025-12-01 · 14:19 UTC

Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

#bots #openai #webcrawlers #webcrawling #webscraping

Jörn Franke @[email protected] · 2025-11-01 · 16:38 UTC

Released scrapy-contrib-bigexporter 1.0.0 (https://codeberg.org/ZuInnoTe/scrapy-contrib-bigexporters) - additional export formats for the webscraping framework Scrapy.

Migrated parquet export from fastparquet to pyarrow as fastparquet is deprecated (https://docs.dask.org/en/stable/changelog.html#fastparquet-engine-deprecated)

Migrated orc export from pyorc to pyarrow to reduce the number of dependencies

#scrapy #crawling #python #parquet #orc #pyarrow #webcrawling #scraping

#scrapy #crawling #python #parquet #orc #pyarrow

Mind Lude @[email protected] · 2025-09-11 · 22:50 UTC

Remember when 'robots.txt' was supposed to solve all our crawling problems? Online media brands are trying a new protocol to deter 'unwanted' AI crawlers. Because clearly, we need more digital fences. What's your bet on how long it takes for a savvy AI to find a workaround?

#AI #TechNews #WebCrawling #DigitalRights #Privacy

#ai #technews #webcrawling #digitalrights #privacy

ResearchBuzz: Firehose @[email protected] · 2025-08-31 · 17:47 UTC

Search Engine Land: Google fixes reduced crawling issue impacting some websites. “Google has confirmed it fixed an issue with its crawlers impacting ‘some sites.’ The issue was ‘reduced / fluctuating crawling’ from Google’s end with Googlebot. It is now resolved and Google said the crawling should pick back up in the near future.”

https://rbfirehose.com/2025/08/31/search-engine-land-google-fixes-reduced-crawling-issue-impacting-some-websites/

#google #searchengines #seo #serp #webcrawling #websearch

Hacker News @[email protected] · 2025-08-18 · 17:34 UTC

Robots.txt Is a Suicide Note

https://wiki.archiveteam.org/index.php/Robots.txt

#HackerNews #RobotsTxt #SuicideNote #WebCrawling #InternetArchive #TechEthics

#hackernews #robotstxt #suicidenote #webcrawling #internetarchive #techethics

Miguel Afonso Caetano @[email protected] · 2025-08-06 · 21:39 UTC

"Perplexity’s accusations aren’t exactly fair, either. One argument that Prince and Cloudflare used for calling out Perplexity’s methods was that OpenAI doesn’t behave in the same way.

“OpenAI is an example of a leading AI company that follows these best practices,” Cloudflare wrote. “They respect robots.txt and do not try to evade either a robots.txt directive or a network level block. And ChatGPT Agent is signing http requests using the newly proposed open standard Web Bot Auth.”

Web Bot Auth is a Cloudflare-supported standard being developed by the Internet Engineering Task Force that hopes to create a cryptographic method for identifying AI agent web requests.

The debate comes as bot activity reshapes the internet. As TechCrunch has previously reported, bots seeking to scrape massive amounts of content to train AI models have become a menace, especially to smaller sites.

For the first time in the internet’s history, bot activity is currently outstripping human activity online, with AI traffic accounting for over 50%, according to Imperva’s Bad Bot report released last month. Most of that activity is coming from LLMs. But the report also found that malicious bots now make up 37% of all internet traffic. That’s activity that includes everything from persistent scraping to unauthorized login attempts."

https://techcrunch.com/2025/08/05/some-people-are-defending-perplexity-after-cloudflare-named-and-shamed-it/

#AI #GenerativeAI #AITraining #Perplexity #Cloudflare #AIAgents #WebCrawling #Chatbots #LLMs

#ai #generativeai #aitraining #perplexity #cloudflare #aiagents

Winbuzzer @[email protected] · 2025-08-06 · 13:56 UTC

Perplexity Fires Back at Cloudflare, Denying ‘Stealth Crawler’ Accusations

#AI #Cloudflare #Perplexity #WebCrawling #AIethics #DataScraping #SearchEngines #Web #AISearch

https://winbuzzer.com/2025/08/06/perplexity-fires-back-at-cloudflare-denying-stealth-crawler-accusations-xcxwbn

#ai #cloudflare #perplexity #webcrawling #aiethics #datascraping

Tino Eberl @[email protected] · 2025-08-06 · 08:10 UTC

KIMissbrauch (AI misuse)

Cloudflare accuses the AI provider #Perplexity of using undeclared crawlers to gain access to blocked websites.

Despite robots.txt disallow rules and IP blocks, Perplexity is said to covertly read content using rotating user agents and IPs.

That would be a violation of established web standards and a disregard for website preferences.

https://blog.cloudflare.com/perplexity-is-using-stealth-undeclared-crawlers-to-evade-website-no-crawl-directives/

#WebCrawling #BotTraffic #Cloudflare #WebSecurity #PerplexityAI #Chatbots

#perplexity #webcrawling #bottraffic #cloudflare #websecurity #perplexityai

Hacker News @[email protected] · 2025-03-27 · 13:24 UTC

Crawl Order and Disorder

https://www.marginalia.nu/log/a_117_crawl_order/

#HackerNews #Crawl #Order #Disorder #WebCrawling #TechInsights #Marginalia

#hackernews #crawl #order #disorder #webcrawling #techinsights

ResearchBuzz: Firehose @[email protected] · 2025-03-26 · 09:51 UTC

Ars Technica: Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries. “Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known […]

https://rbfirehose.com/2025/03/26/ars-technica-open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

#ai #aitraining #aiassisted #software #softwaredevelopers #trainingai

ResearchBuzz: Firehose @[email protected] · 2025-02-12 · 16:08 UTC

Perishable Press: Ultimate Block List to Stop AI Bots. “The focus of this post is aimed at website owners who want to stop AI bots from crawling their web pages, as much as possible. To help people with this, I’ve been collecting data and researching AI bots for many months now, and have put together a ‘Mega Block List’ to help stop AI bots from devouring your content.”

https://rbfirehose.com/2025/02/12/perishable-press-ultimate-block-list-to-stop-ai-bots/

#ai #aitraining #botblocking #bots #trainingai #webcrawlers

michabbb @[email protected] · 2025-01-25 · 11:21 UTC

- 📊 Optional: Markov-generated nonsense content to distort data
- 💻 Developed by programmer Aaron B. out of frustration with #Webcrawling practices
- ⚠️ Challenges: Server load, scalability, effectiveness questioned

#webcrawling

michabbb @[email protected] · 2025-01-25 · 11:21 UTC

#Nepenthes: #Tool against #AI webcrawlers 🕷️

Generates self-referencing links, extends load times. Goal: Trap crawlers in endless loop. Developer warns against casual use. #AI #Webcrawling #DataPrivacy

🧵 ↓

https://www.heise.de/en/news/Nepenthes-a-tarpit-for-AI-web-crawlers-10256257.html

#nepenthes #tool #ai #webcrawling #dataprivacy

ResearchBuzz: Firehose @[email protected] · 2025-01-24 · 14:28 UTC

Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]

https://rbfirehose.com/2025/01/24/hackaday-trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/

#ai #aitraining #foilingai #honeytraps #trainingai #webcrawlers

ResearchBuzz: Firehose @[email protected] · 2025-01-24 · 14:28 UTC

Hackaday: Trap Naughty Web Crawlers In Digestive Juices With Nepenthes. “More commonly known as ‘pitcher plants’, nepenthes is a genus of carnivorous plants that use a fluid-filled cup to trap insects and small critters unfortunate enough to slip & slide down into it. In the case of this Lua-based project the idea is roughly the same. Configured as a trap behind a web server (e.g. […]

https://rbfirehose.com/2025/01/24/hackaday-trap-naughty-web-crawlers-in-digestive-juices-with-nepenthes/

#webcrawling #webcrawlers #trainingai #honeytraps #foilingai #aitraining

Miguel Afonso Caetano @[email protected] · 2025-01-24 · 13:43 UTC

"A pseudonymous coder has created and released an open source “tar pit” to indefinitely trap AI training web crawlers in an infinitely, randomly-generating series of pages to waste their time and computing power. The program, called Nepenthes after the genus of carnivorous pitcher plants which trap and consume their prey, can be deployed by webpage owners to protect their own content from being scraped or can be deployed “offensively” as a honeypot trap to waste AI companies’ resources.

“It's less like flypaper and more an infinite maze holding a minotaur, except the crawler is the minotaur that cannot get out. The typical web crawler doesn't appear to have a lot of logic. It downloads a URL, and if it sees links to other URLs, it downloads those too. Nepenthes generates random links that always point back to itself - the crawler downloads those new links. Nepenthes happily just returns more and more lists of links pointing back to itself,” Aaron B, the creator of Nepenthes, told 404 Media.

“Of course, these crawlers are massively scaled, and are downloading links from large swathes of the internet at any given time,” they added. “But they are still consuming resources, spinning around doing nothing helpful, unless they find a way to detect that they are stuck in this loop.”"

https://www.404media.co/developer-creates-infinite-maze-to-trap-ai-crawlers-in/

#AI #GenerativeAI #AITraining #WebCrawling #CyberSecurity

#ai #generativeai #aitraining #webcrawling #cybersecurity
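
The loop Aaron B describes in the quote above can be illustrated with a toy sketch. This is not Nepenthes's code (the Hackaday summary in this feed calls it a Lua-based project); the `/maze` prefix, the function names, and the naive breadth-first crawler are all hypothetical, chosen only to show why a crawler with no loop detection never runs out of links.

```python
import hashlib
import random

MAZE_PREFIX = "/maze"  # hypothetical URL prefix under which the trap is mounted


def maze_links(path: str, count: int = 5) -> list[str]:
    """Return `count` pseudo-random links that all stay inside the maze."""
    # Seed from the requested path so the same URL always yields the same page.
    digest = hashlib.sha256(path.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [f"{MAZE_PREFIX}/{rng.getrandbits(48):012x}" for _ in range(count)]


def maze_page(path: str) -> str:
    """Render a minimal HTML page whose only links lead deeper into the maze."""
    items = "\n".join(
        f'<li><a href="{link}">{link}</a></li>' for link in maze_links(path)
    )
    return f"<html><body><ul>\n{items}\n</ul></body></html>"


def simulate_naive_crawler(start: str, max_fetches: int) -> int:
    """Follow every discovered link breadth-first; return the number of fetches made."""
    queue, seen, fetched = [start], {start}, 0
    while queue and fetched < max_fetches:
        current = queue.pop(0)
        fetched += 1
        for link in maze_links(current):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return fetched
```

In this sketch the crawler only stops because of the `max_fetches` budget: every page it downloads adds more unseen maze URLs to its queue than it removes.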

The Fulcrum⚒️ @[email protected] · 2025-01-23 · 16:40 UTC

Developer Creates Infinite Maze That Traps AI Training Bots. #AI #WebCrawling
https://www.404media.co/email/7a39d947-4a4a-42bc-bbcf-3379f112c999/

#ai #webcrawling

Max Resing @[email protected] · 2025-01-11 · 12:52 UTC

It looks like LLM-producing companies that are massively #crawling the #web require the owners of a website to take action to opt out. Although I am not intrinsically against #generativeai and the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it accepted that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who have dedicated hundreds of hours without any reward other than believing in their mission?

I am mostly concerned about opt-out being the default. Are the owners of those projects required to take action? Seriously? As an #operator, is it my responsibility to methodically work through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for configuring a unique crawling specification in my robots.txt because hypergiants make it inherently hard to have generic #opt-out configurations that tackle LLM projects specifically?

I refuse to accept that this is our new norm: a norm in which hypergiants are not only methodically exploiting the work of thousands of individuals for their own benefit without returning a penny, but also one in which the resource owner is required to prevent these crawlers from skyrocketing their own operational costs.

We require a new #opt-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying the unpredictable, multitudinous financial burden of sharing the data without notice from said crawlers. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?

To counter the most common argument already: Yes, you can deny-by-default in your robots.txt, but that excludes any non-mainstream browser, too.

Some concerning #news articles on the topic:

#webcrawling #crawler #web #opensource

#crawling #web #generativeai #opendata #cloud #operator
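
The per-crawler opt-out described in the post above can be sketched with Python's standard `urllib.robotparser`. This is a minimal illustration, not a recommended block list: the crawler tokens `GPTBot` and `CCBot`, the helper names, and the `example.org` URL are assumptions made for the example, and robots.txt only affects crawlers that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# Illustrative crawler tokens only (an assumption for this sketch);
# each vendor documents its own, and the list changes over time.
AI_CRAWLER_TOKENS = ["GPTBot", "CCBot"]


def build_robots_txt(blocked_agents):
    """Build a robots.txt that disallows the named agents and allows everyone else."""
    groups = [f"User-agent: {agent}\nDisallow: /\n" for agent in blocked_agents]
    groups.append("User-agent: *\nAllow: /\n")
    # Joining with a newline leaves a blank line between groups.
    return "\n".join(groups)


def is_allowed(robots_txt, user_agent, url):
    """Check whether `user_agent` may fetch `url` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    rules = build_robots_txt(AI_CRAWLER_TOKENS)
    print(rules)
    for agent in ["GPTBot", "CCBot", "SomeOtherClient"]:
        print(agent, is_allowed(rules, agent, "https://example.org/data"))
```

The sketch makes the post's complaint concrete: every crawler that should be excluded needs its own group, so the operator has to know each token in advance.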

The Privacy Post @[email protected] · 2024-06-25 · 21:36 UTC

When they steal your data, even if you don't agree... 🚨 Perplexity and data theft?

@eticadigitale

New research by Robb Knight accuses #Perplexity Bot of failing to respect the directives in robots.txt files.
What does this mean for #privacy and copyright on the web?

👉 Forbes calls it a "cynical theft", and today we discuss it in depth with Guido Scorza in the new episode of #garantismi

#webcrawling #AI

youtu.be/Lkke7g3MQJg

#privacy #ai #garantismi #perplexity #webcrawling

me·ta·phil, der @[email protected] · 2024-01-08 · 19:41 UTC

Wow! #TIL about #ArchiveBox, your #selfhosted #alternativeTo @internetarchive!

Runs on #Python (OS-packaged or #docker‬ed) and saves both single pages or whole website crawls in every format you could wish for:

✅ self-contained single-page HTML
✅ PDF
✅ PNG screenshot
✅ plaintext
✅ DOM-dump
✅ priv./publ. #archive
✅ media audio/video included (+yt-dlp)
✅ #WARC compat.

🌐 https://archivebox.io
📜 https://github.com/ArchiveBox/ArchiveBox
▶ https://demo.archivebox.io

#WebArchiving #WebCrawling #DigitalPreservation

#til #archivebox #selfhosted #alternativeto #python #docker

Digital History Berlin @[email protected] · 2023-11-28 · 08:40 UTC

This week in the #DigitalHistoryOFK, together with Annabel Walz (Friedrich-Ebert-Stiftung), we turn to the complex topic of web archiving. From the perspective of memory institutions, she will discuss the characteristics of #borndigital & #reborndigital sources as well as best practices for archiving them, which rely on #WebCrawling as a practice & #WARC as a storage format.

🔜 Wed, Nov 29, 4-6 pm - via Zoom

ℹ️ Info: https://dhistory.hypotheses.org/6411

___
#DigitalHistory #WebArchive @histodons

#digitalhistoryofk #borndigital #reborndigital #webcrawling #warc #digitalhistory
