#web-scraping — Public Fediverse posts on home.social

Civic Data Lab @[email protected] · 2026-07-29 · 11:20 UTC

☕ How much does an organization actually know about itself?

At the latest #CivicDataLab Espresso Talk, volunteer data scientists from Data Science for Social Good (@dssgberlin Berlin) scraped the websites of AWO Bundesverband e.V. and found considerably more facilities than its own database suggested.

What is behind the method, and how a team of 15 volunteers built a database of almost 21,000 entries, is covered in the blog post and talk video:

👉 https://civic-data.de/blog/wenn-daten-altern-was-webscraping-fuer-die-awo-leisten-kann/

The code is open on #GitHub; reuse is explicitly welcome.

#DataForGood #Webscraping #NonProfit #Digitalisierung

#digitalisierung #nonprofit #webscraping #dataforgood #github #civicdatalab

Hackaday [Unofficial] @[email protected] · 2026-07-26 · 05:00 UTC

How Film Industry Data Website The-Numbers.com got Mauled by Bots

https://fed.brid.gy/r/https://hackaday.com/2026/07/25/how-film-industry-data-website-the-numbers-com-got-mauled-by-bots/

#artificialintelligence #news #webcrawler #webscraping

ResearchBuzz: Firehose @[email protected] · 2026-07-24 · 13:00 UTC

Stephen Follows: What just happened to TheNumbers.com should worry us all. “Its hand-researched data is the highest quality, tracking box office grosses, budgets, home video and streaming across more than 78,000 films and 236,000 people. It gets north of eight million visitors a year, and is treated as THE definitive authority by journalists, academics, filmmakers, prediction markets, and even […]

https://rbfirehose.com/2026/07/24/stephen-follows-what-just-happened-to-thenumbers-com-should-worry-us-all/

#agenticai #ai #aiagents #aiassisted #contentremoval #crassinsensitivityandassholery

ResearchBuzz: Firehose @[email protected] · 2026-07-23 · 14:33 UTC

Reuters: News Corp countersues Brave for allegedly ‘scraping’ articles for AI. “News Corp, facing a lawsuit by search engine Brave Software, has filed a countersuit accusing it of “flagrant theft” in distributing and selling versions of articles from the Wall Street Journal and New York Post to AI companies.”

https://rbfirehose.com/2026/07/23/reuters-news-corp-countersues-brave-for-allegedly-scraping-articles-for-ai/

#ai #aitraining #bravebrowser #countersuits #law #lawsuits

OmbuLabs.ai @[email protected] · 2026-07-22 · 19:59 UTC

Turns out you can't just ask an LLM for CSS selectors and ship them. In our scraping system, first-attempt selectors returned nothing 30 to 40% of the time. The trick that made it work: check JSON-LD first, then run every generated selector through a validation loop against the real DOM before trusting it. https://go.upgradejs.com/qru #LLM #WebScraping #AI

#llm #webscraping #ai
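
The order of operations described in the post above (structured data first, generated selectors only after they have been checked against the real page) might be sketched roughly as follows. This is an illustration under stated assumptions, not OmbuLabs' implementation: `extract_field`, `toy_select`, and `_PageIndex` are hypothetical names, the "selector returns at least one non-empty match" check is an assumed validation rule, and the tiny selector matcher only stands in for a real CSS engine.

```python
import json
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class _PageIndex(HTMLParser):
    """Hypothetical helper: collects JSON-LD payloads and a flat element list."""

    def __init__(self):
        super().__init__()
        self.json_ld = []    # parsed JSON-LD objects found on the page
        self.elements = []   # one [tag, attrs, text] entry per start tag
        self._stack = []     # currently open elements, innermost last
        self._in_json_ld = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "script" and attrs.get("type") == "application/ld+json":
            self._in_json_ld, self._buf = True, []
            return
        element = [tag, attrs, ""]
        self.elements.append(element)
        if tag not in VOID_TAGS:
            self._stack.append(element)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            try:
                self.json_ld.append(json.loads("".join(self._buf)))
            except json.JSONDecodeError:
                pass  # malformed JSON-LD is skipped, not fatal
            return
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if self._in_json_ld:
            self._buf.append(data)
        else:
            for element in self._stack:  # text counts for every open ancestor
                element[2] += data


def toy_select(elements, selector):
    """Toy matcher (assumption): only 'tag', '.class', '#id', 'tag.class', 'tag#id'."""
    want_id = want_class = None
    if "#" in selector:
        selector, want_id = selector.split("#", 1)
    elif "." in selector:
        selector, want_class = selector.split(".", 1)
    hits = []
    for tag, attrs, text in elements:
        if selector and tag != selector:
            continue
        if want_id is not None and attrs.get("id") != want_id:
            continue
        classes = (attrs.get("class") or "").split()
        if want_class is not None and want_class not in classes:
            continue
        if text.strip():
            hits.append(text.strip())
    return hits


def extract_field(html, json_ld_key, candidate_selectors):
    """Return (value, source): JSON-LD first, then the first selector that validates."""
    page = _PageIndex()
    page.feed(html)
    for obj in page.json_ld:
        if isinstance(obj, dict) and obj.get(json_ld_key):
            return obj[json_ld_key], "json-ld"
    for selector in candidate_selectors:  # e.g. selectors proposed by an LLM
        hits = toy_select(page.elements, selector)
        if hits:  # validated against the real DOM: it actually matched something
            return hits[0], selector
    return None, None
```

In a real pipeline the candidate list would come from the model and the matcher would be a full CSS selector engine; the point of the sketch is only that no selector is trusted until it has matched something on the actual page.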

ResearchBuzz: Firehose @[email protected] · 2026-07-22 · 09:22 UTC

MediaPost: Judge Dismisses Google Complaint Against SerpApi Over Scraping. “A federal judge has dismissed Google’s complaint against the Texas-based company SerpApi, which allegedly circumvented attempts to prevent it from scraping search results. The ruling, issued Monday by U.S. District Court Judge Yvonne Gonzalez Rogers, allows Google to amend its complaint and bring it again.”

https://rbfirehose.com/2026/07/22/mediapost-judge-dismisses-google-complaint-against-serpapi-over-scraping/

#google #law #lawsuits #legal #searchengines #serpapi

Self-Hosted Feed @[email protected] · 2026-07-20 · 19:41 UTC

🕷️ Anakin-Inc/anakin

Converts websites to clean markdown or JSON with fallback scraping handlers and proxy auto-selection

⭐ Stars: 687
📅 Last Update: Jul 20, 2026

https://github.com/Anakin-Inc/anakin

#selfhosted #homelab #selfhost #selfhosting #opensource #webscraping #api

#selfhosted #homelab #selfhost #selfhosting #opensource #webscraping

ResearchBuzz: Firehose @[email protected] · 2026-07-18 · 08:59 UTC

Mixfont: Decoy Font. “Decoy font is a font that prints a decoy for every letter, making it more difficult for AI to read what you type. The font works by using separate spatial frequencies to communicate two different letters in the same space.”

https://rbfirehose.com/2026/07/18/mixfont-decoy-font/

#ai #foilingai #fontography #fonts #webscraping

Hacker News @[email protected] · 2026-07-09 · 15:34 UTC

Launch HN: Context.dev (YC S26) – API to get structured data from any website

https://www.context.dev

Comments: https://news.ycombinator.com/item?id=48847562

#HackerNews #LaunchHN #ContextDev #API #WebScraping #YC #S26

#hackernews #launchhn #contextdev #api #webscraping #yc

ResearchBuzz: Firehose @[email protected] · 2026-07-06 · 14:33 UTC

Engadget: Cloudflare will filter out web crawlers that serve AI companies. “Cloudflare has announced plans to automatically block mixed-use web crawlers that index websites for search engines and act as AI agents and trainers at the same time. The company previously offered its customers the optional ability to prevent crawlers from scraping their sites for AI chatbots, but now Cloudflare’s […]

https://rbfirehose.com/2026/07/06/engadget-cloudflare-will-filter-out-web-crawlers-that-serve-ai-companies/

#agenticai #ai #aiagents #aitraining #aiassisted #cloudflare

Arint - SEO+KI @[email protected] · 2026-07-03 · 10:06 UTC

RT @NousResearch: The Hermes Agent now reads web content up to 60 times faster and at up to 49 times lower cost. Scraping backends hand clean content directly to the agent, without redundant processing steps; large pages are stored locally and retrieved page by page as needed, so you get the same quality at a fraction of the time and cost. Video

More at Arint.info

#Effizienz #HermesAgent #Kostensenkung #Video #WebScraping #arint_info

https://x.com/NousResearch/status/2071974594961977727#m

#effizienz #hermesagent #kostensenkung #video #webscraping #arint_info
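
The "store large pages locally and retrieve them page by page" idea from the post above could look something like the sketch below. It is an illustration only and makes no claim about how Hermes Agent does it: `PagedStore`, its methods, the character-based page size, and the hashed file names are all assumptions chosen for the example.

```python
import hashlib
import tempfile
from pathlib import Path


class PagedStore:
    """Hypothetical local cache that hands back scraped text one slice at a time."""

    def __init__(self, root=None, page_chars=4000):
        # Assumption: pages are measured in characters, not tokens.
        self.root = Path(root or tempfile.mkdtemp(prefix="paged-store-"))
        self.page_chars = page_chars

    def put(self, url, text):
        """Save the full text once; return a key and the number of pages."""
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()[:16]
        (self.root / f"{key}.txt").write_text(text, encoding="utf-8")
        n_pages = max(1, -(-len(text) // self.page_chars))  # ceiling division
        return key, n_pages

    def get_page(self, key, page):
        """Return only the requested slice, so the caller never loads it all."""
        text = (self.root / f"{key}.txt").read_text(encoding="utf-8")
        start = page * self.page_chars
        return text[start:start + self.page_chars]
```

An agent using something like this would receive the key and page count up front and request further pages only when the first one does not answer its question.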

HackerNoon @[email protected] · 2026-07-03 · 04:51 UTC

I tested every way to scrape Amazon in 2026 — plain requests, Selenium, Playwright, free proxies, paid proxies. https://hackernoon.com/i-tried-every-way-to-scrape-amazon-in-2026-here-is-what-actually-works #webscraping

#webscraping

Arint - SEO+KI @[email protected] · 2026-07-01 · 10:01 UTC

RT @NousResearch: The Hermes Agent now reads the web up to 60 times faster and 49 times cheaper. Scraping backends hand clean content directly to the agent with no redundant processing steps; large pages are stored locally and fetched on demand, so you get the same quality at a fraction of the time and cost. Video

More at Arint.info

#HermesAgent #Kosteneffizienz #Performance #Technologie #WebScraping #arint_info

https://x.com/NousResearch/status/2071974594961977727#m

#hermesagent #kosteneffizienz #performance #technologie #webscraping #arint_info

Zyte @[email protected] · 2026-06-27 · 06:24 UTC

Why configure your AI Model Harness? #webscraping #podcast #ai #programming https://www.youtube.com/watch?v=xdVcK-XfxLo?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

#webscraping #podcast #ai #programming #webdata #data

Zyte @[email protected] · 2026-06-27 · 06:22 UTC

Keep your context window clean using these #webscraping #ai #podcast https://www.youtube.com/watch?v=wNbLzW1huZM?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

#webscraping #ai #podcast #webdata #data #web

ResearchBuzz: Firehose @[email protected] · 2026-06-25 · 15:25 UTC

New Jersey Globe: Nearly 400 local newspapers sue OpenAI, Microsoft over alleged copyright theft. “The massive coalition of local newspaper publishers filed a federal lawsuit today against OpenAI and Microsoft, alleging the technology companies systematically copied copyrighted reporting from nearly 400 local newspapers to train and develop commercial artificial intelligence products, including […]

https://rbfirehose.com/2026/06/25/new-jersey-globe-nearly-400-local-newspapers-sue-openai-microsoft-over-alleged-copyright-theft/

#ai #aitraining #aiassisted #copyright #intellectualproperty #journalism

ResearchBuzz: Firehose @[email protected] · 2026-06-11 · 15:44 UTC

Search Engine Journal: US Publishers Demand Common Crawl Stop Scraping Their Content. “Digital Content Next, a trade body representing US digital publishers, has sent a cease and desist letter to the Common Crawl Foundation. The letter demands Common Crawl stop collecting publisher content and remove material already in its datasets.”

https://rbfirehose.com/2026/06/11/search-engine-journal-us-publishers-demand-common-crawl-stop-scraping-their-content/

#ceaseanddesist #commoncrawl #digitalcontentnext #journalism #media #searchengines

HackerNoon @[email protected] · 2026-06-10 · 01:39 UTC

Web scraping gets blocked by weak headers, broken sessions, poor IP reputation, fast requests, and careless proxy rotation. https://hackernoon.com/why-scrapers-fail-headers-sessions-ip-reputation-and-request-patterns #webscraping

#webscraping
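
Three of the failure modes named in the post above (weak headers, broken sessions, overly fast requests) can be illustrated with the standard library alone. The sketch below is not taken from the linked article; `make_session`, `next_delay`, `fetch_all`, the header values, and the two-second base delay are assumptions picked for the example, and it does not address IP reputation or proxies at all.

```python
import random
import time
import urllib.request
from http.cookiejar import CookieJar

# Assumed example values: an identifiable User-Agent plus common Accept headers.
DEFAULT_HEADERS = {
    "User-Agent": "example-research-bot/0.1 (+https://example.org/bot-info)",
    "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.8",
}


def make_session(headers=DEFAULT_HEADERS):
    """Build one opener that keeps cookies, so the session survives across requests."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = list(headers.items())  # replaces the default Python-urllib UA
    return opener, jar


def next_delay(base_seconds=2.0, jitter=0.5, rng=random):
    """Base delay with +/- jitter so requests do not arrive on a fixed beat."""
    return base_seconds * (1 + rng.uniform(-jitter, jitter))


def fetch_all(urls, opener, base_seconds=2.0, sleep=time.sleep):
    """Fetch URLs sequentially through the same session, pausing between requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i:  # no pause before the very first request
            sleep(next_delay(base_seconds))
        with opener.open(url, timeout=30) as response:
            pages[url] = response.read()
    return pages
```

Reusing a single opener is what keeps cookies and headers consistent between requests; the `sleep` parameter is injectable mainly so the pacing can be exercised without actually waiting.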

Zyte @[email protected] · 2026-06-09 · 17:25 UTC

No-one likes an out-of-touch AI assistant. Fortunately, rapid refreshing can keep AI models aware of the very latest public information. https://www.zyte.com/blog/enhancing-ai-model-performance-with-fresh-web-data?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web

N-gated Hacker News @[email protected] · 2026-06-08 · 13:38 UTC

🎉 Behold, the future of browser automation, where writing code is for peasants! 🤖 Intuned claims to solve all your web scraping woes with magic #AI that doesn't need you, unless it breaks. But don't worry, you have a whole smorgasbord of buzzwords like "Playwright" and "RPAAI" to keep you entertained while you wait. 🚀✨
https://intunedhq.com #browserautomation #webscraping #Playwright #RPAAI #HackerNews #ngated

#hackernews #ngated #ai #browserautomation #webscraping #playwright

Filippos Dimitrios KTISTAKIS @[email protected] · 2026-06-05 · 07:30 UTC

Released hydrascrape — a free, self-healing scraper for sites behind Imperva/Distil bot walls.

patched-Chrome (patchright) clears the JS challenge; a fleet of Tor exits beats per-IP limits and auto-rotates the bad ones. Resumable, with a live dashboard. Proven on a ~58k-page catalog from one laptop, $0.

Full "what I tried and why" writeup included. MIT.

🔗 github.com/philipposk/hydrascrape
#OpenSource #WebScraping #Tor #Python #SelfHosting

#opensource #webscraping #tor #python #selfhosting

Zyte @[email protected] · 2026-06-04 · 12:29 UTC

When you can scrape the web by API, a world of possibility opens up. Yes, you can extract live web data using iOS Shortcuts. https://www.zyte.com/blog/web-scraping-on-an-iphone?utm_campaign=blog-posts&utm_activity=ORS&utm_medium=social&utm_source=mastodon

#webscraping #webdata #data #web