home.social

#web-scraping — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #web-scraping, aggregated by home.social.

  1. 🎉 Look, another web scraper! 🎉 Because we *definitely* needed one more tool to fetch JSON from #Wikipedia faster than a cheetah on Red Bull. 🐆💨 No doubt, this will revolutionize the already groundbreaking field of #scraping celebrity birthdates. 🙄✨
    scrapewithruno.com/ #webscraping #JSONtools #celebritybirthdates #technology #HackerNews #ngated

  2. Some scraping APIs fail on 62% of requests. We benchmarked 12 tools against protected sites so you don't waste budget! hackernoon.com/web-scraping-ap #webscraping

  3. Fedi, I need your input! Since the web is dead and googling "which user agent to use for responsible web scraping" mostly returns AI-generated garbage promoting how to spoof the user agent for non-responsible web scraping: What are your best practices? Any guide you would recommend? #FediHelp #WebScraping #DigitalHumanities

  4. Heuristics for Web Scraping co…

    Large Language Models (LLMs) are artificial intelligence systems designed to understand and generate text. Their application to web scraping lies in their ability to analyze and extract data from complex web pages.

    norvik.tech/news/analisis-llms

    #Technology #WebScraping #Llms #Heuristicas #DesarrolloWeb #NorvikTech #DesarrolloSoftware #TechInnovation

  5. 37% → 78%. Doubled my web scraper's success rate by swapping requests for curl_cffi to mimic Chrome's TLS handshake.

    Bonus: deleting 22 lines of "defensive" header overrides added another 2pp. They were undermining the impersonation.

    Modern WAFs fingerprint TLS ClientHello and HTTP/2 SETTINGS, not User-Agents.

    mikenoe.com/posts/tls-fingerpr

    #python #webscraping #ai
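
    The swap described in post 5 could look roughly like the sketch below. It assumes the curl_cffi package is installed and that its requests-style API accepts impersonate="chrome". The helper names (strip_fingerprint_headers, fetch) and the exact set of headers treated as fingerprint-sensitive are hypothetical illustrations, not taken from the post.

```python
from typing import Any, Mapping, Optional

# Headers that a browser-impersonating client already fills in to match the
# impersonated browser. This particular set is an illustrative assumption.
FINGERPRINT_HEADERS = frozenset({
    "user-agent", "accept", "accept-encoding", "accept-language",
    "sec-ch-ua", "sec-ch-ua-mobile", "sec-ch-ua-platform",
})


def strip_fingerprint_headers(headers: Mapping[str, str]) -> dict:
    """Drop manual overrides that could contradict the impersonated browser."""
    return {k: v for k, v in headers.items() if k.lower() not in FINGERPRINT_HEADERS}


def fetch(url: str, extra_headers: Optional[Mapping[str, str]] = None,
          timeout: float = 30.0) -> Any:
    """GET `url` with a Chrome-like TLS/HTTP2 fingerprint via curl_cffi."""
    # Imported lazily so the helper above is usable without curl_cffi installed.
    from curl_cffi import requests

    return requests.get(
        url,
        headers=strip_fingerprint_headers(extra_headers or {}),
        impersonate="chrome",  # let the library choose Chrome-like TLS and HTTP/2 settings
        timeout=timeout,
    )
```

    Application headers such as Authorization pass through untouched, while a hand-written User-Agent or Accept-Language is dropped so it cannot contradict the impersonated browser. Whether this changes success rates on a given site is not something the sketch can show.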

  6. ICYMI: News publishers target Common Crawl, the AI training data backdoor: News/Media Alliance sent a formal letter to Common Crawl demanding it stop unauthorized scraping and block AI companies from using news content for training. ppc.land/news-publishers-targe #AI #NewsMedia #CommonCrawl #DataPrivacy #WebScraping

  7. The Register: Stale gov.uk pages are feeding AI overviews old data and Brits are believing it. “AI overviews from the likes of Google are serving up false summaries of UK government information by drawing on stale GOV.UK pages, according to content designers at the Department for Business and Trade (DBT). The problem, senior content designer Giorgio Di Tunno and content operations lead Neil […]

    https://rbfirehose.com/2026/04/27/the-register-stale-gov-uk-pages-are-feeding-ai-overviews-old-data-and-brits-are-believing-it/

  8. Join us on April 29, 4:30-5:30pm Eastern for a DRP Volunteer-led workshop on #webscraping! This will be a #handson opportunity to expand your technical skills to save data (or, as our volunteers put it, get your very own rocket booster!). Learn more: www.datarescueproject.org/web-scraping...

    Scrapers, Pipelines, and AI, O...

  9. RT @TheAhmadOsman: PRO TIP: My agent web stack - SearXNG: discovering potential sources - Firecrawl: scraping and crawling known URLs - Camofox: browser fallback for JS/interaction. Search - Extract - Interact. P.S. Give this to your preferred agent and tell it to set up these tools for use with local models. Ahmad (@TheAhmadOsman): Do you use local LLMs? Make sure you set up web search for them. Tell your preferred agent to set up SearXNG for you. Give this to your local LLMs (tell an agent to set that up as well). Watch them become much smarter and more efficient. Here you go: nitter.net/TheAhmadOsman/statu

    more at Arint.info

    #AIAgents #LLM #LocalAI #TechStack #WebScraping #arint_info

    https://x.com/TheAhmadOsman/status/2044142893242204550#m

  10. Initial questions about this #retraction risk calculator:

    What about negative phrases in social media posts that are NOT actually about the linked article?

    What about negative posts about a paper that don’t actually link to the paper? (Screenshots, “link in comment” posts, etc.)

    #NLP #sentimentAnalysis #webScraping #bibliometrics #Altmetric #stats

  11. Oh joy, another #GitHub repository rehashing the same #overhyped #AI tricks we've seen a thousand times before. 🚀👏 Now with 100% more #TypeScript to make sure your web scraping dreams are both verbose and complicated! 🏆🎉
    github.com/lightfeed/extractor #WebScraping #HackerNews #ngated

  12. 🤔 Ah, the noble pursuit of creating an #SDK that scrambles HTML like an egg to "outsmart" scrapers. Because nothing screams #innovation like making your website look like it went through a blender. 😅 Bravo for single-handedly advancing the art of content protection to new heights of absurdity! 🙌
    obscrd.dev/ #ContentProtection #WebScraping #Absurdity #HackerNews #ngated

  13. CNBC: Amazon wins court order to block Perplexity’s AI shopping agent. “Amazon sued Perplexity in November, alleging the startup took steps to ‘conceal’ its AI agents so they could continue to scrape the online retailer’s website without its approval. Perplexity called the lawsuit, which was filed in U.S. District Court in the Northern District of California, a ‘bully tactic.'”

    https://rbfirehose.com/2026/03/11/cnbc-amazon-wins-court-order-to-block-perplexitys-ai-shopping-agent/

  14. For decades we had internet #spam bots (whether in email, blog comments etc) but we sort of coped, shrugged and moved on. The same with #webscraping. This dumb automation did not challenge our core assumptions about how online networks can be organized, what level of trust is required for the thing to work as infrastructure connecting humans.

    But that era is over: now "AI" bots bring down servers, impersonate people, and replicate #opensource projects.