#web-scraping — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #web-scraping, aggregated by home.social.
-
🎉 Look, another web scraper! 🎉 Because we *definitely* needed one more tool to fetch JSON from #Wikipedia faster than a cheetah on Red Bull. 🐆💨 No doubt, this will revolutionize the already groundbreaking field of #scraping celebrity birthdates. 🙄✨
https://scrapewithruno.com/ #webscraping #JSONtools #celebritybirthdates #technology #HackerNews #ngated
-
Some scraping APIs fail on 62% of requests. We benchmarked 12 tools against protected sites so you don't waste budget! https://hackernoon.com/web-scraping-api-success-rates-we-tested-12-tools-so-you-dont-have-to #webscraping
-
Fedi, I need your input! Since the web is dead and googling "which user agent to use for responsible web scraping" mostly returns AI-generated garbage promoting how to spoof the user agent for non-responsible web scraping: What are your best practices? Any guide you would recommend? #FediHelp #WebScraping #DigitalHumanities
-
New stealth browser for automation: compatible with Playwright and Puppeteer, fingerprinting at the C++ level, and humanized behavior. #automation #browser #webscraping #developers #Linux #Docker
-
Heuristics for Web Scraping…
Large Language Models (LLMs) are artificial intelligence systems designed to understand and generate text. Their application to web scraping lies in their ability to analyze complex web pages and extract data from them.
https://norvik.tech/news/analisis-llms-para-web-scraping
#Technology #WebScraping #Llms #Heuristicas #DesarrolloWeb #NorvikTech #DesarrolloSoftware #TechInnovation
-
37% → 78%. Doubled my web scraper's success rate by swapping requests for curl_cffi to mimic Chrome's TLS handshake.
Bonus: deleting 22 lines of "defensive" header overrides added another 2pp. They were undermining the impersonation.
Modern WAFs fingerprint TLS ClientHello and HTTP/2 SETTINGS, not User-Agents.
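A minimal sketch of the swap described above, assuming the curl_cffi package is installed. The post does not show its code, so the helper name `fetch_like_chrome` and its `get` parameter are hypothetical; only `curl_cffi.requests.get(..., impersonate="chrome")` is the library's own call. Note that no custom headers are passed, so the impersonation profile's header set stays consistent with its TLS fingerprint.

```python
from typing import Any, Callable, Optional


def fetch_like_chrome(
    url: str,
    get: Optional[Callable[..., Any]] = None,
    timeout: float = 30.0,
) -> Any:
    """Fetch `url` while presenting a Chrome-like TLS and HTTP/2 fingerprint.

    `get` is a hypothetical injection point (useful for tests); when it is
    omitted, curl_cffi's requests.get is used.
    """
    if get is None:
        # Imported lazily so this module loads even without curl_cffi.
        from curl_cffi import requests as cffi_requests

        get = cffi_requests.get
    # Deliberately no `headers=` argument: hand-written header overrides can
    # contradict the browser profile that `impersonate` sets up.
    return get(url, impersonate="chrome", timeout=timeout)


if __name__ == "__main__":
    # Placeholder URL; replace with a page you are permitted to fetch.
    response = fetch_like_chrome("https://example.com/")
    print(response.status_code)
```

Whether this helps depends on how the target site fingerprints clients; the success rates quoted in the post are the author's own measurements.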
-
El lado del mal - TikTok's cognitive CAPTCHA in the style of "The Secret of the Monkey Island", solved by artificial intelligence https://www.elladodelmal.com/2026/05/el-captcha-cognitivo-de-tiktok-al.html #Captcha #IA #AI #InteligenciaArtificial #Cognitive #Gemini #MonkeyIsland #Hacking #Pentesting #Webscraping #TikTok
-
ICYMI: News publishers target Common Crawl, the AI training data backdoor: News/Media Alliance sent a formal letter to Common Crawl demanding it stop unauthorized scraping and block AI companies from using news content for training. https://ppc.land/news-publishers-target-common-crawl-the-ai-training-data-backdoor/ #AI #NewsMedia #CommonCrawl #DataPrivacy #WebScraping
-
The Register: Stale gov.uk pages are feeding AI overviews old data and Brits are believing it. “AI overviews from the likes of Google are serving up false summaries of UK government information by drawing on stale GOV.UK pages, according to content designers at the Department for Business and Trade (DBT). The problem, senior content designer Giorgio Di Tunno and content operations lead Neil […]
https://rbfirehose.com/2026/04/27/the-register-stale-gov-uk-pages-are-feeding-ai-overviews-old-data-and-brits-are-believing-it/
-
Join us on April 29, 4:30-5:30pm Eastern for a DRP Volunteer-led workshop on #webscraping! This will be a #handson opportunity to expand your technical skills to save data (or, as our volunteers put it, get your very own rocket booster!). Learn more: www.datarescueproject.org/web-scraping...
Scrapers, Pipelines, and AI, O...
-
RT @TheAhmadOsman: PRO TIP: My agent web stack
SearXNG: discovering potential sources
Firecrawl: scraping and crawling known URLs
Camofox: browser fallback for JS/interaction
Search, extract, interact.
P.S. Give this to your preferred agent and tell it to set up these tools for use with local models.
Ahmad (@TheAhmadOsman): Do you use local LLMs? Make sure you set up web search for them. Tell your preferred agent to set up SearXNG for you. Give this to your local LLMs (tell an agent to set that up as well). Watch them become much smarter and more efficient. Here you go: https://nitter.net/TheAhmadOsman/status/2043884414774505741#m
more at Arint.info
-
The Internet's Most Powerful Archiving Tool Is in Peril www.wired.com/story/the-inte… #WaybackMachine #InternetArchive #journalism #WebScraping
-
Initial questions about this #retraction risk calculator:
What about negative phrases in social media posts that are NOT actually about the linked article?
What about negative posts about a paper that don’t actually link to the paper? (Screenshots, “link in comment” posts, etc.)
#NLP #sentimentAnalysis #webScraping #bibliometrics #Altmetric #stats
-
Find out how parsing has moved upstream to support accurate data collection https://hackernoon.com/parsing-as-response-validation-a-new-necessity-for-scraping #webscraping
-
Quo Vadis, Crawlers? Progress and what’s next on safeguarding our infrastructure https://diff.wikimedia.org/2026/03/26/quo-vadis-crawlers-progress-and-whats-next-on-safeguarding-our-infrastructure/ #AI, #AIDataCrawlers, #Crawlers, #Infrastructure, #Knowledge, #KnowledgeAsAService, #Scraping, #ScrapingBots, #WebScraping, #WikimediaFoundation, #WikimediaProjects
-
Oh joy, another #GitHub repository rehashing the same #overhyped #AI tricks we've seen a thousand times before. 🚀👏 Now with 100% more #TypeScript to make sure your web scraping dreams are both verbose and complicated! 🏆🎉
https://github.com/lightfeed/extractor #WebScraping #HackerNews #ngated
-
🤔 Ah, the noble pursuit of creating an #SDK that scrambles HTML like an egg to "outsmart" scrapers. Because nothing screams #innovation like making your website look like it went through a blender. 😅 Bravo for single-handedly advancing the art of content protection to new heights of absurdity! 🙌
https://www.obscrd.dev/ #ContentProtection #WebScraping #Absurdity #HackerNews #ngated
-
I built an SDK that scrambles HTML so scrapers get garbage
#HackerNews #SDK #HTML #Scrambling #WebScraping #DeveloperTools #Privacy
-
CNBC: Amazon wins court order to block Perplexity’s AI shopping agent. “Amazon sued Perplexity in November, alleging the startup took steps to ‘conceal’ its AI agents so they could continue to scrape the online retailer’s website without its approval. Perplexity called the lawsuit, which was filed in U.S. District Court in the Northern District of California, a ‘bully tactic.’”
https://rbfirehose.com/2026/03/11/cnbc-amazon-wins-court-order-to-block-perplexitys-ai-shopping-agent/
-
For decades we had internet #spam bots (whether in email, blog comments etc) but we sort of coped, shrugged and moved on. The same with #webscraping. This dumb automation did not challenge our core assumptions about how online networks can be organized, what level of trust is required for the thing to work as infrastructure connecting humans.
But that era is over: now "AI" bots bring down servers, impersonate people, and replicate #opensource projects.