home.social

#webscraping — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #webscraping, aggregated by home.social.

  1. Fedi, I need your input! Since the web is dead and googling "which user agent to use for responsible web scraping" mostly returns AI-generated garbage promoting how to spoof the user agent for non-responsible web scraping: What are your best practices? Any guide you would recommend? #FediHelp #WebScraping #DigitalHumanities

  2. Heurísticas para Web Scraping co…

    Los Modelos de Lenguaje de Gran Escala (LLMs) son sistemas de inteligencia artificial diseñados para entender y generar texto. Su aplicación en web scraping radica en la capacidad de analizar y extraer datos de páginas web complejas.

    norvik.tech/news/analisis-llms

    #Technology #WebScraping #Llms #Heuristicas #DesarrolloWeb #NorvikTech #DesarrolloSoftware #TechInnovation

  3. ICYMI: News publishers target Common Crawl, the AI training data backdoor: News/Media Alliance sent a formal letter to Common Crawl demanding it stop unauthorized scraping and block AI companies from using news content for training. ppc.land/news-publishers-targe #AI #NewsMedia #CommonCrawl #DataPrivacy #WebScraping

  4. The Register: Stale gov.uk pages are feeding AI overviews old data and Brits are believing it. “AI overviews from the likes of Google are serving up false summaries of UK government information by drawing on stale GOV.UK pages, according to content designers at the Department for Business and Trade (DBT). The problem, senior content designer Giorgio Di Tunno and content operations lead Neil […]

    https://rbfirehose.com/2026/04/27/the-register-stale-gov-uk-pages-are-feeding-ai-overviews-old-data-and-brits-are-believing-it/
  5. Join us on April 29, 4:30-5:30pm Eastern for a DRP Volunteer-led workshop on #webscraping! This will be a #handson opportunity to expand your technical skills to save data (or, as our volunteers put it, get your very own rocket booster!). Learn more: www.datarescueproject.org/web-scraping...

    Scrapers, Pipelines, and AI, O...

  6. RT @TheAhmadOsman: PRO-TIPP Mein Agent Web-Stack - SearXNG: Entdeckung potenzieller Quellen - Firecrawl: Scraping und Crawling bekannter URLs - Camofox: Browser-Fallback für JS/Interaktion Suchen - Extrahieren - Interagieren P.S. Gib dies deinem bevorzugten Agenten und sage ihm, er soll diese Tools zur Nutzung mit lokalen Modellen einrichten. Ahmad (@TheAhmadOsman) Nutzt du lokale LLMs? Stelle sicher, dass du die Websuche für sie einrichtest. Sag deinem bevorzugten Agenten, er soll SearNg für dich einrichten. Gib das deinen lokalen LLMs (sag einem Agenten, dass er das ebenfalls einrichten soll). Beobachte, wie sie viel intelligenter und effizienter werden. Bitte sehr — nitter.net/TheAhmadOsman/statu

    mehr auf Arint.info

    #AIAgents #LLM #LocalAI #TechStack #WebScraping #arint_info

    https://x.com/TheAhmadOsman/status/2044142893242204550#m

  7. Initial questions about this #retraction risk calculator:

    What about negative phrases in social media posts that are NOT actually about the linked article?

    What about negative posts about a paper that don’t actually link to the paper? (Screenshots, “link in comment” posts, etc.)

    #NLP #sentimentAnalysis #webScraping #bibliometrics #Altmetric #stats

  8. Oh joy, another #GitHub repository rehashing the same #overhyped #AI tricks we've seen a thousand times before. 🚀👏 Now with 100% more #TypeScript to make sure your web scraping dreams are both verbose and complicated! 🏆🎉
    github.com/lightfeed/extractor #WebScraping #HackerNews #ngated

  9. 🤔 Ah, the noble pursuit of creating an #SDK that scrambles HTML like an egg to "outsmart" scrapers. Because nothing screams #innovation like making your website look like it went through a blender. 😅 Bravo for single-handedly advancing the art of content protection to new heights of absurdity! 🙌
    obscrd.dev/ #ContentProtection #WebScraping #Absurdity #HackerNews #ngated

  10. TechSpot: Smart TV apps are quietly scraping web data for AI training. “Companies specializing in scraping or otherwise harvesting publicly available content to train AI models are becoming increasingly common. In particular, some firms are targeting smart TV applications and similar platforms, attempting to leverage users’ internet connectivity in exchange for low-cost incentives such as […]

    https://rbfirehose.com/2026/03/04/techspot-smart-tv-apps-are-quietly-scraping-web-data-for-ai-training/
  11. Apropos of content heists…

    DIY anti-scraping movement, why bother blocking when you can’t win? Poison instead. alexschroeder.ch/view/2026-02-

    #webscraping #datapoisoning #aitraining #ai

  12. The Register: SerpApi says Google is the pot calling the kettle black when it comes to scraping. “SerpApi, a Texas-based web scraping company, has asked a California court to dismiss Google’s claim that that it bypassed digital locks to gather copyrighted content in Google Search results.”

    https://rbfirehose.com/2026/02/22/the-register-serpapi-says-google-is-the-pot-calling-the-kettle-black-when-it-comes-to-scraping/
  13. Maxun v0.0.32 ra mắt với tính năng ghi âm thời gian thực, hỗ trợ đồng bộ trạng thái website thực tế, thao tác live như gõ, nhấn, cuộn, điều hướng. Hỗ trợ tích hợp SDK: LlamaIndex, Google Sheets, Airtable, LangChain, OpenAI và nhiều hơn nữa. Chế độ AI tự động tìm và trích xuất dữ liệu mà không cần URL. Mã nguồn mở, tự lưu trữ. #Maxun #WebScraping #OpenSource #SelfHosted #AI #LlamaIndex #LangChain #NoCode #DataExtraction #CôngCụMãNguồnMở #TríchXuấtDữLiệu #AI #TựHost

    reddit.com/r/selfh

  14. Search Engine Land: Does llms.txt matter? We tracked 10 sites to find out. “We wanted data, not debates. So we tracked llms.txt adoption across 10 sites in finance, B2B SaaS, ecommerce, insurance, and pet care — 90 days before implementation and 90 days after.”

    https://rbfirehose.com/2026/01/22/search-engine-land-does-llms-txt-matter-we-tracked-10-sites-to-find-out/
  15. Search Engine Roundtable: Google Sued SerpApi Over Scraping Search Results. “On Friday, Google announced it had filed a lawsuit (PDF) against SerpApi for scraping the Google search results. Google alleges that SerpApi is running an ‘unlawful’ operation that bypasses Google’s security measures to scrape search results at an astonishing scale.”

    https://rbfirehose.com/2025/12/20/search-engine-roundtable-google-sued-serpapi-over-scraping-search-results/
  16. Search Engine Roundtable: OpenAI Scales Up Crawling & Bots For The Holidays. “OpenAI is reportedly scaling up its crawling infrastructure for the holiday shopping season. The folks at Merj noticed OpenAI adding a lot of new IP ranges for its bots and crawlers.”

    https://rbfirehose.com/2025/12/01/search-engine-roundtable-openai-scales-up-crawling-bots-for-the-holidays/

  17. Mashable: Common Crawl accused of feeding paywalled content to AI companies. “In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes.”

    https://rbfirehose.com/2025/11/09/mashable-common-crawl-accused-of-feeding-paywalled-content-to-ai-companies/

  18. Craigslist isn’t just a marketplace — it’s a goldmine of data.
    By scraping Craigslist listings, businesses unlock real-time market intelligence:

    🏠 Real estate pros track pricing trends
    💼 Recruiters analyze job demand
    🚗 Auto dealers monitor competitive pricing

    Learn more: webscreenscraping.com/how-crai

    Smarter insights. Sharper strategies. All powered by data. 💡

    #WebScraping #CraigslistData #MarketIntelligence #DataAnalytics #PropTech #RecruitmentTrends #AutoIndustry #BusinessGrowth #AIInsights

  19. Wow, someone finally cracked the code on turning URLs into RSS feeds 🔮, using the arcane magic of CSS selectors 🧙‍♂️. Apparently, it's "quickly" done if you have a PhD in web scraping and a deep spiritual connection with the spirit of XML 😂. Now you too can relive the 2000s dream of reading your news the way your grandparents never did 🌐.
    feedmaker.fly.dev #URLtoRSS #CSSselectors #WebScraping #XMLmagic #2000sNostalgia #HackerNews #ngated

  20. "In our piece exploring whether the AI revolution is leaving APIs behind, we wrote about some of the factors limiting the extent to which AI tools like chatbots can interface with APIs.

    Some of these include:

    - Limited or no access to APIs for developers
    - APIs are sometimes overcomplicated, bloated, or difficult to call
    - Legacy APIs (WS/RPC) lack thorough or up-to-date documentation
    - APIs sometimes only cover a fraction of the functions available via the UI

    It’s worth noting that many of these points impact human API consumers just as much as they do agentic ones. If you’ve ever been in the position of trying to use an API and it falling short of your expectations, you’ll know just how frustrating it can be.

    While it’s possible that some of those users will get in touch to ask you to add certain endpoints or clarify things, plenty more won’t. Some developers are more likely to take the view that it’s easier to ask for forgiveness later than permission now, and find some other way to extract the data they’re looking for. In many cases, web scraping offers just such a solution.

    Web scraping APIs are a natural evolution of manual scraping techniques, such as using Python to scrape websites. Used for everything from scraping search engine results, like SERP APIs, to product prices and sentiment analysis, there are various services out there that make web scraping very straightforward. And they’re big business."

    nordicapis.com/are-web-scrapin

    #APIs #WebScraping #SoftwareDevelopment #Programming #APIDocumentation #APIDesign #Python

  21. Making the most out of a small LLM

    Yesterday i finally built my own #AI #server. I had a spare #Nvidia RTX 2070 with 8GB of #VRAM laying around and wanted to do this for a long time.

    The problem is that most #LLMs need a lot of VRAM and i don't want to buy another #GPU just to host my own AI. Then i came across #gemma3 and #qwen3. Both of these are amazing #quantized models with stunning reasoning given that they need so less resources.

    I chose huihui_ai/qwen3-abliterated:14b since it supports #deepthinking, #toolcalling and is pretty unrestricted. After some testing i noticed that the 8b model performs even better than the 14b variant with drastically better performance. I can't make out any quality loss there to be honest. The 14b model sneaked in chinese characters into the response very often. The 8b model on the other hand doesn't.

    Now i've got a very fast model with amazing reasoning (even in German) and tool calling support. The only thing left to improve is knowledge. #Firecrawl is a great tool for #webscraping and as soon as i implemented websearching, the setup was complete. At least i thought it was.

    I want to make the most out of this LLM and therefore my next step is to implement a basic #webserver that exposes the same #API #endpoints as #ollama so that everywhere ollama is supported, i can point it to my python script instead. This way it feels like the model is way more capable than it actually is. I can use these advanced features everywhere without being bound to it's actual knowledge.

    To improve this setup even more i will likely switch to a #mixture_of_experts architecture soon. This project is a lot of fun and i can't wait to integrate it into my homelab.

    #homelab #selfhosting #privacy #ai #llm #largelanguagemodels #coding #developement

  22. Need to grab specific info from a webpage regularly? 🤔 Browser Actions can help! Create a Shortcut to: Open URL ➡️ Wait for data element ➡️ Run JavaScript to extract text ➡️ Pass it back to Shortcuts!

    If you need help with that, just follow the Forum link on the site!

    actions.work/browser-actions?r

    #macOS #Shortcuts #WebScraping #DataExtraction #BrowserAutomation

  23. 2/

    Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

    If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

    ...

    And, getting data in JSON (key-value pairs) is definitely NOT scraping — as JSON's purpose is to communicate data in a machine-legible manner.

    CC: @404mediaco

    #Scraper #Scraping #WebScraper #WebScraping

  24. TorrentFreak: Alleged Anna’s Archive Operator Dropped from U.S. ‘Scraping’ Lawsuit. “American nonprofit OCLC sued Anna’s Archive last year for alleged hacking and unauthorized publishing of its WorldCat database. The sole named defendant in the case, an archivist from the Seattle area, denied any involvement with the site. After the court referred several scraping-related questions to […]

    https://rbfirehose.com/2025/04/18/torrentfreak-alleged-annas-archive-operator-dropped-from-u-s-scraping-lawsuit/

  25. CW: web scraper

    5/

    For example, if software request data from a web-site, and the web-site returns HTML, but parts of the HTML has semantics marked up with a machine-legible format such as microformats, microdata, RDFa, etc, then it is NOT scraping.

    (microformats, microdata, RDFa, etc, are machine-legible format, designed to express semantics to machines.)

    #Scraper #Scraping #WebScraper #WebScraping

  26. CW: web scraper

    4/

    For example, if software request data from a web-site, and the web-site returns HTML, but that HTML contains a <script> tag with JSON-LD in it, and the software consumes that JSON-LD, then it is NOT scraping.

    (JSON-LD is a machine-legible format, designed to express semantics to machines.)

    #Scraper #Scraping #WebScraper #WebScraping

  27. CW: web scraper

    3/

    For example, if software request data from a web-site, and the web-site returns JSON, XML, or some other machine-legible format, then it is NOT scraping.

    #Scraper #Scraping #WebScraper #WebScraping

  28. CW: web scraper

    2/

    Scraping (as in Web Scraping) is the act of extracting data from HTML web-pages where the data is NOT machine-legible.

    If the data, even in an HTML web-page, is in a machine-legible format, then it is NOT scraping.

    #Scraper #Scraping #WebScraper #WebScraping

  29. CW: web scraper

    1/

    I am understanding when a non-technical person uses the noun "scraper" (as in "web scraper") or the verb "scrape" in a way that isn't accurate.

    But, I am surprised when what seems to be a technical person uses the word "scraper", "scrape", or "scraping" inaccurately — either claiming things that are NOT scrapers to be scrapers, or claiming that acts that are NOT scraping are scraping.

    ...

    #Scraper #Scraping #WebScraper #WebScraping

  30. Ars Technica: Open Source devs say AI crawlers dominate traffic, forcing blocks on entire countries. “Software developer Xe Iaso reached a breaking point earlier this year when aggressive AI crawler traffic from Amazon overwhelmed their Git repository service, repeatedly causing instability and downtime. Despite configuring standard defensive measures—adjusting robots.txt, blocking known […]

    https://rbfirehose.com/2025/03/26/ars-technica-open-source-devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/

  31. Reuters: News Corp sued by Brave Software, a Google search engine rival. “News Corp has been sued by Google search engine rival Brave Software, which seeks to forestall a lawsuit by Rupert Murdoch’s company for when readers are directed to copyrighted articles from the Wall Street Journal and New York Post.”

    https://rbfirehose.com/2025/03/15/reuters-news-corp-sued-by-brave-software-a-google-search-engine-rival/

  32. Hey does anyone know if there's still a working zip bomb style exploit that can be deployed on a static site/JS (or as a asset/resource)? Specifically to target web scrapers and AI bullshit? The second any server goes online now it's immediately bombarded by stupid numbers of requests.

    #hacking #aislop #crawlers #webscraping #webcrawler #robots #zipbomb #zipbombing #exploit #robotstxt #server #scraper