#commoncrawl — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #commoncrawl, aggregated by home.social.
-
ICYMI: News publishers target Common Crawl, the AI training data backdoor: News/Media Alliance sent a formal letter to Common Crawl demanding it stop unauthorized scraping and block AI companies from using news content for training. https://ppc.land/news-publishers-target-common-crawl-the-ai-training-data-backdoor/ #AI #NewsMedia #CommonCrawl #DataPrivacy #WebScraping
-
https://winbuzzer.com/2026/02/15/publishers-block-internet-archive-ai-scraping-fears-xcxwbn/
Publishers Block Internet Archive Over AI Scraping Fears
#AI #WaybackMachine #InternetArchive #Google #Reddit #OpenAI #BigTech #TheNewYorkTimes #NewsPublishers #AIScraping #OpenWeb #CommonCrawl #PerplexityAI #Media
-
📸🤦♂️ Nathan Rooy discovers that flashy websites are like McDonald's cheeseburgers: popular for being just "good enough." Instead of a gourmet web experience, it's a buffet of #mediocrity sourced from Common Crawl's greatest hits. Web connoisseurs, prepare to feast on the bland! 🍔💻
https://nry.me/posts/2025-10-09/small-web-screenshots/ #flashywebsites #webdesign #cheeseburgers #CommonCrawl #HackerNews #ngated
-
The Company Quietly Funneling #Paywalled Articles to #AI Developers
#CommonCrawl's website states that it scrapes the internet for "freely available content" without "going behind any '#paywall.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their #LLMs on high-quality journalism for free.
In #2020, #OpenAI used Common Crawl’s archives to train #GPT3.
https://www.msn.com/en-us/money/news/the-company-quietly-funneling-paywalled-articles-to-ai-developers/ar-AA1PMBHE
-
Mashable: Common Crawl accused of feeding paywalled content to AI companies. “In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes.”
-
Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good commoncrawl.org/blog/setting-t… #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad; it needed this rebuttal)
-
The Nonprofit Doing the AI Industry’s Dirty Work https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/ #tech #AI #CommonCrawl #PrivacyRights #TechRegulation #SiliconValley #BigBrother
-
"Robots are people too," says Rich Skrenta, executive director of Common Crawl, a nonprofit that crawls billions of web pages and has allegedly created a backdoor for AI models to be trained, covertly, on paywalled articles. Skrenta told The Atlantic (in "The Nonprofit Doing the AI Industry’s Dirty Work", Nov 4, 2025) that requests to remove such content from the database are "totally annoying", and argues that bots should be allowed to read everything for free.
Common Crawl's customers include OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon.
AI steals everything, everywhere... I can't think of anything more to say right now, except: 🤮🤮🤮🤮🤮
#kishit #ki #ai #openai #google #meta #amazon #anthropic #nvidia #commoncrawl
-
Several French #media organizations are protesting the unauthorized use of their content by #AI systems.
Freely accessible datasets such as #CommonCrawl, whose contents are used to train #LanguageModels, are a particular focus.
The #publishers are demanding the removal of copyrighted content and have announced legal action.
#Copyright #ArtificialIntelligence #France #ExploitationRights
-
#CommonCrawl extract in collaboration with the Dutch Language Institute (@ivdnt)
From trillions to usable: how #GPTNL (TNO, @SURF & NFI) responsibly filters web data
The article is in Dutch, but machine-translate it if you’re curious about how to obtain data fairly for a large language model!
-
Testimony from the #CommonCrawl (https://commoncrawl.org/) people. Blocking bots is more and more common, for instance via Cloudflare, so research projects like Common Crawl (bot "CCBot") suffer.
They use 147 regular expressions to identify and classify refusals.
HTTP status codes can be misleading, such as the non-standard 430 returned by Shopify, or a 429 returned for refusals that aren't actually transient.
Many sites can become unreachable at once because they are centralized under one company, like Newfold Digital.
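For a flavor of how such refusal detection can work, here is a minimal Python sketch combining status-code checks with body-pattern regexes. The handful of patterns and labels are illustrative assumptions, not Common Crawl's actual 147-rule set.

```python
import re

# Illustrative patterns only; the classifier described in the talk
# uses 147 regular expressions.
REFUSAL_PATTERNS = [
    (re.compile(r"access denied", re.I), "generic-block"),
    (re.compile(r"verify (that )?you are (a )?human", re.I), "captcha-wall"),
    (re.compile(r"attention required!?\s*\|\s*cloudflare", re.I), "cdn-challenge"),
]

def classify_refusal(status: int, body: str) -> str | None:
    """Return a refusal label, or None if the response looks legitimate."""
    if status == 430:           # non-standard code Shopify returns for blocks
        return "shopify-block"
    if status in (403, 429):    # 429 is sometimes a permanent refusal, not throttling
        return "status-refusal"
    if status == 200:           # a 200 can still carry a block page in disguise
        for pattern, label in REFUSAL_PATTERNS:
            if pattern.search(body):
                return label
    return None
```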
-
“Filtering of #GenAI training datasets on #CommonCrawl data is often very basic. Example: C4, filtering out sources that contain naughty words. This leaves violence in, and removes a lot of LGBTQIA content.” — @tootbaack
Full paper presentation: https://mozilla.social/@tootbaack/111885437306039860
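As a concrete illustration of how blunt that kind of filtering is, here is a minimal Python sketch of a C4-style blocklist filter. The tiny word list is a hypothetical stand-in for the much longer "naughty words" list C4 uses.

```python
# Hypothetical three-word stand-in for C4's much longer blocklist.
BLOCKLIST = {"sex", "lesbian", "gay"}

def keep_document(text: str) -> bool:
    """Drop the whole document if any token matches the blocklist."""
    tokens = (tok.strip(".,!?\"'").lower() for tok in text.split())
    return not any(tok in BLOCKLIST for tok in tokens)

# The failure mode from the quote: identity terms are removed,
# violent text sails through.
assert not keep_document("Support group for lesbian teens")
assert keep_document("He beat the prisoner until he bled")
```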
#AoIR
-
Starting to see (and getting a bit excited about) some components of openwebsearch.eu, and I was wondering if the EU will finally get its own Common Crawl-like dataset (commoncrawl.org).
It seems the crawling results aren't publicly accessible yet, and there's already some discussion about GDPR implications.
At this pace, we're still far from being able to compete with US-scale open data efforts 🤦♂️
#europe #commoncrawl #openwebsearch
🔗 https://pipeline.shared-search.eu/
🔗 https://pipeline.shared-search.eu/explain/license.html
-
Woohoo! #CommonCrawl has bumped the truncation limit from 1 MiB to 5 MiB, which means the share of truncated PDFs has dropped from ~26% to ~7%.
This is critical for other binary formats as well.
Thank you #CommonCrawl!
https://commoncrawl.org/blog/march-2025-crawl-archive-now-available
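If you want to measure this yourself, here is a minimal sketch, assuming the warcio package and a locally downloaded WARC file (the filename is a placeholder), that counts records flagged as truncated:

```python
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def truncation_stats(warc_path: str) -> Counter:
    """Count response records, and how many were cut off at the size limit."""
    counts = Counter()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            counts["responses"] += 1
            # Per the WARC spec, truncated payloads carry a WARC-Truncated
            # header ("length" when the cause was a size cutoff).
            if record.rec_headers.get_header("WARC-Truncated"):
                counts["truncated"] += 1
    return counts

print(truncation_stats("CC-MAIN-example.warc.gz"))  # placeholder filename
```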
-
Researchers have uncovered nearly 12,000 private API keys and passwords embedded within the Common Crawl dataset, an openly available archive of web data used by leading AI developers to train their AI models.
https://www.computing.co.uk/news/2025/security/ai-training-dataset-leaks-api-keys-and-passwords
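The scanning technique itself is straightforward. A minimal sketch, assuming just two credential patterns (the widely documented AWS access-key-ID shape and a loose generic api_key match); real scanners cover far more services:

```python
import re

SECRET_PATTERNS = {
    # Widely documented shape of an AWS access key ID.
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Loose generic pattern: "api_key" followed by a long token.
    "generic-api-key": re.compile(r"api[_-]?key\W{0,3}[A-Za-z0-9_-]{16,}", re.I),
}

def find_secrets(page_text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in a page's text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, m.group(0)) for m in pattern.finditer(page_text))
    return hits
```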
-
It looks like the LLM-producing companies that are massively #crawling the #web require website owners to take action to opt out. Although I am not intrinsically against #generativeai or the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it acceptable that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget, operated by volunteers who dedicate hundreds of hours without any reward other than belief in their mission?
I am mostly concerned about opt-out being the default. Are the owners of those projects really required to take action? As an #operator, is it my responsibility to work methodically through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for maintaining a unique crawling specification in my robots.txt because hypergiants make it inherently hard to write a generic #opt-out configuration that targets LLM projects specifically?
I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to stop these crawlers from skyrocketing their own operational costs.
We need a new #opt-in. Often, public and open projects are keen to share their data; they just don't like carrying the unpredictable, multitudinous financial burden of sharing it without notice from said crawlers. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?
To counter the most common argument in advance: yes, you can deny by default in your robots.txt, but that also excludes every non-mainstream crawler, as the sketch below shows.
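For illustration, a minimal sketch of that deny-by-default robots.txt, re-allowing a couple of crawlers by their documented user-agent tokens (CCBot and Googlebot here; any token not listed, including every small, well-behaved crawler, stays blocked):

```
# Deny by default
User-agent: *
Disallow: /

# Re-allow named crawlers (an empty Disallow permits everything)
User-agent: CCBot
Disallow:

User-agent: Googlebot
Disallow:
```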
Some concerning #news articles on the topic:
-
@denschub @beep And #CommonCrawl has good info in its FAQ on how to, e.g., handle the intensity of its crawling: https://commoncrawl.org/faq
And IP ranges that let one verify it's the actual Common Crawl bot, if one wants to give it preferential treatment over for-profit bots.
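A minimal sketch of such preferential treatment; the CIDR below is a documentation placeholder, not Common Crawl's real range, which you should fetch from their FAQ:

```python
from ipaddress import ip_address, ip_network

# Placeholder (TEST-NET-2), NOT Common Crawl's real range; see their FAQ.
CCBOT_RANGES = [ip_network("198.51.100.0/24")]

def is_real_ccbot(client_ip: str, user_agent: str) -> bool:
    """True only when the UA claims CCBot AND the IP is in the published ranges."""
    if "CCBot" not in user_agent:
        return False
    addr = ip_address(client_ip)
    return any(addr in net for net in CCBOT_RANGES)
```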
-
Everything you wanted to know but were afraid to ask about #CommonCrawl.
Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI https://piv.al/411QdzC
This is an important and approachable read for anyone interested in understanding LLMs.
-
It's worth pointing out the role of #commoncrawl in all of this. Their aim was "beneficial": instead of every research group scraping the web separately (hammering all our servers), they decided to do it once as a public pool of data for research. But:
(a) they did nothing to help respect authors' licensing (e.g. "no-derivatives"/"share-alike" #creativecommons choices);
(b) they hide behind US "fair use" law, but they do nothing to ensure the data will only be used for fair-use purposes.
-
Fediverse images & #alttext will certainly be scraped by groups to train their AIs on image-text correspondence. I'm sure it will be happening already. (Yes, many tools can already generate crappy alttext, but high-quality paired data is *valuable* in ML.) Thanks to the precedent set by #commoncrawl and #LLMs, our copyrights and licence terms will be ignored even when explicitly asserted. Case law is not strong enough (nor international).
-
#Moshi: a French 🇨🇵 #OpenSource #text-to-speech #chatbot that thinks and speaks at the same time. It is a #prototype #AI (#IA) #model developed by #Kyutai, funded among others by #XavierNiel (#Free).
It was pre-trained on #Hélium data, drawn from projects such as #CommonCrawl.
For now it is only available in #English, with several possible #accents; #French 🇨🇵 and #Spanish 🇪🇦 coming soon (? 😍😍)
https://www.01net.com/actualites/moshi-pour-son-patron-le-chatgpt-francais-est-un-exploit.html
-
Publishers Target Common Crawl In Fight Over #AI Training Data
Long-running nonprofit #CommonCrawl has been a boon to researchers for years. But now its role in AI training data has triggered backlash from publishers.
https://www.wired.com/story/the-fight-against-ai-comes-to-a-foundational-data-set/
-
Mark Zuckerberg’s Secret Weapon for AI Is Your Facebook Data https://tech.hindustantimes.com/opinion/mark-zuckerberg-s-secret-weapon-for-ai-is-your-facebook-data-71707243340870.html (originally on Bloomberg at https://www.bloomberg.com/opinion/articles/2024-02-06/zuckerberg-s-plan-for-ai-hinges-on-your-facebook-and-instagram-data)
Frikkin' HATE it when academic mentions don't include a citation. Here's the paper: https://aclanthology.org/2021.acl-short.24/ (What’s in the Box?)
-
Most generative AI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development. #commoncrawl #ai #generativeAI #llm #datagovernance #sts https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl (1/10)