#commoncrawl — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #commoncrawl, aggregated by home.social.
-
ICYMI: News publishers target Common Crawl, the AI training data backdoor: News/Media Alliance sent a formal letter to Common Crawl demanding it stop unauthorized scraping and block AI companies from using news content for training. https://ppc.land/news-publishers-target-common-crawl-the-ai-training-data-backdoor/ #AI #NewsMedia #CommonCrawl #DataPrivacy #WebScraping
-
https://winbuzzer.com/2026/02/15/publishers-block-internet-archive-ai-scraping-fears-xcxwbn/
Publishers Block Internet Archive Over AI Scraping Fears
#AI #WaybackMachine #InternetArchive #Google #Reddit #OpenAI #BigTech #TheNewYorkTimes #NewsPublishers #AIScraping #OpenWeb #CommonCrawl #PerplexityAI #Media
-
📸🤦♂️ Nathan Rooy discovers that flashy websites are like McDonald's cheeseburgers: popular for being just "good enough." Instead of a gourmet web experience, it's a buffet of #mediocrity sourced from Common Crawl's greatest hits. Web connoisseurs, prepare to feast on the bland! 🍔💻
https://nry.me/posts/2025-10-09/small-web-screenshots/ #flashywebsites #webdesign #cheeseburgers #CommonCrawl #HackerNews #ngated
-
The Company Quietly Funneling #Paywalled Articles to #AI Developers
#CommonCrawl's website states that it scrapes the internet for "freely available content" without "going behind any '#paywall.'" Yet the organization has taken articles from major news websites that people normally have to pay for — allowing AI companies to train their #LLMs on high-quality journalism for free.
In #2020, #OpenAI used Common Crawl’s archives to train #GPT3.
https://www.msn.com/en-us/money/news/the-company-quietly-funneling-paywalled-articles-to-ai-developers/ar-AA1PMBHE
-
Mashable: Common Crawl accused of feeding paywalled content to AI companies. “In a detailed investigation for The Atlantic, reporter Alex Reisner reveals that several major AI companies have quietly partnered with the Common Crawl Foundation — a nonprofit that scrapes the web to build a massive public archive of the internet for research purposes.”
-
Common Crawl - Setting the Record Straight: Common Crawl’s Commitment to Transparency, Fair Use, and the Public Good commoncrawl.org/blog/setting-t… #AI #CommonCrawl #data #WebArchiving (wow, that Atlantic piece was bad; it needed this rebuttal)
-
The Nonprofit Doing the AI Industry’s Dirty Work https://www.theatlantic.com/technology/2025/11/common-crawl-ai-training-data/684567/ #tech #AI #CommonCrawl #PrivacyRights #TechRegulation #SiliconValley #BigBrother
-
"Robots are people too," says Rich Skrenta, executive director of Common Crawl, a nonprofit that crawls billions of web pages and has allegedly created a backdoor for AI models to be trained, covertly, on paywalled articles. Skrenta told The Atlantic (in "The Nonprofit Doing the AI Industry’s Dirty Work", Nov 4, 2025) that requests to remove such content from the database are "totally annoying", and argues that bots should be allowed to read everything for free.
Common Crawl's customers include OpenAI, Google, Anthropic, Nvidia, Meta, and Amazon.
AI steals everything, everywhere... I can't think of anything more to say right now, except: 🤮🤮🤮🤮🤮
#kishit #ki #ai #openai #google #meta #amazon #anthropic #nvidia #commoncrawl
-
Several French #media organizations are protesting the unauthorized use of their content by #AI systems.
Freely accessible datasets such as #CommonCrawl, whose contents are used to train #LanguageModels, are a particular focus.
The #publishers are demanding the removal of copyrighted content and have announced legal action.
#Copyright #ArtificialIntelligence #France #ExploitationRights
-
#CommonCrawl extract in collaboration with the Dutch Language Institute (@ivdnt)
From trillions to usable: how #GPTNL (TNO, @SURF & NFI) responsibly filters web data
The article is in Dutch, but machine-translate it if you’re curious about how to obtain data fairly for a large language model!
-
Testimony from the #CommonCrawl (https://commoncrawl.org/) people. Blocking bots is more and more common, for instance via Cloudflare, so research projects like Common Crawl (bot "CCBot") suffer.
They use 147 regular expressions to identify and classify refusals.
HTTP status codes can be misleading, such as the non-standard 430 returned by Shopify, or a 429 returned for refusals that aren't actually transient.
Many sites can become unreachable at once because they are centralized under one company, like Newfold Digital.
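For a flavor of how such refusal detection can work, here is a minimal Python sketch combining status-code checks with body-pattern regexes. The handful of patterns and labels are illustrative assumptions, not Common Crawl's actual 147-rule set.

```python
import re

# Illustrative patterns only; the classifier described in the talk
# uses 147 regular expressions.
REFUSAL_PATTERNS = [
    (re.compile(r"access denied", re.I), "generic-block"),
    (re.compile(r"verify (that )?you are (a )?human", re.I), "captcha-wall"),
    (re.compile(r"attention required!?\s*\|\s*cloudflare", re.I), "cdn-challenge"),
]

def classify_refusal(status: int, body: str) -> str | None:
    """Return a refusal label, or None if the response looks legitimate."""
    if status == 430:           # non-standard code Shopify returns for blocks
        return "shopify-block"
    if status in (403, 429):    # 429 is sometimes a permanent refusal, not throttling
        return "status-refusal"
    if status == 200:           # a 200 can still carry a block page in disguise
        for pattern, label in REFUSAL_PATTERNS:
            if pattern.search(body):
                return label
    return None
```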
-
“Filtering of #GenAI training datasets on #CommonCrawl data is often very basic. Example: C4, filtering out sources that contain naughty words. This leaves violence in, and removes a lot of LGBTQIA content.” — @tootbaack
Full paper presentation: https://mozilla.social/@tootbaack/111885437306039860
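As a concrete illustration of how blunt that kind of filtering is, here is a minimal Python sketch of a C4-style blocklist filter. The tiny word list is a hypothetical stand-in for the much longer "naughty words" list C4 uses.

```python
# Hypothetical three-word stand-in for C4's much longer blocklist.
BLOCKLIST = {"sex", "lesbian", "gay"}

def keep_document(text: str) -> bool:
    """Drop the whole document if any token matches the blocklist."""
    tokens = (tok.strip(".,!?\"'").lower() for tok in text.split())
    return not any(tok in BLOCKLIST for tok in tokens)

# The failure mode from the quote: identity terms are removed,
# violent text sails through.
assert not keep_document("Support group for lesbian teens")
assert keep_document("He beat the prisoner until he bled")
```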
#AoIR
-
Starting to see (and getting a bit excited about) some components of openwebsearch.eu, and I was wondering if the EU will finally get its own Common Crawl-like dataset (commoncrawl.org).
It seems the crawling results aren't publicly accessible yet, and there's already some discussion about GDPR implications.
At this pace, we're still far from being able to compete with US-scale open data efforts 🤦♂️
#europe #commoncrawl #openwebsearch
🔗 https://pipeline.shared-search.eu/
🔗 https://pipeline.shared-search.eu/explain/license.html
-
Woohoo! #CommonCrawl has bumped the truncation limit from 1 MiB to 5 MiB, which means the share of truncated PDFs has dropped from ~26% to ~7%.
This is critical for other binary formats as well.
Thank you #CommonCrawl!
https://commoncrawl.org/blog/march-2025-crawl-archive-now-available
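If you want to measure this yourself, here is a minimal sketch, assuming the warcio package and a locally downloaded WARC file (the filename is a placeholder), that counts records flagged as truncated:

```python
from collections import Counter
from warcio.archiveiterator import ArchiveIterator

def truncation_stats(warc_path: str) -> Counter:
    """Count response records, and how many were cut off at the size limit."""
    counts = Counter()
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            counts["responses"] += 1
            # Per the WARC spec, truncated payloads carry a WARC-Truncated
            # header ("length" when the cause was a size cutoff).
            if record.rec_headers.get_header("WARC-Truncated"):
                counts["truncated"] += 1
    return counts

print(truncation_stats("CC-MAIN-example.warc.gz"))  # placeholder filename
```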
-
Researchers have uncovered nearly 12,000 private API keys and passwords embedded within the Common Crawl dataset, an openly available archive of web data used by leading AI developers to train their AI models.
https://www.computing.co.uk/news/2025/security/ai-training-dataset-leaks-api-keys-and-passwords
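The scanning technique itself is straightforward. A minimal sketch, assuming just two credential patterns (the widely documented AWS access-key-ID shape and a loose generic api_key match); real scanners cover far more services:

```python
import re

SECRET_PATTERNS = {
    # Widely documented shape of an AWS access key ID.
    "aws-access-key-id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    # Loose generic pattern: "api_key" followed by a long token.
    "generic-api-key": re.compile(r"api[_-]?key\W{0,3}[A-Za-z0-9_-]{16,}", re.I),
}

def find_secrets(page_text: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_text) pairs found in a page's text."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, m.group(0)) for m in pattern.finditer(page_text))
    return hits
```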
-
It looks like the LLM-producing companies that are massively #crawling the #web require website owners to take action to opt out. Although I am not intrinsically against #generativeai or the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it acceptable that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget, operated by volunteers who dedicate hundreds of hours without any reward other than belief in their mission?
I am mostly concerned about opt-out being the default. Are the owners of those projects really required to take action? As an #operator, is it my responsibility to work methodically through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for maintaining a unique crawling specification in my robots.txt because hypergiants make it inherently hard to write a generic #opt-out configuration that targets LLM projects specifically?
I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but in which the resource owner is also required to stop these crawlers from skyrocketing their own operational costs.
We need a new #opt-in. Often, public and open projects are keen to share their data; they just don't like carrying the unpredictable, multitudinous financial burden of sharing it without notice from said crawlers. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?
To counter the most common argument in advance: yes, you can deny by default in your robots.txt, but that also excludes every non-mainstream crawler, as the sketch below shows.
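For illustration, a minimal sketch of that deny-by-default robots.txt, re-allowing a couple of crawlers by their documented user-agent tokens (CCBot and Googlebot here; any token not listed, including every small, well-behaved crawler, stays blocked):

```
# Deny by default
User-agent: *
Disallow: /

# Re-allow named crawlers (an empty Disallow permits everything)
User-agent: CCBot
Disallow:

User-agent: Googlebot
Disallow:
```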
Some concerning #news articles on the topic:
-
@denschub @beep And #CommonCrawl has good info in its FAQ on how to, e.g., handle the intensity of its crawling: https://commoncrawl.org/faq
And IP ranges that let one verify it's the actual Common Crawl bot, if one wants to give it preferential treatment over for-profit bots.
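A minimal sketch of such preferential treatment; the CIDR below is a documentation placeholder, not Common Crawl's real range, which you should fetch from their FAQ:

```python
from ipaddress import ip_address, ip_network

# Placeholder (TEST-NET-2), NOT Common Crawl's real range; see their FAQ.
CCBOT_RANGES = [ip_network("198.51.100.0/24")]

def is_real_ccbot(client_ip: str, user_agent: str) -> bool:
    """True only when the UA claims CCBot AND the IP is in the published ranges."""
    if "CCBot" not in user_agent:
        return False
    addr = ip_address(client_ip)
    return any(addr in net for net in CCBOT_RANGES)
```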
-
Everything you wanted to know but were afraid to ask about #CommonCrawl.
Training Data for the Price of a Sandwich: Common Crawl’s Impact on Generative AI https://piv.al/411QdzC
This is an important and approachable read for anyone interested in understanding LLMs.
-
It's worth pointing out the role of #commoncrawl in all of this. Their aim was "beneficial": instead of every research group scraping the web separately (hammering all our servers), they decided to do it once as a public pool of data for research. But:
(a) they did nothing to help respect authors' licensing (e.g. "no-derivatives"/"share-alike" #creativecommons choices);
(b) they hide behind US "fair use" law, but they do nothing to ensure the data will only be used for fair-use purposes.
-
Fediverse images & #alttext will certainly be scraped by groups to train their AIs on image-text correspondence. I'm sure it will be happening already. (Yes, many tools can already generate crappy alttext, but high-quality paired data is *valuable* in ML.) Thanks to the precedent set by #commoncrawl and #LLMs, our copyrights and licence terms will be ignored even when explicitly asserted. Case law is not strong enough (nor international).
-
#Moshi: a French 🇨🇵 #OpenSource #text-to-speech #chatbot that thinks and speaks at the same time. It is a #prototype #AI (#IA) #model developed by #Kyutai, funded among others by #XavierNiel (#Free).
It was pre-trained on #Hélium data, drawn from projects such as #CommonCrawl.
For now it is only available in #English, with several possible #accents; #French 🇨🇵 and #Spanish 🇪🇦 coming soon (? 😍😍)
https://www.01net.com/actualites/moshi-pour-son-patron-le-chatgpt-francais-est-un-exploit.html
-
Publishers Target Common Crawl In Fight Over #AI Training Data
Long-running nonprofit #CommonCrawl has been a boon to researchers for years. But now its role in AI training data has triggered backlash from publishers.
https://www.wired.com/story/the-fight-against-ai-comes-to-a-foundational-data-set/
-
Mark Zuckerberg’s Secret Weapon for AI Is Your Facebook Data https://tech.hindustantimes.com/opinion/mark-zuckerberg-s-secret-weapon-for-ai-is-your-facebook-data-71707243340870.html (originally on Bloomberg at https://www.bloomberg.com/opinion/articles/2024-02-06/zuckerberg-s-plan-for-ai-hinges-on-your-facebook-and-instagram-data)
Frikkin' HATE it when academic mentions don't include a citation. Here's the paper: https://aclanthology.org/2021.acl-short.24/ (What’s in the Box?)
-
Most generative AI models were trained on Common Crawl, a massive archive of web crawl data. Yet most people have never heard of it. My new research studies Common Crawl in depth and highlights its influence on LLM research and development. #commoncrawl #ai #generativeAI #llm #datagovernance #sts https://foundation.mozilla.org/en/research/library/generative-ai-training-data/common-crawl (1/10)