#common-corpus — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #common-corpus, aggregated by home.social.
-
An Open Training Set For AI Goes Global
https://web.brid.gy/r/https://www.techdirt.com/2026/03/24/an-open-training-set-for-ai-goes-global/
-
Common Corpus, an open training set for AI, goes global – and so should support for it
As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs), massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work […]
#aiAlliance #commonCorpus #curation #euAiAct #financeCommons #france #gdpr #github #legalCommons #llms #multilingual #openCulture #openGovernment #openScience #openSource #openWeb #pdf #permissiveLicensing #pleias #publicDomain #scraping #tokens #toxicity #wikimedia #youtube https://walledculture.org/common-corpus-an-open-training-set-for-ai-goes-global-and-so-should-support-for-it/ -
How do AIs feed on pirated books?
https://fed.brid.gy/r/https://korben.info/ia-entrainement-donnees-piratees-books3-common-cor.html
-
The very first order of business would be to rely on free cultural works, consensually released, as the source of training data; projects such as #CommonCorpus are a step in the right direction. Anything else is a copyright nightmare in the making, not to mention the ethical implications of straining fair use to the detriment of the commons. -
@xolotl @creativecommons
OK, as an end run around the legal problems of LLMs' training corpora, it's a start - but there are jurisdictions (with a strong authors' rights tradition) in which even PD works are legally owed a form of attribution ("PD is basically CC BY" in those places).
So the #CommonCorpus isn't a global legal solution. -
happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, & not also built from openly licensed works (eg, shared with @creativecommons licenses — which are open, but still copyrighted and not PD) https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613