#common-corpus — Public Fediverse posts on home.social

Techdirt [Unofficial] @[email protected] · 2026-03-24 · 22:37 UTC

An Open Training Set For AI Goes Global

https://web.brid.gy/r/https://www.techdirt.com/2026/03/24/an-open-training-set-for-ai-goes-global/

#commoncorpus #pleias #ai #aitraining #copyright #openlicenseing

Walled Culture @[email protected] · 2026-02-25 · 13:00 UTC

Common Corpus, an open training set for AI, goes global – and so should support for it

As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs), massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work […]

#aiAlliance #commonCorpus #curation #euAiAct #financeCommons #france #gdpr #github #legalCommons #llms #multilingual #openCulture #openGovernment #openScience #openSource #openWeb #pdf #permissiveLicensing #pleias #publicDomain #scraping #tokens #toxicity #wikimedia #youtube https://walledculture.org/common-corpus-an-open-training-set-for-ai-goes-global-and-so-should-support-for-it/

#aialliance #commoncorpus #curation #euaiact #financecommons #france

Le site de Korben [Unofficial] @[email protected] · 2025-12-24 · 16:27 UTC

How do AIs feed on pirated books?

https://fed.brid.gy/r/https://korben.info/ia-entrainement-donnees-piratees-books3-common-cor.html

#actualitesbusinesslegislationjuridique #intelligenceartificielleactualitesia #viepriveeanonymathadopitelechargement #books3 #commoncorpus #copyright

Carlos Solís @[email protected] · 2025-05-29 · 00:28 UTC

The very first order of work would be to rely on free cultural works, consensually released, as the source of training data - projects such as #CommonCorpus being a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications of straining the fair use policy to the detriment of the commons.

#commoncorpus

poritzj @[email protected] · 2024-03-21 · 08:46 UTC

@xolotl @creativecommons
OK, as an end run around the legal problems of LLMs' training corpora, it's a start - but there are jurisdictions (with a strong authors' rights tradition) in which even PD works are legally owed a form of attribution ("PD is basically CC BY" in those places).
So the #CommonCorpus isn't a global legal solution.

#commoncorpus

Nate Angell @[email protected] · 2024-03-21 · 01:45 UTC

happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built from only PD materials, & not also built from openly licensed works (eg, shared with @creativecommons licenses — which are open, but still copyrighted and not PD) https://huggingface.co/collections/PleIAs/common-corpus-65d46e3ea3980fdcd66a5613

#commoncorpus #ai
