home.social

#common-corpus — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #common-corpus, aggregated by home.social.

  1. Common Corpus, an open training set for AI, goes global – and so should support for it

    As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs), massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work […]

    #aiAlliance #commonCorpus #curation #euAiAct #financeCommons #france #gdpr #github #legalCommons #llms #multilingual #openCulture #openGovernment #openScience #openSource #openWeb #pdf #permissiveLicensing #pleias #publicDomain #scraping #tokens #toxicity #wikimedia #youtube walledculture.org/common-corpu
  2. The very first order of work would be to rely on free cultural works, consensually released, as the source of training data; projects such as #CommonCorpus are a step in the right direction. Anything else is a copyright nightmare in the making, not even considering the ethical implications of straining fair use to the detriment of the commons.
  3. @xolotl @creativecommons
    OK, as an end run around the legal problems of LLMs' training corpora, it's a start - but there are jurisdictions (with a strong authors' rights tradition) in which even PD works are legally owed a form of attribution ("PD is basically CC BY" in those places).
    So the #CommonCorpus isn't a global legal solution.

  4. Happy to see that the #CommonCorpus shared today as a "public domain" dataset for training #AI is built only from PD materials, and not also from openly licensed works (e.g., those shared with @creativecommons licenses, which are open but still copyrighted and not PD) huggingface.co/collections/Ple