home.social

#pleias — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #pleias, aggregated by home.social.

  1. Common Corpus, an open training set for AI, goes global – and so should support for it

    As many of the AI stories on Walled Culture attest, one of the most contentious areas in the latest stage of AI development concerns the sourcing of training data. To create high-quality large language models (LLMs) massive quantities of training data are required. In the current genAI stampede, many companies are simply scraping everything they can off the Internet. Quite how that will work […]

    #aiAlliance #commonCorpus #curation #euAiAct #financeCommons #france #gdpr #github #legalCommons #llms #multilingual #openCulture #openGovernment #openScience #openSource #openWeb #pdf #permissiveLicensing #pleias #publicDomain #scraping #tokens #toxicity #wikimedia #youtube walledculture.org/common-corpu
  2. > Today, we are announcing #Amazon, #Meta, #Microsoft, #mistralai , and #Perplexity for the first time as they join our roster of partners, which includes #Google, #Ecosia, #Nomic, #Pleias, #ProRata, and #ReefMedia. All these organizations utilize #WikimediaEnterprise to integrate human-governed knowledge into their platforms at scale. By doing so, they help ensure that the work of our global volunteer community reaches billions of people with the accuracy and transparency that Wikipedia represents.

    And that is good news for me.

    #wikimedia #wikipedia #ai

    enterprise.wikimedia.com/blog/

  3. Really happy to see a new #copyleft -based #LLM , and this one seems to be more general-purpose than former attempts such as #PleIAs. The #Comma model is trained with #CommonPile, a new training pile with 8 TB of public domain and copyleft data. huggingface.co/papers/2506.052…
