#speechcorpora - Public Fediverse posts
Live and recent posts from across the Fediverse tagged #speechcorpora, aggregated by home.social.
#MOSEL: Multilingual Open-Source European Languages Dataset
• 950,000 hours of #speech data covering 24 official EU languages
• Includes up to 441K hours of unlabeled speech from #VoxPopuli and #LibriLight
• Transcribed using #Whisper large v3 #ASR model
• Covers both labeled and unlabeled #speechcorpora
• Released under #CCBY40 license for #opensource use
• Designed for training #AI #speechrecognition models

Key features:
• Diverse language coverage
• Large-scale dataset
• Open-source compliant
• Includes pseudo-labeled data
• Supports #NLP and #machinelearning research

Learn more: https://huggingface.co/datasets/FBK-MT/mosel?s=09