#speechcorpora — Public Fediverse posts on home.social

🌍 #MOSEL: Multilingual Open-Source European Languages Dataset

• 📊 950,000 hours of #speech data covering 24 official EU languages
• 🎙️ Includes up to 441K hours of unlabeled speech from #VoxPopuli and #LibriLight
• 🤖 Unlabeled audio automatically transcribed with the #Whisper large v3 #ASR model
• 🏷️ Covers both labeled and unlabeled #speechcorpora
• 📜 Released under #CCBY40 license for #opensource use
• 🧠 Designed for training #AI #speechrecognition models

Key features:
• Diverse language coverage
• Large-scale dataset
• Open-source compliant
• Includes pseudo-labeled data
• Supports #NLP and #machinelearning research

Learn more: https://huggingface.co/datasets/FBK-MT/mosel