home.social

#voicedata — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #voicedata, aggregated by home.social.

  1. 🗣️ Your smart speaker’s always got one ear open.

    It listens 24/7 for wake words like “Hey Alexa”, then, once it hears one, sends recordings to the cloud (yes, sometimes reviewed by real people 👀).

    📌 Hit mute when not in use
    📌 Turn on auto-delete (major companies allow it!)
    📌 Remember: wake words = the mic is always listening locally

    #SmartHome #VoiceData #PrivacyTips #CyberSecurity

  2. For the past couple of years, as each new @mozilla #CommonVoice dataset of #voice #data is released, I've been using @observablehq to visualise the #metadata coverage across the 100+ languages in the dataset.

    Version 17 was released yesterday (big ups to the team - EM Lewis-Jong, @jessie, Gina Moape, Dmitrij Feller) and there are some super interesting insights from the visualisation:

    ➡ Catalan (ca) now has more data in Common Voice than English (en) (!)

    ➡ The language with the highest average audio utterance duration at nearly 7 seconds is Icelandic (is). Perhaps Icelandic words are longer? I suspect so!

    ➡ Spanish (es), Bangla (Bengali) (bn), Mandarin Chinese (zh-CN) and Japanese (ja) all have a lot of recorded utterances that have not yet been validated. Albanian (sq) has the highest percentage of validated utterances, followed closely by Erzya / Arisa (myv).

    ➡ Votic (vot) has the highest percentage of invalidated utterances, at 76%. I wonder if this language has been the target of deliberate invalidation activity (invalidating valid sentences, or recording sentences to be deliberately invalid), given the current geopolitical instability in Russia.

    See the visualisation here and let me know your thoughts below!

    observablehq.com/@kathyreid/mo

    #linguistics #languages #data #VoiceAI #VoiceData #SpeechAI #SpeechData #DataViz
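
The per-language figures mentioned in the post above (share of validated and invalidated utterances, average utterance duration) are simple ratios over clip counts. A minimal sketch of that arithmetic follows, assuming each language comes with a total clip count, validated and invalidated counts, and a total audio duration. The `LocaleStats` class, its field names, the locale codes `xx` and `yy`, and all numbers are hypothetical placeholders for illustration; they are not the actual Common Voice metadata schema or real dataset values.

```python
from dataclasses import dataclass


@dataclass
class LocaleStats:
    """Hypothetical per-language summary; field names are illustrative only."""
    clips: int            # total recorded utterances
    validated: int        # utterances voted valid
    invalidated: int      # utterances voted invalid
    total_seconds: float  # total audio duration across all clips

    def _pct(self, count: int) -> float:
        # Guard against empty languages to avoid dividing by zero.
        return 100.0 * count / self.clips if self.clips else 0.0

    @property
    def pct_validated(self) -> float:
        return self._pct(self.validated)

    @property
    def pct_invalidated(self) -> float:
        return self._pct(self.invalidated)

    @property
    def unvalidated(self) -> int:
        # Clips that have been recorded but not yet voted either way.
        return self.clips - self.validated - self.invalidated

    @property
    def avg_duration(self) -> float:
        return self.total_seconds / self.clips if self.clips else 0.0


# Placeholder numbers, NOT real dataset values.
stats = {
    "xx": LocaleStats(clips=200, validated=150, invalidated=20, total_seconds=900.0),
    "yy": LocaleStats(clips=50, validated=5, invalidated=38, total_seconds=340.0),
}

# Rank languages by share of invalidated utterances, highest first.
ranked = sorted(stats, key=lambda code: stats[code].pct_invalidated, reverse=True)

for code in ranked:
    s = stats[code]
    print(f"{code}: {s.pct_validated:.1f}% validated, "
          f"{s.pct_invalidated:.1f}% invalidated, "
          f"{s.unvalidated} awaiting validation, "
          f"{s.avg_duration:.2f}s average")
```

Sorting on a different property, such as `avg_duration` or `pct_validated`, gives the other rankings the post describes.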