home.social

#utf — Public Fediverse posts

Live and recent posts from across the Fediverse tagged #utf, aggregated by home.social.

  1. 🆕 blog! “A small collection of text-only websites”

    A couple of years ago, I started serving my blog posts as plain text. Add .txt to the end of any URL and get a deliciously lo-fi, UTF-8, mono[chrome|space] alternative.

    Here's this post in plain text - shkspr.mobi/blog/2025/12/a-sma

    Obviously a webpage…

    👀 Read more: shkspr.mobi/blog/2025/12/a-sma

    #blogging #blogs #text #unicode #utf-8

  2. Recently, we talked about #libid3tag and our intent to make a new release. So far, we have a preview of some changes that have already been made in the latest main:

    - Mojibake fixes for #UTF-16 (no BOM) encoded fields.
    - Some code cleanups, including warning fixes.
    - Compatibility with #CMake > 4.0 (we now require CMake 3.10+)

    Meanwhile, we are also working on #Doxygen documentation to better document the library, so quite a few things are going on for libid3tag right now.

  3. UTF-8 Is Beautiful - It’s likely that many Hackaday readers will be aware of UTF-8, the mechanism for i... - hackaday.com/2025/09/14/utf-8- #softwarehacks #characterset #utf-8

  4. Very cool, copy-paste UTF text from, e.g., Wikipedia, get Unicode.
    Sanskrit अश्विन्
    can be in your HTML as
    &#x905;&#x936;&#x94D;&#x935;&#x93F;&#x928;&#x94D;
    r12a.github.io/app-conversion/
    #UTF #Unicode #conversion
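As an illustration of the kind of conversion the r12a tool performs, here is a minimal Python sketch (not the app's actual code) that escapes each code point of a string as an HTML hex numeric character reference:

```python
# Escape every code point as an HTML hex numeric character
# reference, e.g. "अ" -> "&#x905;".
def to_ncrs(text: str) -> str:
    return "".join(f"&#x{ord(ch):X};" for ch in text)

print(to_ncrs("अश्विन्"))
# Each of the seven code points becomes one &#x...; escape
```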

  5. Why does this PHP construct:

    normalizer_normalize( $search_string, \Normalizer::FORM_D );

    convert ÖÖÖ to OOO, but keep ÅÅÅ as ÅÅÅ ... WTF?! 🤔

    #programming #php #wtf #utf #utf8
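For what it's worth, NFD decomposes both characters. The sketch below uses Python's unicodedata (as a stand-in for PHP's Normalizer, which wraps the same Unicode algorithm) to show that Ö becomes O plus a combining diaeresis and Å becomes A plus a combining ring; whether you then "see" OOO usually depends on whether the combining marks are later stripped or how they render:

```python
import unicodedata

# NFD (canonical decomposition) splits precomposed letters into
# a base character plus combining marks.
assert unicodedata.normalize("NFD", "Ö") == "O\u0308"  # O + combining diaeresis
assert unicodedata.normalize("NFD", "Å") == "A\u030A"  # A + combining ring above
# Note: U+212B ANGSTROM SIGN also decomposes to the same pair:
assert unicodedata.normalize("NFD", "\u212B") == "A\u030A"
```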

  6. This year, the Christmas donation from @sweetgood went to the Umwelt-Treuhandfonds (UTF) (umwelt-treuhandfonds.de/). It funds the lawyers of climate activists who are currently facing massive repression.

    A further €50 went to the KUEÖ e.V., i.e. directly to @AufstandLastGen

    #SWEETGOOD #andersGOOD #LetzteGeneration #Klimaschutz #Schutz #UTF #Spende #spenden

  7. Just lost 3 hours to the charset encoding inferno: my source code is in UTF-8, but the library I use assumes 1 byte per char.
    Add to that, some fonts have only a subset of characters.
    You get a nice mix of UTF-8 characters that may or may not render nicely (depending on whether the first byte is a character present in the font).

    "Sometimes I wonder what's worse between charset encoding and timezones," says the guy who makes clocks and displays...

    #UTF-8 #ISO-8859 #ASCII #Hell
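A minimal Python sketch of the failure mode described above: multi-byte UTF-8 sequences read by code that assumes one byte per character turn into mojibake:

```python
# 'é' is two bytes in UTF-8 (0xC3 0xA9); a decoder that assumes
# one byte per character (here Latin-1) shows each byte separately.
text = "café"
raw = text.encode("utf-8")
garbled = raw.decode("latin-1")
assert garbled == "cafÃ©"
```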

  8. So my former colleague @jstepien is a brilliant engineer / speaker / teacher, but the thing he'll be internet famous for is how websites can't handle his name 🤷‍♂️. wtf-8.stępień.com is really funny, though.

    #encoding #utf #fail

  9. Did you know that apparently completely different strings are interpreted as identical by some tools?

    This is due to redundant UTF-8 encodings of the same Unicode characters.

    Read more below 🧵

    #InfoSec #CyberSecurity #Hacking #Pentesting #UTF #Unicode
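These "redundant" encodings are known as overlong encodings. A hypothetical sketch in Python: a naive two-byte decoder that only extracts the payload bits maps the illegal sequence 0xC0 0xAF to the same code point as the canonical single byte 0x2F ('/'), which is exactly the sort of thing path-traversal filters have been bypassed with. A conforming decoder rejects the overlong form:

```python
# Naive decoder for a two-byte UTF-8 sequence (110xxxxx 10xxxxxx)
# that does not check for overlong forms (for illustration only).
def naive_decode(b1: int, b2: int) -> str:
    return chr(((b1 & 0x1F) << 6) | (b2 & 0x3F))

assert naive_decode(0xC0, 0xAF) == "/"  # overlong form of U+002F
# A strict decoder, like Python's, rejects it:
try:
    b"\xc0\xaf".decode("utf-8")
except UnicodeDecodeError:
    print("rejected")
```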

  10. @Silberwoelfin Well, to be fair: #UTF has only been around for 22 years, so it naturally hasn't been implemented everywhere yet.

    *ducks*

  11. The #LetzteGeneration @AufstandLastGen is supported with its legal costs (#Rechtskosten) by the Umwelt-Treuhandfonds (#UTF). Anyone who wants to push back against the repression of the activists can do so here in a particularly pain-relieving way.

    »The Umwelt-Treuhandfonds (#UTF) was founded in 2021 to financially support climate and environmental activists in legal matters. Criminal proceedings, preventive detention or demonstration bans: through their diverse forms of protest, the activists take personal and legal consequences upon themselves. The Umwelt-Treuhandfonds ensures that the activists' constitutionally guaranteed rights are upheld in proceedings and that the consequences of their actions are minimized through competent legal representation.«

    umwelt-treuhandfonds.de/spende

  12. Another interesting portability problem: the #UTF-16, UTF-32, UCS-2 and UCS-4 encodings are byte order dependent. That means they can be encoded either as big endian or as little endian. When encoding strings, #Python uses the system byte order and prepends a Byte Order Mark at the start of the output. When decoding, it automatically reads the previously written BOM to determine the correct byte order, so everything "just works".

    Problems start when we try to compare encoded data at the byte level, e.g. comparing a file previously saved as UTF-16 with the result of calling `encode()`. If the file was saved on a little endian system (as is usually the case) and the tests are run on a big endian system, we suddenly get two different byte strings!

    The "obvious" solution is to force a specific byte order, e.g. by using the `utf-16-le` encoding instead of `utf-16`. But here another problem appears: when we specify a particular byte order, Python no longer writes the BOM, so a byte-level comparison will show a difference in the form of the missing BOM. This can be solved with a simple trick: prepending the BOM (`\ufeff`) to the string being encoded.

    github.com/python/importlib_re

    #portability #unicode #Gentoo

  13. Another curious #portability pitfall: the #UTF-16, UTF-32, UCS-2 and UCS-4 encodings are byte order dependent. That is, they can be encoded as either big endian or little endian. #Python uses the host byte order when encoding, and writes a Byte Order Mark at the beginning of the output. When decoding, it transparently reads the BOM back to determine the byte order, so everything works fine out of the box.

    Problems start happening when you compare the exact byte-level output, e.g. by comparing UTF-16 bytes read from a file with the result of `encode()`. If the file was written on a little endian system (which is commonly the case), and the test is running on a big endian system, you're suddenly going to get different byte strings!

    The "obvious" way to solve this is to force a specific endianness, e.g. use `utf-16-le` rather than plain `utf-16`. However, when you force endianness, the BOM is no longer written, so the byte-level comparison now fails on the missing BOM. The trick is to add the BOM (`\ufeff`) straight into the #unicode string.

    github.com/python/importlib_re

    #Gentoo
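The behavior described above fits in a few lines; this sketch only demonstrates the BOM trick and is not the code from the linked pull request:

```python
import sys

s = "gentoo"
with_bom = s.encode("utf-16")      # host byte order, BOM prepended
fixed_le = s.encode("utf-16-le")   # fixed byte order, no BOM
assert with_bom[:2] in (b"\xff\xfe", b"\xfe\xff")  # BOM is there
# The trick: prepend U+FEFF so the fixed-endian output matches byte for byte
if sys.byteorder == "little":
    assert ("\ufeff" + s).encode("utf-16-le") == with_bom
```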

  14. Chinese/Japanese/Korean characters take more bytes in #UTF-8 encoding than Latin letters. This seems unfair. However, CJK characters represent whole words or syllables, so CJK text in UTF-8 can still take fewer bytes than its English equivalent.

    hsivonen.fi/string-length/#:~:
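A quick check of the claim, using a hypothetical word pair: a CJK character costs three bytes in UTF-8 but often replaces several Latin letters:

```python
# Each CJK ideograph is 3 bytes in UTF-8; each ASCII letter is 1.
assert len("字".encode("utf-8")) == 3
# Six bytes of Chinese vs nine bytes of English for the same phrase:
assert len("谢谢".encode("utf-8")) == 6
assert len("thank you".encode("utf-8")) == 9
```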

  15. @nirvdrum @postmodern But they don't support #UTF-8 character property groups, which can be important if you can't rely on input always being ASCII. The name "Björn" might be an example where this could matter.
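As an illustration of why property-aware matching matters (in Python rather than whatever engine the thread was about): a Unicode-aware `\w` matches "Björn" in full, while ASCII-only matching stops at the non-ASCII letter:

```python
import re

# Unicode-aware word matching covers "ö"
assert re.fullmatch(r"\w+", "Björn") is not None
# ASCII-only matching does not
assert re.fullmatch(r"\w+", "Björn", re.ASCII) is None
```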