#utf — Public Fediverse posts
Live and recent posts from across the Fediverse tagged #utf, aggregated by home.social.
-
🆕 blog! “A small collection of text-only websites”
A couple of years ago, I started serving my blog posts as plain text. Add .txt to the end of any URL and get a deliciously lo-fi, UTF-8, mono[chrome|space] alternative.
Here's this post in plain text - https://shkspr.mobi/blog/2025/12/a-small-collection-of-text-only-websites.txt
Obviously a webpage…
👀 Read more: https://shkspr.mobi/blog/2025/12/a-small-collection-of-text-only-websites/
⸻
#blogging #blogs #text #unicode #utf-8
-
Recently, we talked about #libid3tag and our intent to make a new release. So far, we have a preview of some changes that have already been made in the latest main:
- Mojibake fixes for #UTF-16 (no BOM) encoded fields.
- Some code cleanups, including warning fixes.
- Compatibility with #CMake > 4.0 (we now require CMake 3.10+)
Meanwhile, we are also working on #Doxygen documentation to better document the library, so quite a few things are going on for libid3tag right now.
-
“The Best – But Not Good – Way To Limit String Length”, Adam Pritchard (https://adam-p.ca/blog/2025/04/string-length/).
On HN: https://news.ycombinator.com/item?id=43850398
#Programming #PLDI #Strings #Length #Unicode #Characters #Bytes #Graphemes #UTF #CodePoints #I18N #Internationalization
-
UTF-8 Is Beautiful - It’s likely that many Hackaday readers will be aware of UTF-8, the mechanism for i... - https://hackaday.com/2025/09/14/utf-8-is-beautiful/ #softwarehacks #characterset #utf-8
-
Every time I look at #Unicode gotchas, I 😰:
“RFC 9839 And Bad Unicode”, Tim Bray (https://www.tbray.org/ongoing/When/202x/2025/08/14/RFC9839).
On HN: https://news.ycombinator.com/item?id=44995640
On Lobsters: https://lobste.rs/s/qrs9w8/rfc_9839_bad_unicode
#UTF #Encoding #Text #RFC #RFC9839 #Validation #ErrorHandling #UTF8
-
Very cool, copy-paste UTF text from, e.g., Wikipedia, get Unicode.
Sanskrit अश्विन्
can be in your HTML as
अशिवन्
https://r12a.github.io/app-conversion/
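A minimal Python sketch of the kind of conversion the post describes: turning each code point into an HTML numeric character reference. The helper name `to_ncr` and the hex `&#x…;` form are assumptions for illustration; the linked tool offers its own set of output formats.

```python
import html

def to_ncr(text: str) -> str:
    """Hypothetical helper: encode every code point as a hex character reference."""
    return "".join(f"&#x{ord(ch):X};" for ch in text)

word = "अश्विन्"
escaped = to_ncr(word)
print(escaped)

# A browser (or html.unescape) turns the references back into the same text.
assert html.unescape(escaped) == word
```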
#UTF #Unicode #conversion
-
Why does this PHP construct:
normalizer_normalize( $search_string, \Normalizer::FORM_D );
convert ÖÖÖ to OOO, but keep ÅÅÅ as ÅÅÅ ... WTF?! 🤔
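For comparison, a small Python sketch of what canonical decomposition (NFD) should do on its own. It uses Python's `unicodedata` as a stand-in for PHP's `Normalizer`, which is an assumption, and it does not explain the behaviour in the post. NFD only splits a letter into a base letter plus combining marks; getting a bare `O` or `A` needs a separate step that drops the marks (the helper `strip_marks` is hypothetical).

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Hypothetical helper: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

o_nfd = unicodedata.normalize("NFD", "\u00d6")  # Ö
a_nfd = unicodedata.normalize("NFD", "\u00c5")  # Å

# Both letters decompose into base letter + combining mark.
print([hex(ord(ch)) for ch in o_nfd])
print([hex(ord(ch)) for ch in a_nfd])

print(strip_marks("\u00d6\u00c5"))
```

If one letter survives while the other does not, the difference is more likely in the step that removes the marks (for example a character list that covers U+0308 but not U+030A) than in the normalization itself.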
-
This year, @sweetgood's Christmas donation went to the Umwelt-Treuhandfonds (UTF) (https://umwelt-treuhandfonds.de/). It funds the lawyers of climate activists, who are currently facing massive repression.
A further €50 went to KUEÖ e.V., i.e. directly to @AufstandLastGen
#SWEETGOOD #andersGOOD #LetzteGeneration #Klimaschutz #Schutz #UTF #Spende #spenden
-
Just lost 3 hours to the charset encoding inferno: my source code is in UTF-8 but the library I use assumes 1 byte per char.
Add to that, some fonts have only a subset of chars.
You get a nice mix of UTF-8 chars that may or may not render nicely (depending on whether the first byte is a char present in the font).
"Sometimes I wonder what's worse between charset encoding and timezones." says the guy who makes clocks and displays...
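A tiny Python sketch of the mismatch described above, assuming Latin-1 as a stand-in for whatever single-byte charset the library expects (the post does not say which one it is):

```python
source = "é"  # one character in the UTF-8 source file

utf8_bytes = source.encode("utf-8")
print(len(utf8_bytes))  # two bytes for one character

# A consumer that assumes one byte per character sees two separate chars.
misread = utf8_bytes.decode("latin-1")
print(misread)
```

Whether each of those two bytes then renders depends on whether the font has a glyph at that position, which matches the hit-and-miss result in the post.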
-
So my former colleague @jstepien is a brilliant engineer / speaker / teacher, but the thing he'll be internet famous for is how websites can't handle his name 🤷‍♂️. https://wtf-8.stępień.com is really funny, though.
-
Did you know that apparently completely different strings are interpreted as identical by some tools?
This is due to redundant UTF-8 encodings of the same Unicode characters.
Read more below 🧵
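A short Python sketch of one way this can happen, assuming the post refers to overlong encodings (the thread itself is not included here). The bytes `C0 AF` carry the same payload bits as `/`, so a lenient decoder that skips range checks maps them to the same character, while a strict decoder rejects them. The function `naive_decode_2byte` is hypothetical.

```python
def naive_decode_2byte(data: bytes) -> str:
    """Hypothetical lenient decoder: joins payload bits, no range checks."""
    lead, cont = data
    return chr(((lead & 0x1F) << 6) | (cont & 0x3F))

overlong_slash = b"\xc0\xaf"  # overlong two-byte form of "/"

# The lenient decoder treats it as identical to the one-byte "/".
print(naive_decode_2byte(overlong_slash) == "/")

# Python's strict UTF-8 decoder refuses the same bytes.
try:
    overlong_slash.decode("utf-8")
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
print(strict_rejects)
```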
-
Imagine hiding messages in alleged whitespace.
-
@Silberwoelfin Well, to be fair: #UTF has only been around for 22 years - things like that don't get implemented everywhere that quickly.
*ducks*
-
#Development #Introductions
An introduction to character encoding · Unicode and UTF encoding/decoding explained https://ilo.im/15zmvc
#Character #Encoding #Unicode #ASCII #UTF #UTF8 #JavaScript #WebDev #Frontend #Backend
-
The #LetzteGeneration @AufstandLastGen is supported with its legal costs (#Rechtskosten) by the Umwelt-Treuhandfonds (#UTF). Anyone who wants to counter the repression against the activists can do so here, in a particularly pain-relieving way.
"The Umwelt-Treuhandfonds (#UTF) was founded in 2021 to support climate and environmental activists financially in legal matters. Criminal proceedings, preventive detention or bans on demonstrations: through their varied protest, the activists take personal and legal consequences upon themselves. The Umwelt-Treuhandfonds ensures that the activists' rights under the rule of law are upheld in proceedings and that the consequences of their actions are minimized through competent legal representation."
-
LibreOffice writer is rendering the character correctly :neofox_woozy:
#weather #icon #symbol #LibreOffice #LibreOfficeWriter #typography #utf #utf16 #utf32
-
Another interesting portability problem: the #UTF-16, UTF-32, UCS-2 and UCS-4 encodings depend on byte order. This means they can be encoded either as big endian or as little endian. When encoding strings, #Python uses the system's byte order and writes a Byte Order Mark at the beginning of the file. When decoding, it automatically reads the previously written BOM to determine the correct byte order, so everything "just works".
Problems begin when we try to compare encoded data at the byte level, e.g. comparing a file previously saved as UTF-16 with the result of calling `encode()`. If the file was saved on a little endian system (as is usually the case) and the tests are run on a big endian system, it suddenly turns out that we get two different byte strings!
The "obvious" solution is to force a specific byte order, e.g. by using the `utf-16-le` encoding instead of `utf-16`. Here, however, another problem appears: when we specify a byte order, Python no longer writes the BOM, so a byte-level comparison will show a difference in the form of the missing BOM. This can be solved with a simple trick, though: prepending the BOM (`\ufeff`) to the string being encoded.
https://github.com/python/importlib_resources/pull/313/files
-
Another curious #portability pitfall: the #UTF-16, UTF-32, UCS-2 and UCS-4 encodings are byte order dependent. That is, they can be encoded either as big endian or as little endian. #Python uses the host byte order when encoding, and writes a Byte Order Mark at the beginning of the file. When decoding, it transparently reads the BOM back to determine the byte order, so everything works fine out of the box.
Problems start when you compare the exact byte-level output, e.g. by comparing UTF-16 bytes read from a file with the result of `encode()`. If the file was written on a little endian system (which is commonly the case), and the test is running on a big endian system, you're suddenly going to get different byte strings!
The "obvious" way to solve this is to force a specific endianness, e.g. use `utf-16-le` rather than plain `utf-16`. However, when you force endianness, the BOM is no longer written, so the byte-level data now mismatches on the missing BOM. The trick is to add the BOM (`\ufeff`) straight into the #unicode string.
https://github.com/python/importlib_resources/pull/313/files
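A minimal sketch of the trick described above, using only Python's standard codecs. The sample text `"hi"` and the variable names are illustrative and not taken from the linked pull request.

```python
import codecs
import sys

text = "hi"

# Plain "utf-16" uses the host byte order and prepends a matching BOM.
native = text.encode("utf-16")
host_bom = codecs.BOM_UTF16_LE if sys.byteorder == "little" else codecs.BOM_UTF16_BE
print(native.startswith(host_bom))

# Forcing the byte order makes the output stable, but drops the BOM.
forced = text.encode("utf-16-le")
print(forced)

# Putting U+FEFF into the string restores the BOM on every platform.
portable = ("\ufeff" + text).encode("utf-16-le")
print(portable)

# The generic "utf-16" decoder still reads the BOM and strips it.
print(portable.decode("utf-16") == text)
```

The `portable` bytes are the same on little endian and big endian hosts, so they can be compared directly against a fixture file written with a BOM in little endian order.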
-
Chinese/Japanese/Korean characters take more bytes in #UTF-8 encoding than Latin letters. This seems unfair. However, CJK characters represent whole words or syllables, so CJK text in UTF-8 can still take fewer bytes than its English equivalent.
https://hsivonen.fi/string-length/#:~:text=UTF%2D8%20in%20unfair%20to%20CJK
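A quick Python check of the byte counts for one hand-picked word pair. The pair is an illustrative assumption, not taken from the linked article, and a single example does not settle the general claim.

```python
english = "internationalization"
chinese = "国际化"  # the same word in Chinese

english_bytes = len(english.encode("utf-8"))
chinese_bytes = len(chinese.encode("utf-8"))

# Each of these CJK characters takes three bytes, each Latin letter one.
print(len(chinese), chinese_bytes)
print(len(english), english_bytes)
```

Per character the CJK text is three times as expensive, yet the whole word still comes out shorter in bytes.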
-
@nirvdrum @postmodern But they don't support #UTF-8 character property groups, which can be important if you can't rely on input always being ASCII. The name "Björn" might be an example where this could matter.
-
Building Up Unicode Characters One Bit at a Time - The range of characters that can be represented by Unicode is truly bewildering. I... - https://hackaday.com/2023/09/07/building-up-unicode-characters-one-bit-at-a-time/ #peripheralshacks #truetypefont #codepoint #keyboard #unicode #binary #usbhid #glyph #utf-8