62 points | by Rendello 16 hours ago
Lowercasing Σ (U+03A3 Greek capital letter sigma) is context-sensitive: at the end of a word it becomes ς (U+03C2 Greek small letter final sigma), and anywhere else it becomes σ (U+03C3 Greek small letter sigma).
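For anyone who wants to poke at the sigma rule: as far as I know, CPython's str.lower() has applied it since 3.3. A minimal sketch; the sample words are my own picks, not from the post.

```python
# Lowercase a few Greek strings and show which sigma each one ends up with.
words = ["ΟΔΟΣ", "ΣΑΣ", "Σ"]
lowered = [w.lower() for w in words]

for word, low in zip(words, lowered):
    # Print code points so the two visually similar sigmas can be told apart.
    print(word, "->", low, [f"U+{ord(c):04X}" for c in low])
```

A word-final Σ comes out as U+03C2, while a leading or isolated Σ comes out as U+03C3.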
For those interested, this was the generation script. I'm sure there was a way to do it better or simpler, and I wish I could just say this was a quick-and-dirty script, but in fact I spent quite a few hours on it (this is the fourth rewrite):
https://gist.github.com/rendello/b06ca3d976d26fa011897bd1603...
The "ﬀ" ligature (U+FB00), for example, is uppercased by Python as "FF", so it becomes two separate characters and, in UTF-8, ends up one byte smaller overall. I hope it's interesting.
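The ligature case is quick to check by hand. A small sketch, assuming the character meant is U+FB00 (LATIN SMALL LIGATURE FF) and that the byte count refers to UTF-8:

```python
ligature = "\ufb00"  # LATIN SMALL LIGATURE FF (assumed to be the character meant)
upper = ligature.upper()

# Compare code point counts and UTF-8 byte lengths before and after uppercasing.
sizes = (len(ligature.encode("utf-8")), len(upper.encode("utf-8")))
print(repr(upper), len(ligature), len(upper), sizes)
```

One code point becomes two, while the UTF-8 encoding shrinks from three bytes to two.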
Not sure I wanted to know...
For upper → lower → upper we have:
Ω ω Ω
İ i̇ İ
K k K
Å å Å
ẞ ß SS
ϴ θ Θ
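A list like the one above can be regenerated with a brute-force scan. This is only a sketch, not the script from the gist: the function name is made up, the isupper() filter is my own choice, and the exact results depend on the Unicode version your Python ships with.

```python
import sys

def upper_roundtrip_failures():
    """Collect (char, lowered, re-uppercased) for uppercase characters
    that do not come back unchanged after lower() then upper()."""
    failures = []
    for cp in range(sys.maxunicode + 1):
        c = chr(cp)
        if not c.isupper():
            continue  # only consider characters Python treats as uppercase
        low = c.lower()
        back = low.upper()
        if back != c:
            failures.append((c, low, back))
    return failures

failures = upper_roundtrip_failures()
for c, low, back in failures:
    print(f"U+{ord(c):04X}", c, low, back)
```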
For lower → upper → lower there are a lot more.
1. https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
But yeah, I got the idea from GP's "ff" example, but I'm kinda shocked there are so many.
It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.
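To make that concrete, here is a tiny sketch showing that the three lookalike capitals are separate code points with separate lowercase forms (the variable names are just for illustration):

```python
import unicodedata

lookalikes = ["\u0391", "\u0041", "\u0410"]  # Greek Α, Latin A, Cyrillic А
names = [unicodedata.name(c) for c in lookalikes]
lowered = [c.lower() for c in lookalikes]

for c, name, low in zip(lookalikes, names, lowered):
    print(f"U+{ord(c):04X}", name, "->", f"U+{ord(low):04X}")
```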
Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between both, so they would get mixed up in practice anyway.
There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.
(Pick a year, then think about why it didn't happen in that year.)
In ISO-8859-9, FD/49 are lower/upper dotless ı/I.
DD/69 are upper/lower dotted İ/i.
Nothing about being able to round-trip that through Unicode required 49 in ISO-8859-9 to be assigned the same Unicode code point as 49 in ISO-8859-1 just because they happen to be visually identical.
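Those byte values can be checked against Python's bundled codecs. A short sketch, assuming the hex values refer to ISO-8859-9 (Python codec name iso8859_9):

```python
# The four bytes discussed: dotless lower/upper, dotted upper/lower.
turkish_bytes = bytes([0xFD, 0x49, 0xDD, 0x69])
decoded = turkish_bytes.decode("iso8859_9")
latin1_I = bytes([0x49]).decode("iso8859_1")

print([f"U+{ord(c):04X}" for c in decoded])
# Byte 0x49 lands on the same code point in both encodings.
print(decoded[1] == latin1_I)
# Encoding back reproduces the original bytes.
roundtrip_ok = decoded.encode("iso8859_9") == turkish_bytes
print(roundtrip_ok)
```

Byte 0x49 decodes to U+0049 from either encoding, which is exactly the sharing being questioned here.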
I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.
This Stack Overflow question has more details, but apparently Turkish i and I do not have their own Unicode code points, which is why this ends up gnarly.
https://stackoverflow.com/questions/48067545/why-does-unicod...
In Turkish:
• Lowercase dotted I ("i") maps to uppercase dotted I ("İ")
• Lowercase dotless I ("ı") maps to uppercase dotless I ("I")
In English, uppercase dotless I ("I") maps to lowercase dotted I ("i"), because those are the only kinds we have.
Ew! So it's a conflict of language behavior. There's no "correct" way to handle this unless you know which language is currently in use!
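Python's str.upper() and str.lower() don't take a language argument, so the Turkish mapping has to be layered on top. A rough sketch of what that could look like; turkish_upper and turkish_lower are hypothetical helpers, not a library API, and they ignore edge cases such as already-decomposed input.

```python
DOTTED_UPPER = "\u0130"   # İ
DOTLESS_LOWER = "\u0131"  # ı

def turkish_upper(text):
    # Map i -> İ and ı -> I first, then let the default rules handle the rest.
    return text.replace("i", DOTTED_UPPER).replace(DOTLESS_LOWER, "I").upper()

def turkish_lower(text):
    # Map İ -> i and I -> ı first, then let the default rules handle the rest.
    return text.replace(DOTTED_UPPER, "i").replace("I", DOTLESS_LOWER).lower()

word = "\u0131s\u0131rgan iyi"  # "ısırgan iyi"
print(turkish_upper(word))
print(turkish_lower(turkish_upper(word)) == word)
# The default, language-unaware mapping loses the dotless i on a round trip.
print(word.upper().lower() == word)
```

With the language-aware helpers the round trip holds; with the default mapping, ı comes back as i.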
Even if you were to start over, I'm not convinced that using different Unicode code points would have been the right solution, since the rest of the alphabet is the same.
Falsehoods programmers believe about strings...
Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.
Bruce Schneier in 2000 on Unicode security risks:
https://www.schneier.com/crypto-gram/archives/2000/0715.html...
Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.
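A crude way to see what a homoglyph attack looks like at the code point level. This is a heuristic sketch only: scripts_used is a made-up helper that guesses the script from the first word of each character's Unicode name, which is not how real IDN checks work, and the spoofed string is my own example.

```python
import unicodedata

def scripts_used(text):
    """Rough per-letter script guess: first word of the Unicode character name."""
    return {unicodedata.name(c, "UNKNOWN").split()[0] for c in text if c.isalpha()}

real = "paypal"
spoof = "p\u0430ypal"  # second letter is CYRILLIC SMALL LETTER A

print(real == spoof)
print(sorted(scripts_used(real)))
print(sorted(scripts_used(spoof)))
```

The two strings render almost identically but compare unequal, and the second one mixes scripts.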
My favorite line from Schneier: "Unicode is just too complex to ever be secure".
And, no matter if you love Unicode or not, there's lots of wisdom in there.
When design-by-committee gives birth to something way too complex, insecurity is never far behind.
[...]
> When design-by-committee gives birth to something way too complex, insecurity is never far behind.
Human writing is (and has historically been) a "clusterfuck". Any system that's designed to encode every single known human writing system is bound to be way too complex.
I almost always side with blaming systems that are too complex or insecure by design as opposed to blaming the users (the canonical example being C++), but in the case of Unicode there's no way to make a simpler system; we'll keep having problems until people stop treating Unicode text as something that works more or less like English or Western European text.
In other words: if your code is doing input validation over an untrusted Unicode string in the year of our Lord 2024, no one is to blame but yourself.
(That's not to say the Unicode committee didn't make some blunders along the way -- for instance the Han unification was heavily criticized -- but those have nothing to do with the problems described by Schneier).
Unicode is complex because capturing all the world's writing systems into a single system is categorically complex. Because human meatspace language is complex.
And even then if you decided to “rewrite the world's language systems themselves” to conform to a simpler system it too would eventually evolve right back into the clusterfuck that is the world's languages.
It’s inescapable. You cannot possibly corral however many billion people live on this planet into something less complex. Humans are too complex and the ideas and emotions they need to express are too complex.
The fact that Unicode does as good of a job as it does and has stuck around for so long is a pretty big testament to how well designed and versatile it is! What came before it was at least an order of magnitude worse and whatever replaces it will have to be several orders of magnitude better.
Whatever drives a Unicode replacement would have to demonstrate a huge upset to how we do things… like having to communicate with intelligent life on other planets or something, and even then they probably have just as big of a clusterfuck as Unicode to represent whatever their writing system is. And even then Unicode might be able to support it!
>>> 'ẞ'.lower().upper()
'SS'
https://github.com/rendello/layout/issues/8#issuecomment-235...
https://devblogs.microsoft.com/oldnewthing/20241031-00/?p=11...
There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.
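On the single-character-to-cluster cases: a small sketch with two characters I picked myself (İ and ǰ), showing that the case mapping yields a base letter plus a combining mark, and that NFC normalization puts the result back into one code point.

```python
import unicodedata

dotted_I = "\u0130"  # İ
low = dotted_I.lower()   # 'i' followed by U+0307 COMBINING DOT ABOVE
back = low.upper()       # 'I' followed by U+0307, not the original single code point
nfc_back = unicodedata.normalize("NFC", back)

j_caron = "\u01f0"  # ǰ
up = j_caron.upper()     # 'J' followed by U+030C COMBINING CARON
nfc_down = unicodedata.normalize("NFC", up.lower())

for s in (low, back, up):
    print([f"U+{ord(c):04X}" for c in s])
print(nfc_back == dotted_I, nfc_down == j_caron)
```

So at least for these two, the round trip fails at the code point level but survives if you compare after NFC normalization.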