UTF-8 characters that behave oddly when the case is changed

(gist.github.com)

62 points | by Rendello 16 hours ago

8 comments

  • LukeShu 4 hours ago
    This doesn't include the oddest of all: sigma.

    When lowercasing Σ (U+03A3 Greek capital letter sigma), the result is context-sensitive: at the end of a word it becomes ς (U+03C2 Greek small letter final sigma), and elsewhere σ (U+03C3 Greek small letter sigma).
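    In Python terms (CPython's str.lower() implements the Final_Sigma rule, so this minimal sketch should reproduce the context sensitivity):

```python
# CPython's str.lower() applies the Unicode Final_Sigma rule: a capital
# sigma lowercases to ς (U+03C2) at the end of a word, σ (U+03C3) elsewhere.
word = "ΟΔΥΣΣΕΥΣ"  # "ODYSSEUS" in Greek capitals
lowered = word.lower()
print(lowered)      # οδυσσευς
print(lowered[3])   # σ  (word-internal)
print(lowered[-1])  # ς  (word-final)
```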

    • Rendello 1 hour ago
      Σ now shows up on my Unicode round-trip horror show ;)

      https://news.ycombinator.com/item?id=42020476

    • Rendello 3 hours ago
      True! This list could more accurately be described as "Unicode codepoints that expand or contract when case is changed in UTF-8", which is exactly what I was testing in my program. I had built a parser that relied on some assumptions I felt were not correct, so I built some tests with this data.

      For those interested, this was the generation script. I'm sure there was a way to do it better or simpler, and I wish I could just say this was a quick-and-dirty script, but in fact I spent quite a few hours on it (this is the fourth rewrite):

      https://gist.github.com/rendello/b06ca3d976d26fa011897bd1603...

    • BoxOfRain 4 hours ago
      That reminds me of the old 'long S'[1] that used to exist in English and survives in the symbol for integration. That worked in a ſimilar way for writing Engliſh: at the ſtart and middle of words you'd use the long s, but not at the end, so you end up with 'poſſeſs' for 'possess'. There were other rules around it too; I think you'd always use the usual S for a capital.

      [1]https://en.wikipedia.org/wiki/Long_s

      • hanche 3 hours ago
        Not only in English. My local newspaper (in Trondheim, Norway) shows its name as Adresſeaviſen on the front page (in a Fraktur font to boot).
  • Rendello 16 hours ago
    I generated this list from a Python script I wrote a few months back for use in property tests in a Rust codebase. It's meant to break parsers that make bad assumptions about UTF-8, like assuming that upper- or lowercasing a character will always result in the character encoding having the same size in bytes, or even that it will result in one character.

    The "ff" ligature, for example, is uppercased by Python as "FF", meaning it both becomes two separate characters, and is one byte smaller overall. I hope it's interesting.
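    A minimal check of that example in Python (nothing project-specific assumed):

```python
lig = "ﬀ"  # U+FB00 LATIN SMALL LIGATURE FF
up = lig.upper()

print(up)                        # FF -- two ordinary characters
print(len(lig), len(up))         # 1 2  -> one codepoint becomes two
print(len(lig.encode("utf-8")))  # 3 bytes as UTF-8
print(len(up.encode("utf-8")))   # 2 bytes -> one byte smaller overall
```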

    • xg15 6 hours ago
      The implication of this is that there are also "roundtrip-unsafe" characters, i.e. flip_case(flip_case(x)) != x, right?

      Not sure I wanted to know...

      • Rendello 1 hour ago
        I know Halloween was yesterday but let's discover this horror together with some terrifying Python[1]! Turns out, yep.

        For upper → lower → upper we have:

        Ω ω Ω

        İ i̇ İ

        K k K

        Å å Å

        ẞ ß SS

        ϴ θ Θ

        For lower → upper → lower there are a lot more.

        1. https://gist.github.com/rendello/4d8266b7c52bf0e98eab2073b38...
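        (For the curious: the upper → lower → upper cases above can be re-derived with a short scan; this is a minimal sketch, not the linked gist.)

```python
# Scan all codepoints for uppercase characters whose
# upper -> lower -> upper round trip doesn't return to the original.
unsafe = [c for c in map(chr, range(0x110000))
          if c.isupper() and c.lower().upper() != c]
print(unsafe)  # includes Ω (U+2126), K (U+212A), Å (U+212B), ẞ, İ, ϴ
```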

        • xg15 58 minutes ago
          This is really cool! Thanks a lot for the effort!

          But yeah, I got the idea from GP's "ff" example, but I'm kinda shocked there are so many.

      • LegionMammal978 6 hours ago
        A standard example here is the Turkish dotless I, which yields "ı" → "I" → "i" with most case-conversion libraries.
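        A minimal Python illustration (Python's built-in case mapping is locale-independent, like most such libraries):

```python
dotless = "ı"  # U+0131 LATIN SMALL LETTER DOTLESS I
print(dotless.upper())          # I (plain ASCII I)
print(dotless.upper().lower())  # i -- the dot appears; no round trip
assert dotless.upper().lower() != dotless
```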
        • Macha 6 hours ago
          It feels like unifying it with the ASCII i is the mistake here. There should have just been four Turkish characters in two pairs, rather than trying to reuse I/i.

          It's not like we insist Α (Greek) is the same as A (Latin) or А (Cyrillic) just because they're visually identical.

          • WorldMaker 5 hours ago
            But even with separate characters, you aren't safe, because the ASCII "unification" isn't just Unicode's fault to begin with; in some cases it is historic/cultural in its own ways. German ß has distinct upper and lower case forms, but it also has a complicated history in which, depending on locale, the upper-case form is "SS" rather than the upper-case form of ß. In many of those same locales the lower-case form of "SS" is "ss", not ß. It doesn't even try to round-trip, and that's sort of intentional/cultural.
            • atoav 2 hours ago
              Uppercase ẞ has only existed since 2017, so before that using SS as a replacement was the correct way of doing things. That is relatively recent when it comes to that kind of change.
          • layer8 4 hours ago
            This stems from the earlier Turkish 8-bit character sets like IBM code page 857, which Unicode was designed to be roundtrip-compatible with.

            Aside from that, it's unlikely that authors writing both Turkish and non-Turkish words would properly switch their input method or language setting between both, so they would get mixed up in practice anyway.

            There is no escape from knowing (or best-guessing) which language you are performing transformations on, or else just leave the text as-is.

          • Arnt 5 hours ago
            When do you think that first mistake happened?

            (Pick a year, then think about why it didn't happen in that year.)

            • Macha 5 hours ago
              When Unicode was being specced out originally I guess. There was more interest in unifying characters at that stage (see also the far more controversial Han unification)
              • Arnt 5 hours ago
                Uh-huh. At that time, roundtrip compatibility with all widely used 8-bit encodings was a major design criterion. Roundtrip meaning that you could take an input string in e.g. ISO 8859-9, convert it to Unicode, convert it back, and get the same string, still usable for purposes like database lookups. Would you have argued to break database lookups at the time?
                • Macha 5 hours ago
                  ISO-8859-9 actually does have what I suggest:

                  FD/49 are lower/upper dotless ı/I

                  DD/69 are upper/lower dotted İ/i.

                  There's nothing about the capability to round-trip through Unicode that required 49 in ISO-8859-9 to be assigned the same Unicode codepoint as 49 in ISO-8859-1 just because they happen to be visually identical.
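                  The byte assignments above can be checked with Python's codecs (a quick sketch):

```python
# ISO-8859-9 (Latin-5) keeps ASCII 0x49/0x69 as plain I/i and puts the
# Turkish dotless/dotted letters in the high range.
for byte, expected in [(b"\x49", "I"), (b"\x69", "i"),
                       (b"\xdd", "İ"), (b"\xfd", "ı")]:
    ch = byte.decode("iso-8859-9")
    assert ch == expected
    assert ch.encode("iso-8859-9") == byte  # byte-level round trip holds
```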

                  • josefx 3 hours ago
                    There is a reason: ISO-8859-9 is an extended ASCII character set. The shared characters are not an accident, they are by definition the same characters. Most ISO character sets follow a specific template with fixed ranges for shared and custom characters. Interpreting that i as anything special would violate the spec.
        • JimDabell 5 hours ago
          The transliteration of this specific character was also involved in a violent attack and suicide:

          https://languagelog.ldc.upenn.edu/nll/?p=73

          • von_lohengramm 4 hours ago
            Hardly fair to call it a murder-suicide. Ramazan killed Emine in self-defense.
            • JimDabell 4 hours ago
              You’re absolutely right, I misremembered the details. Thanks for the correction!
        • Wowfunhappy 4 hours ago
          So, uh, is this actually desirable per the Turkish language? Or is it more-or-less a bug?

          I'm having trouble imagining a scenario where you wouldn't want uppercase and lowercase to map 1-to-1, unless the entire concept of "uppercase" and "lowercase" means something very different in that language, in which case maybe we shouldn't be calling them by those terms at all.

          • dgoldstein0 4 hours ago
            My understanding is it's a bug that the case changes don't round trip correctly, in part due to questionable Unicode design that made the upper and lower case operations language dependent.

            This Stack Overflow question has more details, but apparently Turkish i and I are not their own Unicode code points, which is why this ends up gnarly.

            https://stackoverflow.com/questions/48067545/why-does-unicod...

            • Wowfunhappy 3 hours ago
              Ah, I see the problem now!

              In Turkish:

              • Lowercase dotted I ("i") maps to uppercase dotted I ("İ")

              • Lowercase dotless I ("ı") maps to uppercase dotless I ("I")

              In English, uppercase dotless I ("I") maps to lowercase dotted I ("i"), because those are the only kinds we have.

              Ew! So it's a conflict of language behavior. There's no "correct" way to handle this unless you know which language is currently in use!

              Even if you were to start over, I'm not convinced that using different Unicode code points would have been the right solution, since the rest of the alphabet is the same.

      • automatic6131 4 hours ago
        > flip_case(flip_case(x)) == x

        Falsehoods programmers believe about strings...

      • JadeNB 6 hours ago
        Indeed, the parent already gives one: flip_case(flip_case("ff")) = "ff". (Since it's hard to tell with what I guess is default ligature formation, at least in my browser, the first is an 'ff' ligature and the second is two 'f's.)
      • TacticalCoder 6 hours ago
        > Not sure I wanted to know...

        Oh that's Unicode for you. It's not that they're "roundtrip unsafe", it's just that Unicode is a total and complete clusterfuck.

        Bruce Schneier in 2000 on Unicode security risks:

        https://www.schneier.com/crypto-gram/archives/2000/0715.html...

        Of course the attacks he envisioned materialized, like homoglyph attacks using internationalized domain names.

        My favorite line from Schneier: "Unicode is just too complex to ever be secure".

        And, no matter if you love Unicode or not, there's lots of wisdom in there.

        When design-by-committee gives birth to something way too complex, insecurity is never far behind.

        • moefh 5 hours ago
          > it's just that Unicode is a total and complete clusterfuck

          [...]

          > When design-by-committee gives birth to something way too complex, insecurity is never far behind.

          Human writing is (and has historically been) a "clusterfuck". Any system that's designed to encode every single known human writing system is bound to be way too complex.

          I almost always side with blaming systems that are too complex or insecure by design as opposed to blaming the users (the canonical example being C++), but in the case of Unicode there's no way to make a simpler system; we'll keep having problems until people stop treating Unicode text as something that works more or less like English or Western European text.

          In other words: if your code is doing input validation over an untrusted Unicode string in the year of our Lord 2024, no one is to blame but yourself.

          (That's not to say the Unicode committee didn't make some blunders along the way -- for instance the Han unification was heavily criticized -- but those have nothing to do with the problems described by Schneier).

        • Sharlin 5 hours ago
          How could you ever make it simple given that the problem domain itself is complex as fuck? Should we all just have stuck with code pages and proprietary character encodings? Or just have people unable to use their own languages? Or even to spell their own names? It’s easy for a culturally blind English speaker to complain that text should be simple, must be due to design by committee that it isn’t!
        • SAI_Peregrinus 5 hours ago
          Unicode is worse than design-by-committee. It's a design-by-committee attempt to represent several hundred design-by-culture systems in one unified whole. Design-by-culture is even messier than design-by-committee, since everyone in the culture contributes to the design and there's never a formal specification; you just have to observe how something is used!
        • Arnt 5 hours ago
          Could you try an argument that Unicode is insecure compared to roll-your-own support for the necessary scripts? You may consider "necessary" to mean "the ones used in countries where at least two of Microsoft, Apple and Sun sold localised OSes".
        • cruffle_duffle 4 hours ago
          If you tried to come up with a “lightweight” Unicode alternative, it would almost certainly evolve right back into the clusterfuck that Unicode is. In fact, odds are it would end up even worse.

          Unicode is complex because capturing all the world's writing systems in a single system is categorically complex. Because human meatspace language is complex.

          And even then, if you decided to “rewrite the world's language systems themselves” to conform to a simpler system, it too would eventually evolve right back into the clusterfuck that is the world's languages.

          It’s inescapable. You cannot possibly corral however many billion people live on this planet into something less complex. Humans are too complex and the ideas and emotions they need to express are too complex.

          The fact that Unicode does as good of a job as it does and has stuck around for so long is a pretty big testament to how well designed and versatile it is! What came before it was at least an order of magnitude worse and whatever replaces it will have to be several orders of magnitude better.

          Whatever drives a Unicode replacement would have to demonstrate a huge upset to how we do things… like having to communicate with intelligent life on other planets or something and even then they probably have just as big of a cluster fuck as Unicode to represent whatever their writing system is. And even then Unicode might be able to support it!

    • slome 15 hours ago
      Thanks for the insight. I had never considered this, even though I've researched quite a few oddities in UTF-8 parsing myself over the years. It's the gift that keeps on giving when it comes to ways to break things in software, I find. Time to go over my code again.
    • zahlman 3 hours ago
      Another assumption worth testing is that casing round-trips:

          >>> 'ẞ'.lower().upper()
          'SS'
  • throwaway173920 6 hours ago
    In one of my work projects it was the Turkish İ that gave us trouble. In some case-insensitive text searching code, we matched the lowercase query against the lowercase text, and had to handle cases like that specially to avoid reporting the wrong matching span in the original text, since the lowercase string would have a different length than the uppercase string. This was one of my first real-world projects and opened my eyes a bit to the importance of specifications and standards.
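    A minimal sketch of that bug class in Python (hypothetical strings, not the actual project code):

```python
text  = "İstanbul"
query = "STANBUL"

hay = text.lower()                 # 'i̇stanbul' -- one codepoint LONGER than text
start = hay.find(query.lower())

print(len(text), len(hay))             # 8 9
print(start)                           # 2 in the lowered text...
print(text[start:start + len(query)])  # 'tanbul' -- the wrong span in the original
```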
    • pavel_lishin 5 hours ago
      Can't mention the Turkish case situation without mentioning the actual murder that took place because of it: https://languagelog.ldc.upenn.edu/nll/?p=73
      • Filligree 4 hours ago
        The murder is a tragedy, of course, but I would hesitate to blame the cellphone. There’s overreactions, and then there’s… this.
        • oguz-ismail 4 hours ago
          > I would hesitate to blame the rapist, look at what she was wearing!
          • hinkley 4 hours ago
            Except in this case the phone was the dress.
    • Rendello 2 hours ago
      I had this exact bug with the same character:

      https://github.com/rendello/layout/issues/8#issuecomment-235...

    • johannes1234321 6 hours ago
      In PHP the Turkish locale caused quite some trouble. In some situations a different locale was used for compiling and for runtime while handling "case-insensitive" identifiers, so that sometimes names with an "I" could not be found anymore.
  • D-Coder 5 hours ago
    Raymond Chen's "Old New Thing" blog just commented on a similar issue: What has case distinction but is neither uppercase nor lowercase?

    https://devblogs.microsoft.com/oldnewthing/20241031-00/?p=11...

  • hgs3 4 hours ago
    This isn't "odd" behavior. It's a consequence of using a multibyte encoding scheme. Also, when dealing with case mapping, you can't assume that the character count will remain constant. This is because in Unicode full case mappings can map a character to multiple characters, meaning you might end up with more characters than you started with, regardless of the encoding used.
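    A few of those expanding full mappings, as Python applies them (a quick illustration):

```python
print("ß".upper())  # SS   -- 1 codepoint -> 2
print("ŉ".upper())  # ʼN   -- U+0149 -> U+02BC + N
print("ﬃ".upper())  # FFI  -- 1 codepoint -> 3
```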
  • rmrfchik 6 hours ago
    It's not UTF-8 characters but Unicode.
    • jonhohle 6 hours ago
      If you look at the list, it’s primarily (but not completely) about oddities in their UTF-8 encoding. Most of them appear to be on the boundary of adding additional bytes when the case is changed. That’s not really Unicode’s concern.

      There are also some that appear to change from single characters to grapheme clusters, which would be a Unicode quirk.

    • Rendello 3 hours ago
      In another comment I said that a more accurate title would have been "Unicode codepoints that expand or contract when case is changed in UTF-8", which I think covers it well.
    • Aardwolf 6 hours ago
      The byte-changes listed are for the UTF-8 encoding though, so it's about UTF-8 in that sense
    • Retr0id 5 hours ago
      It's both.
      • zahlman 3 hours ago
        UTF-8 is simply an encoding; "UTF-8 characters" is just not correct use of language. Just like, say, "binary number"; a number has the same value regardless of the base you use to write it, and the base is a scheme for representing it, not a system for defining what a number is. This is a common imprecision in language which I have seen cause serious difficulties in learning concepts properly.
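        A small Python illustration of the codepoint/encoding distinction (analogous to the binary-number point):

```python
ch = "é"                     # one codepoint, U+00E9
print(ch.encode("utf-8"))    # b'\xc3\xa9' -- two bytes in UTF-8
print(ch.encode("latin-1"))  # b'\xe9'     -- one byte in Latin-1
# Same character either way; only the encoded representation differs.
```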
        • Retr0id 2 hours ago
          "unicode codepoint sequences whose codepoint lengths and/or utf8-code-unit-lengths behave oddly when you change their case" would not fit in a HN title, however
  • layer8 3 hours ago
    The canonical source data for this is https://www.unicode.org/Public/16.0.0/ucd/CaseFolding.txt, by the way.
  • zzo38computer 15 hours ago
    I wrote another comment relating to case folding: https://news.ycombinator.com/item?id=41784627