Steven D'Aprano <st...@pearwood.info>:

> As usual, Unicode problems are generally due to backwards
> compatibility. Blame the old legacy encodings, which invented the
> "dead keys" a.k.a. "combining character" technique. Of course, they
> had a reasonable excuse at the time, but Unicode's requirement of
> being able to losslessly handle all legacy character set standards
> means that Unicode has to provide the same functionality.

The combining characters allow for a maze of twisty little combinations,
all alike. There's no limit to the number of diacritics you can pile on,
under and next to the base character.

Was that universality unavoidable? Maybe it was. Deep down, all scripts
are two-dimensional.

> The problem is not so much the existence of combining characters, but that
> *some* but not all accented characters are available in two forms: a
> composed single code point, and a decomposed pair of code points.

Also, is an a with ring on top and another ring on bottom the same
character as an a with ring on bottom and another ring on top?

> This adds complexity and means that equality of characters is not
> well-defined. (Hence Unicode punts on the whole "character" thing and
> just talks about code points.)

The problem is not theoretical. If I implement a web form and someone
enters "Aña" as their name, how do I make sure queries find the name
regardless of the Unicode code point sequence? I have to normalize using
unicodedata.normalize().

When glorifying Python's advanced Unicode capabilities, are we careful
to emphasize the necessity of unicodedata.normalize() everywhere? Should
Python normalize strings unconditionally and transparently? What does
the O(1) character lookup mean under normalization?

Some weeks ago I had to spend 30 minutes debugging my Python program
when a user complained it didn't work. Turns out they had accidentally
invoked the program using a space and a combining tilde instead of the
ASCII ~. There was no visual indication of a problem on the screen, but
the Python program acted up.

Marko
-- 
https://mail.python.org/mailman/listinfo/python-list
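
P.S. A minimal sketch of the cases raised in the post, assuming NFC is the
form chosen for storage and comparison (NFD would serve equally well for
equality tests). The helper names canonical() and has_combining_marks() are
hypothetical, introduced only for illustration; the unicodedata calls are
the standard library's.

```python
import unicodedata

# Two spellings of the same name; they render identically.
composed = "A\u00f1a"      # U+00F1 LATIN SMALL LETTER N WITH TILDE
decomposed = "An\u0303a"   # "n" followed by U+0303 COMBINING TILDE


def canonical(text: str) -> str:
    """Hypothetical helper: bring text into one canonical form (NFC)."""
    return unicodedata.normalize("NFC", text)


def has_combining_marks(text: str) -> bool:
    """Hypothetical helper: True if any code point is a combining mark."""
    return any(unicodedata.combining(ch) for ch in text)


# Raw comparison is by code point, so the two spellings differ.
print(composed == decomposed)                        # False
print(len(composed), len(decomposed))                # 3 4

# After normalization they compare equal.
print(canonical(composed) == canonical(decomposed))  # True

# Ring above (U+030A) and ring below (U+0325) have different combining
# classes, so canonical ordering makes both input orders equivalent.
ring_top_first = "a\u030a\u0325"
ring_bottom_first = "a\u0325\u030a"
print(canonical(ring_top_first) == canonical(ring_bottom_first))  # True

# Normalization does not turn space + combining tilde into ASCII "~",
# but the stray combining mark can at least be detected.
lookalike = " \u0303"
print(canonical(lookalike) == "~")                   # False
print(has_combining_marks(lookalike))                # True
```

For the web form, this would mean applying canonical() both when a name is
stored and when a query arrives. For the command-line incident, normalizing
would not have helped; rejecting or warning on has_combining_marks() in the
argument is one possible guard, at the cost of refusing legitimately
decomposed input.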