On Sat, 19 Mar 2016 11:42 pm, Marko Rauhamaa wrote:

> The problem is not theoretical. If I implement a web form and someone
> enters "Aña" as their name, how do I make sure queries find the name
> regardless of the unicode code point sequence? I have to normalize using
> unicodedata.normalize().

I didn't say that it was theoretical. It is a real problem, but it is a
problem with human languages: the number of characters-with-accents is
vast, possibly impossibly vast. They can't all have unique code points.

I must admit I had completely missed your example of multiple combining
characters; that's a good one. Here's the example again:

    a + combining ring above + combining ring below

versus

    a + combining ring below + combining ring above

Naturally, just comparing them gives unequal:

py> s = "a\u030A\u0325"
py> t = "a\u0325\u030A"
py> s == t
False

But we can normalise them:

==== ============= ============= ================== ==================
Form NFC           NFKC          NFD                NFKD
==== ============= ============= ================== ==================
s    U+1E01,030A   U+1E01,030A   U+0061,0325,030A   U+0061,0325,030A
t    U+1E01,030A   U+1E01,030A   U+0061,0325,030A   U+0061,0325,030A
==== ============= ============= ================== ==================

As you can see, *any* of the normalisation forms will put the code points
into the same, canonical order, making them equal.


> When glorifying Python's advanced Unicode capabilities, are we careful
> to emphasize the necessity of unicodedata.normalize() everywhere? Should
> Python normalize strings unconditionally and transparently? What does
> the O(1) character lookup mean under normalization?
>
> Some weeks ago I had to spend 30 minutes to debug my Python program when
> a user complained it didn't work. Turns out they had accidentally
> invoked the program using a space and a composing tilde instead of the
> ASCII ~. There was no visual indication of a problem on the screen, but
> the Python program acted up.

We recently had somebody here who wrote capital I by pressing the lower
case l on the keyboard. Should a pure-ASCII program be able to operate
without malfunction if the user confuses 0 and O, or I, l and 1? What
about ' and ` or possibly even '' and "?


-- 
Steven
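
P.S. For the web form case, here is a minimal sketch of the idea, assuming
it is enough to normalise both the stored name and the query to the same
form before comparing. The helper name same_text is made up for
illustration; the only real API used is unicodedata.normalize() from the
standard library.

```python
import unicodedata

def same_text(a, b, form="NFC"):
    """Hypothetical helper: compare two strings after normalising both."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

s = "a\u030A\u0325"   # a + combining ring above + combining ring below
t = "a\u0325\u030A"   # a + combining ring below + combining ring above

print(s == t)            # False: the raw code point sequences differ
print(same_text(s, t))   # True: both normalise to U+1E01,030A under NFC

# "Aña" typed with a precomposed n-tilde versus n + combining tilde
print(same_text("A\u00F1a", "An\u0303a"))   # True
```

In a real application you would probably normalise once when the name is
stored and once per query string, rather than at every comparison. Note
that this does nothing about case differences or look-alike characters.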