Chris Angelico <ros...@gmail.com>:

> On Sat, Mar 19, 2016 at 8:31 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> Unicode made several (understandable but grave) mistakes along the way:
>>
>> * normalization
>
> Elaborate please? What's such a big mistake here?

Unicode shouldn't have allowed multiple equivalent variants for a string.
Now Python falls victim to:

   >>> '\u006e\u0303' == '\u00f1'
   False

<URL: https://en.wikipedia.org/wiki/Unicode_equivalence>:

   For example, the code point U+006E (the Latin lowercase "n") followed
   by U+0303 (the combining tilde "◌̃") is defined by Unicode to be
   canonically equivalent to the single code point U+00F1 (the lowercase
   letter "ñ" of the Spanish alphabet). Therefore, those sequences should
   be displayed in the same manner, should be treated in the same way by
   applications such as alphabetizing names or searching, and may be
   substituted for each other.


Marko
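
For reference, a minimal sketch of the usual workaround, assuming both strings are brought to the same normalization form (NFC here) before they are compared. `unicodedata.normalize` is part of the standard library; the helper name `canonically_equal` is made up for illustration and is not an existing API.

```python
import unicodedata

decomposed = '\u006e\u0303'   # 'n' followed by U+0303 COMBINING TILDE
precomposed = '\u00f1'        # the single code point 'ñ'


def canonically_equal(a, b):
    """Compare two strings after normalizing both to NFC.

    Hypothetical helper for illustration only.
    """
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


# Plain == compares code points, so the two spellings differ.
print(decomposed == precomposed)                    # False
# After normalization the two spellings compare equal.
print(canonically_equal(decomposed, precomposed))   # True
```

Normalizing once at the input boundary, rather than at every comparison, is a common design choice, but it only helps if every string entering the program passes through that boundary.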