On Sat, Mar 19, 2016 at 11:42 PM, Marko Rauhamaa <ma...@pacujo.net> wrote:
>> The problem is not so much the existence of combining characters, but that
>> *some* but not all accented characters are available in two forms: a
>> composed single code point, and a decomposed pair of code points.
>
> Also, is an a with ring on top and another ring on bottom the same
> character as an a with ring on bottom and another ring on top?

Unicode has an answer for this one. It's called normalization, and
actually it doesn't quite go as far as I thought, but it does at least
solve this exact question.

>>> import unicodedata
>>> print(ascii(unicodedata.normalize("NFC","a\u0325\u030a")))
'\u1e01\u030a'
>>> print(ascii(unicodedata.normalize("NFC","a\u030a\u0325")))
'\u1e01\u030a'
>>> print(ascii(unicodedata.normalize("NFD","a\u0325\u030a")))
'a\u0325\u030a'
>>> print(ascii(unicodedata.normalize("NFD","a\u030a\u0325")))
'a\u0325\u030a'

So yes, they are the same combined character. Whether you ask for the
composed form or the decomposed form, you get the exact same sequence
of codepoints from either initial ordering - either this:

    'a' LATIN SMALL LETTER A
    '\u0325' COMBINING RING BELOW
    '\u030a' COMBINING RING ABOVE

or this:

    '\u1e01' LATIN SMALL LETTER A WITH RING BELOW
    '\u030a' COMBINING RING ABOVE

but never this:

    '\xe5' LATIN SMALL LETTER A WITH RING ABOVE
    '\u0325' COMBINING RING BELOW

which will normalize to either of the above.

I had been of the belief that NFC/NFD normalization would *always*
provide a canonical ordering for the combining characters, but
apparently only some are affected:

>>> print(ascii(unicodedata.normalize("NFC","q\u0303\u0301")))
'q\u0303\u0301'
>>> print(ascii(unicodedata.normalize("NFC","q\u0301\u0303")))
'q\u0301\u0303'

(And NFK[CD] doesn't change this either.) But if you're really worried
about these kinds of equivalencies, you could write your own
"super-normalize" function which first NFKD normalizes, then sorts
each run of combining characters into codepoint order, and finally
NFKC or NFKD normalizes to canonicalize everything.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
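
For what it's worth, here is a minimal sketch of the "super-normalize" idea described above. The names `super_normalize` and `form` are made up for illustration; nothing like this exists in the standard library. It treats any character with a nonzero canonical combining class (as reported by `unicodedata.combining`) as a combining character. That same value is the likely reason the two "q" strings are left alone: normalization only reorders marks whose combining classes differ, and U+0303 and U+0301 are both class 230, whereas U+0325 is class 220 and U+030A is class 230.

```python
import unicodedata


def super_normalize(text, form="NFKC"):
    """Hypothetical helper: normalize more aggressively than Unicode does.

    Steps: NFKD-decompose, sort every run of combining characters by
    code point, then apply the requested final form (NFKC or NFKD).
    """
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    run = []  # combining characters seen since the last base character
    for ch in decomposed:
        if unicodedata.combining(ch):  # nonzero canonical combining class
            run.append(ch)
        else:
            out.extend(sorted(run))
            run.clear()
            out.append(ch)
    out.extend(sorted(run))  # flush a trailing run, if any
    # The final pass reapplies canonical ordering and (for NFKC) composition.
    return unicodedata.normalize(form, "".join(out))


# Both orderings of tilde and acute on "q" now compare equal.
print(ascii(super_normalize("q\u0303\u0301")))  # 'q\u0301\u0303'
print(ascii(super_normalize("q\u0301\u0303")))  # 'q\u0301\u0303'

# The two-ring example still comes out the same as plain NFC.
print(ascii(super_normalize("a\u030a\u0325")))  # '\u1e01\u030a'
```

One caveat: Unicode deliberately treats differently ordered marks of the same combining class as distinct, since they can stack and render differently, so this is only suitable for loose matching (search keys, deduplication), not for text you intend to display or store.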