John Machin wrote:
>> and, for things like u'\u0565\u0582' (ARMENIAN SMALL LIGATURE ECH
>> YIWN), it does not even work.
>
> Sorry, I don't understand.
> 0565 is stand-alone ECH
> 0582 is stand-alone YIWN
> 0587 is the ligature.
> What doesn't work? At first guess, in the absence of an Armenian
> informant, for pre-matching normalisation, I'd replace 0587 by the two
> constituents -- just like 00DF would be expanded to "ss" (before
> upshifting and before not caring too much about differences caused by
> doubled letters).

Looking at the UnicodeData file helps here:

00DF;LATIN SMALL LETTER SHARP S;Ll;0;L;;;;;N;;German;;;
0587;ARMENIAN SMALL LIGATURE ECH YIWN;Ll;0;L;<compat> 0565 0582;;;;N;;;;;

So U+0587 is a compatibility character for the sequence U+0565, U+0582.
Not sure what the rationale for *this* compatibility character is, but
in many cases, they are in Unicode only for compatibility with some
existing encoding - if they had gone through proper unification, they
would not have been introduced as separate characters.

In many cases, ligature characters exist for typographical reasons;
other examples are

FB00;LATIN SMALL LIGATURE FF;Ll;0;L;<compat> 0066 0066;;;;N;;;;;
FB01;LATIN SMALL LIGATURE FI;Ll;0;L;<compat> 0066 0069;;;;N;;;;;
FB02;LATIN SMALL LIGATURE FL;Ll;0;L;<compat> 0066 006C;;;;N;;;;;
FB03;LATIN SMALL LIGATURE FFI;Ll;0;L;<compat> 0066 0066 0069;;;;N;;;;;
FB04;LATIN SMALL LIGATURE FFL;Ll;0;L;<compat> 0066 0066 006C;;;;N;;;;;

In these cases, it is the font designers who want to have code points
for these characters: the glyphs of the ligature cannot be
automatically derived from the glyphs of the individual characters.
I can only guess that the issue with that Armenian ligature is similar.

Notice that the issue of U+00DF is entirely different: it is a
character on its own, not a ligature. That a common transliteration
for this character exists is again a different story.

Now, as to what might not work: while compatibility decomposition
(NFKD) converts \u0587 to \u0565\u0582, the reverse process is not
supported. This is intentional, of course: there is no "canonical"
compatibility character for every decomposed code point.

Regards,
Martin
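
P.S. For anyone who wants to check the points above from Python, here
is a minimal sketch using the standard unicodedata module. It assumes
Python 3 string literals, and expand_compat is a hypothetical helper
name chosen for illustration, not something from this thread.

```python
import unicodedata

LIGATURE = "\u0587"      # ARMENIAN SMALL LIGATURE ECH YIWN
PAIR = "\u0565\u0582"    # stand-alone ECH followed by stand-alone YIWN


def expand_compat(text):
    """Hypothetical helper: apply compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", text)


# The decomposition field of UnicodeData, as exposed by the module.
print(unicodedata.decomposition(LIGATURE))  # <compat> 0565 0582
# U+00DF has an empty decomposition field, so this prints an empty line.
print(unicodedata.decomposition("\u00df"))

# NFKD expands the ligature into its two constituents ...
print(expand_compat(LIGATURE) == PAIR)                  # True
# ... but NFKC leaves the pair alone; it does not rebuild the ligature.
print(unicodedata.normalize("NFKC", PAIR) == LIGATURE)  # False
```

So for pre-matching normalisation, running both sides through NFKD
should make the ligature and the two-letter spelling compare equal,
while U+00DF still needs separate handling if "ss" is wanted.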