On 2014-11-22 02:23, Steven D'Aprano wrote:
> LATIN SMALL LETTER E
> COMBINING CIRCUMFLEX ACCENT
>
> then my application should treat that as a single "character" and
> display it as:
>
> LATIN SMALL LETTER E WITH CIRCUMFLEX
>
> which looks like this: ê
>
> rather than two distinct "characters" eˆ
>
> Now, that specific example is a no-brainer, because the Unicode
> normalization routines will handle the conversion. But not every
> combination of accented characters has a canonical combined form.
> What about something like this?
>
> 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
>
> If I insert a character into my string, I want to be able to insert
> before the w or after the caron, but not in the middle of those
> three code points.
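For the insertion case, one rough approach is to group each base code point with the combining marks that follow it, and only allow insertion between groups. A minimal sketch, assuming "combining mark" means a nonzero unicodedata.combining() class. That is a simplification of the real grapheme-cluster rules in UAX #29 (it ignores ZWJ sequences, Hangul jamo and so on), and the names clusters and insert_at are made up for illustration:

```python
import unicodedata

def clusters(s):
    """Split s into runs of one base code point plus trailing combining marks."""
    out = []
    for ch in s:
        # A nonzero canonical combining class marks a combining character;
        # attach it to the previous run if there is one.
        if out and unicodedata.combining(ch):
            out[-1] += ch
        else:
            out.append(ch)
    return out

def insert_at(s, index, text):
    """Insert text before the cluster at position index (counted in clusters)."""
    parts = clusters(s)
    return ''.join(parts[:index]) + text + ''.join(parts[index:])

w = 'w\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}\N{COMBINING CARON}'
print(len(w), len(clusters(w + 'x')))
print(insert_at(w + 'x', 1, '-') == w + '-x')
```

Here the four code points of w count as one cluster, so the only insertion points are before the w and after the caron.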
Things get even weirder if you have

  '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}\N{COMBINING CARON}'

and when you try to do comparisons like

  s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
  s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
  s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'
  print(s1 == s2)
  print(s1 == s3)
  print(s2 == s3)

Then you also have the case where you want to edit text and the user
wants to remove the COMBINING OGONEK from the character, so you *do*
want to do something akin to

  s4 = ''.join(c for c in s3 if c != '\N{COMBINING OGONEK}')

And yet, weird things happen if you try to remove the circumflex:

  for test in (s1, s2, s3):
      print(test == ''.join(
          c for c in test
          if c != '\N{COMBINING CIRCUMFLEX ACCENT}'
          ))

They all make sense if you understand what's going on under the hood,
but from a visual/conceptual perspective, something feels amiss.

-tkc
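P.S. A sketch of what is going on under the hood, assuming only the standard-library unicodedata module. The raw comparisons fail because the three strings hold different code point sequences, but they are canonically equivalent, so they compare equal once normalized. Decomposing first (NFD) also makes the circumflex removable from s1, where it is otherwise baked into the precomposed character. The helper name strip_mark is hypothetical:

```python
import unicodedata

s1 = '\N{LATIN SMALL LETTER E WITH CIRCUMFLEX}\N{COMBINING OGONEK}'
s2 = 'e\N{COMBINING CIRCUMFLEX ACCENT}\N{COMBINING OGONEK}'
s3 = 'e\N{COMBINING OGONEK}\N{COMBINING CIRCUMFLEX ACCENT}'

# Raw comparison looks at code points, so all three differ.
print(s1 == s2, s1 == s3, s2 == s3)

# NFD decomposes precomposed characters and puts marks in canonical order.
n1, n2, n3 = (unicodedata.normalize('NFD', s) for s in (s1, s2, s3))
print(n1 == n2 == n3)

def strip_mark(s, mark):
    """Remove one combining mark, even if it is part of a precomposed character."""
    decomposed = unicodedata.normalize('NFD', s)
    kept = ''.join(c for c in decomposed if c != mark)
    # Recompose so the result comes back in a predictable form (NFC).
    return unicodedata.normalize('NFC', kept)

for test in (s1, s2, s3):
    print(strip_mark(test, '\N{COMBINING CIRCUMFLEX ACCENT}')
          == '\N{LATIN SMALL LETTER E WITH OGONEK}')
```

The first print shows three False values and everything after it prints True. Whether to store text as NFC or NFD is a separate design choice; the point is just to pick one form before comparing or editing.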