On Fri, Mar 27, 2020 at 10:24:52PM +0100, Laslo Hunhold wrote:
> ... This will cover 99.5% of all cases...

What do you mean? Did they manage to put weird edge cases, up to 0.5%, into the grapheme cluster definition??

About string comparison: if I recall correctly, after UTF-8 normalization (n11n), strings are supposed to be 100% safe to compare byte for byte.

The more you know: UTF-8 n11n found its way into Linux filesystem support, and quite recently. This will become a problem for terminal-based applications. In near-future GNU/Linux distros, filenames will be normalized using the "right way"(TM) n11n. This "right way"(TM) n11n (there are 2 canonical n11ns, NFC and NFD) produces only non-precomposed grapheme clusters, i.e. sequences of code points (but in the CJK realm there are exceptions, if I recall correctly).

AFAIK, all terminal-based applications expect "pre-composed" code points. For instance, the French letter 'è' won't be 1 code point anymore, but 'e' + '`' (if I recall correctly, the base letter comes first, then the combining accent), namely a sequence of 2 code points.

I am a bit scared because software like ncurses, lynx, links and vim may use the abominations of software we discussed earlier to handle all this.

-- 
Sylvain
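
P.S. A minimal sketch of the comparison point above, in Python only because its standard library ships unicodedata.normalize. The helper name same_text is made up for illustration, and choosing NFD as the default form is an assumption, not what any particular filesystem does. It shows the two spellings of 'è' differing as raw bytes and matching once both are normalized to the same form.

```python
import unicodedata

# Two spellings of the same letter 'è'.
precomposed = "\u00e8"   # one code point: LATIN SMALL LETTER E WITH GRAVE
decomposed = "e\u0300"   # two code points: 'e' + COMBINING GRAVE ACCENT


def same_text(a: str, b: str, form: str = "NFD") -> bool:
    """Compare byte for byte after normalizing both strings to `form`.

    `same_text` is a hypothetical helper; `form` is any form accepted by
    unicodedata.normalize ("NFC", "NFD", "NFKC", "NFKD").
    """
    na = unicodedata.normalize(form, a).encode("utf-8")
    nb = unicodedata.normalize(form, b).encode("utf-8")
    return na == nb


if __name__ == "__main__":
    # Without normalization the UTF-8 byte sequences differ.
    print(precomposed.encode("utf-8") == decomposed.encode("utf-8"))
    # After normalization they are identical.
    print(same_text(precomposed, decomposed))
    # Code points of the decomposed form, base letter first.
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", precomposed)])
```

A program that assumes one code point per displayed character would count the decomposed spelling as two, which is the terminal problem described above.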