On Fri, 27 Mar 2020 22:24:22 +0000 sylvain.bertr...@gmail.com wrote:

Dear Sylvain,
> On Fri, Mar 27, 2020 at 10:24:52PM +0100, Laslo Hunhold wrote:
> > ... This will cover 99.5% of all cases...
>
> What do you mean? They managed to add in grapheme cluster definition
> some weird edge cases up to 0.5%??

No, Unicode is 100% happy with how libgrapheme splits up text, but
text rendering depends on context. It's not our problem, so don't
worry about it.

> About string comparison: if I recall well, after utf-8 normalization
> (n11n), strings are supposed to be 100% perfect for comparison byte
> per byte.

Be careful there, as there are multiple kinds of normalization. The
only two that are relevant are NFD (full decomposition) and NFC (full
composition). Unicode says that, no matter the normalization, all
forms should be equivalent. To "steal" your example from below, both
'è' and 'e' + '`' are supposed to be equivalent.

In the context of string comparison, you would have to do
normalization (preferably to NFD) and then compare byte by byte, as
you properly mentioned. HOWEVER: There can be more than one modifier
attached to a character, for example 'ǻ', which is 'a' + '°' + '´'.
One could also write it as 'a' + '´' + '°', which is a _huge_
problem, and you can even think of more complex examples. This is why
I'd propose byte-by-byte comparisons after normalization, just to be
sure.

> The more you know: utf-8 n11n got its way in linux filesystems
> support, and that quite recently. This will become a problem for
> terminal based applications. In near future gnu/linux distros, the
> filenames will become normalized using the "right way"(TM) n11n.

Unicode does not mandate a particular normalization. I personally see
composed characters as "legacy" and prefer NFD from this standpoint;
however, it is futile to attempt to mandate such a thing. The only
thing one can do is handle grapheme clusters properly, no matter the
normalization, and do byte-by-byte comparisons.
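As a small illustration of the comparison issue (a sketch using Python's stdlib `unicodedata` purely for demonstration, not libgrapheme): normalization makes 'è' and 'e' + '`' compare equal, but it does not rescue every ordering of multiple combining marks, because '´' (U+0301) and '°' (U+030A) share the same canonical combining class and are therefore never reordered by any normalization form:

```python
import unicodedata

# 'è' precomposed (U+00E8) vs. decomposed 'e' + combining grave (U+0300)
pre = "\u00e8"
dec = "e\u0300"
assert pre != dec  # raw byte/code-point comparison fails
assert unicodedata.normalize("NFD", pre) == \
       unicodedata.normalize("NFD", dec)  # equal after NFD

# 'a' + ring (U+030A) + acute (U+0301) vs. the reversed mark order:
# both marks have canonical combining class 230, so neither NFD nor
# NFC reorders them -- the two spellings stay distinct.
a1 = "a\u030a\u0301"
a2 = "a\u0301\u030a"
assert unicodedata.normalize("NFD", a1) != unicodedata.normalize("NFD", a2)
assert unicodedata.normalize("NFC", a1) != unicodedata.normalize("NFC", a2)
```

So "normalize, then compare bytes" handles the common single-mark cases, while differently ordered same-class mark sequences remain distinct strings.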
File systems enforce normalization so that there won't be two files
with seemingly the same name that differ only in normalization. But
this is not an issue for us in our position as a group of userspace
application developers.

> This "right way"(TM) n11n (there are 2 n11ns) produces only
> non-pre-composed grapheme clusters of codepoints (but in the CJK
> realm, there are exceptions if I recall properly). AFAIK, all
> terminal based applications do expect "pre-composed" grapheme
> codepoints.

Be careful there. A grapheme cluster is a set of one or more code
points. So both 'è' and 'e' + '`' are grapheme clusters, which
libgrapheme detects as such. Many terminal-based applications, like
st, make the wrong assumption that a single code point is always a
grapheme cluster, but a grapheme cluster, as said above, can consist
of more than one code point.

> For instance the french letter 'è' won't be 1 codepoint anymore, but
> 'e' + '`' (I don't recall the n11n order), namely a sequence of 2
> codepoints.

Exactly.

> I am a bit scared because software like ncurses, lynx, links, vim,
> may use the abominations of software we discussed earlier to handle
> all this.

Yes, this is a huge problem. Maybe it's a bit early to talk about
libgrapheme as a solution. I first need to release version 1 and get
it out there into the distros. It's a chicken-and-egg problem really,
but most packagers are very welcoming of suckless software, as it is
so easy to package.

Thanks for your input!

With best regards

Laslo
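P.S.: The grapheme-cluster point above (that a single code point is not always a grapheme cluster) can be sketched with a naive counter in Python's stdlib `unicodedata`. This is a deliberate simplification that only looks at combining marks; a real implementation such as libgrapheme's must follow UAX #29 and handle ZWJ emoji sequences, Hangul jamo, regional indicators, and more:

```python
import unicodedata

def count_graphemes_naive(s: str) -> int:
    # Start a new cluster at every non-combining code point.
    # Deliberately simplified relative to UAX #29: ignores ZWJ
    # sequences, Hangul jamo, regional indicators, etc.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)

assert len("e\u0300") == 2                    # two code points...
assert count_graphemes_naive("e\u0300") == 1  # ...one perceived character
assert count_graphemes_naive("\u00e8") == 1   # precomposed 'è' agrees
```

This is exactly the assumption that breaks in terminals which equate one code point with one cell: the decomposed form occupies two code points but only one user-perceived character.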