On Fri, 27 Mar 2020 22:24:22 +0000
sylvain.bertr...@gmail.com wrote:

Dear Sylvain,

> On Fri, Mar 27, 2020 at 10:24:52PM +0100, Laslo Hunhold wrote:
> > ... This will cover 99.5% of all cases...  
> 
> What do you mean? They managed to add in grapheme cluster definition
> some weird edge cases up to 0.5%??

No, Unicode is 100% happy with how libgrapheme splits up text; the
remaining cases are about text rendering, which depends on context.
That's not our problem, so don't worry about it.

> About string comparison: if I recall well, after utf-8 normalization
> (n11n), strings are supposed to be 100% perfect for comparison byte
> per byte.

Be careful there, as there are multiple kinds of normalization. The
two relevant ones are NFD (full decomposition) and NFC (full
composition). Unicode says that, no matter the normalization, all forms
should be equivalent. To "steal" your example from below, both 'è' and
'e' + '`' are supposed to be equivalent.
In the context of string comparison, you would have to do normalization
(preferably to NFD) and then compare byte by byte, as you properly
mentioned. HOWEVER: a character can carry more than one modifier,
for example 'ǻ', which is 'a' + '´' + '°'. One could also write it as
'a' + '°' + '´', which is a _huge_ problem, and you can think of even
more complex examples. This is why I'd propose byte-by-byte
comparisons, just to be sure.
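
To make the byte-level difference concrete, here is a minimal sketch
in plain C (no library needed); the string literals are simply the
UTF-8 encodings of the precomposed and the decomposed form of 'è':

	#include <stdio.h>
	#include <string.h>

	int
	main(void)
	{
		/* U+00E8 LATIN SMALL LETTER E WITH GRAVE (NFC form) */
		const char *nfc = "\xC3\xA8";
		/* 'e' + U+0300 COMBINING GRAVE ACCENT (NFD form) */
		const char *nfd = "e\xCC\x80";

		printf("NFC is %zu bytes, NFD is %zu bytes\n",
		       strlen(nfc), strlen(nfd));
		printf("byte-wise they are %s\n",
		       strcmp(nfc, nfd) == 0 ? "equal" : "different");

		return 0;
	}

The two forms are 2 and 3 bytes long and compare as different on the
byte level, which is exactly why both sides have to be brought to the
same normalization form before the byte-by-byte comparison.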

> The more you know: utf-8 n11n got its way in linux filesystems
> support, and that quite recently. This will become a problem for
> terminal based applications. In near future gnu/linux distros, the
> filenames will become normalized using the "right way"(TM) n11n.

Unicode does not mandate a particular normalization. I personally see
composed characters as "legacy" and prefer NFD from this standpoint,
but it is futile to try to mandate such a thing. The only thing one
can do is handle grapheme clusters properly, no matter the
normalization, and do byte-by-byte comparisons.
File systems enforce normalization so that there won't be two files
with seemingly the same name that differ only in normalization. But
this is not a big deal for us as userspace application developers.

> This "right way"(TM) n11n (there are 2 n11ns) produces only
> non-pre-composed grapheme cluster of codepoints (but in the CJK
> realm, there are exceptions if I recall properly). AFAIK, all
> terminal based applications do expect "pre-composed" grapheme
> codepoint.

Be careful there. A grapheme cluster is a sequence of one or more code
points. So both 'è' and 'e' + '`' are grapheme clusters, which
libgrapheme detects as such. Many terminal-based applications, like st,
make the wrong assumption that a grapheme cluster is always a single
code point, but, as said above, a grapheme cluster can consist of more
than one code point.
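
To illustrate, here is a rough sketch of how iterating over grapheme
clusters could look; the function name and signature
(grapheme_next_character_break_utf8) are my assumption of how the
version-1 API will roughly turn out, so don't take it literally:

	#include <stddef.h>
	#include <stdio.h>
	#include <string.h>

	/* assumed libgrapheme interface: returns the length in bytes of
	 * the next grapheme cluster at the start of str */
	size_t grapheme_next_character_break_utf8(const char *str, size_t len);

	int
	main(void)
	{
		/* 'e' + U+0300 COMBINING GRAVE ACCENT, then '!':
		 * three code points, but only two grapheme clusters */
		const char *s = "e\xCC\x80!";
		size_t off = 0, len = strlen(s), inc;

		while (off < len) {
			inc = grapheme_next_character_break_utf8(s + off, len - off);
			printf("grapheme cluster of %zu byte(s): %.*s\n",
			       inc, (int)inc, s + off);
			off += inc;
		}

		return 0;
	}

This loop reports a 3-byte cluster ('e' + '`') followed by a 1-byte
cluster ('!'), which is precisely the distinction that code-point-based
applications like st currently get wrong.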

> For instance the french letter 'è' won't be 1 codepoint anymore, but
> 'e' + '`' (I don't recall the n11n order), namely a sequence of 2
> codepoints.

Exactly.

> I am a bit scared because software like ncurses, lynx, links, vim,
> may use the abominations of software we discussed earlier to handle
> all this.

Yes, this is a huge problem. Maybe it's a bit early to talk about
libgrapheme as a solution; I first need to release version 1 and get it
out there into the distros. It's a chicken-and-egg problem really, but
most packagers are very welcoming of suckless software, as it is so
easy to package.

Thanks for your input!

With best regards

Laslo
