On Fri, 27 Mar 2020 20:58:16 +0000
sylvain.bertr...@gmail.com wrote:

Dear Sylvain,
> On this very mailing list we already had some exchange of thoughts
> about the unicode grapheme cluster.
> One question which was stuck into my head after this exchange was:
> how many of unicode "scripts" can be rendered, in a reasonably
> readable way, in a terminal grid?

Yeah, that's an interesting matter. "Grapheme clusters" are the
"smallest units of script", so one can think of a grapheme cluster as
a single character. It gets a bit more complicated when looking at
things like "मनीष". This is a name consisting of three grapheme
clusters. What's interesting is that the three letters are visually
joined, so text rendering gets really complicated.

I think, though, that it's sufficient for a terminal to be able to
separate a string into grapheme clusters and then pass each one
individually to the text renderer (see the first sketch at the end of
this mail). This will cover 99.5% of all cases.

> That said, it is a brave first step towards "suckless" "i18n" unicode
> software. I got nausea looking at libunistring and the horrible
> gnulib SDK, not to mention the c++ infection of the "official" libicu
> (don't let me start on harfb...): they all deserve a rube goldberg
> award.

Yes, the ecosystem is a huge mess. What's really bad is that the
software exposes the end-user to a lot of unnecessary complexity.
Having thought about this for a few years now, since I started working
on the topic, the following is my opinion on suckless unicode
handling:

1) text comparison: Don't go the Unicode way, compare byte-by-byte.
There are too many edge cases that just make it all suck. The problem
is that within a grapheme cluster, the order of the modifiers doesn't
matter for the final rendering, so Unicode considers certain
permutations of them equivalent. It's a deep, deep rabbit hole if you
go down this path and try to find canonical forms of grapheme
clusters. Just compare stuff byte-for-byte and be done with it.

2) lower/upper-case: Probably one of the worst aspects of all of this.
If you are serious about it, the mappings between lower- and
upper-case are not reversible and they expand or contract the byte
stream: uppercasing the German "ß" yields "SS", but lowercasing "SS"
yields "ss", so a round trip doesn't restore the original. I
personally also don't see the use of it and it's probably not worth
the hassle. The concept of lower/upper-case writing is a very western
one, and most other scripts don't even have it.

3) sorting: Really, really complicated. Unicode has some "defaults",
but also a million different locale-dependent rules one can choose to
apply (hint: you don't want to :P). I'd first go for a "naïve"
byte-by-byte approach, especially because the byte order of UTF-8
matches the codepoint order, but one might look into parsing the
codepoints and working out a proper sorting algorithm later.

An idea for a simple interface would be "grapheme_cmp(const char *,
const char *)" with the same semantics as strcmp. Obviously, given 2),
an equivalent of strcasecmp would not make sense, but one could also
have a "grapheme_ncmp(const char *, const char *, size_t)"; then,
however, there needs to be a discussion whether the size_t means the
number of grapheme clusters or the number of bytes to compare (a
sketch of both functions follows below). As I said above, Unicode
considers permutations of modifiers equivalent, so we might have to go
a bit off the beaten track there and skip this equivalence check.

So that's that. I'll read a bit more and might write some code in this
regard. What I need to note is that Unicode gets more and more
complicated with each version. For instance, I had to implement a
small state machine just to be able to measure the length of a
grapheme cluster, which was not necessary before and forced me to
adapt the API.
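To illustrate both the terminal loop and such a length-measuring
function, here is a minimal sketch. grapheme_len() is a hypothetical
stand-in: as written it only decodes a single UTF-8 sequence, whereas
the real function would keep feeding codepoints into the
break-detection state machine:

#include <stdio.h>
#include <string.h>

/* hypothetical: byte length of the first grapheme cluster in s */
static size_t
grapheme_len(const char *s, size_t len)
{
	size_t i;

	if (len == 0)
		return 0;

	/*
	 * placeholder: only measure the first UTF-8 sequence; the
	 * real version would continue consuming codepoints as long
	 * as the state machine forbids a cluster break between them
	 */
	for (i = 1; i < len && ((unsigned char)s[i] & 0xC0) == 0x80; i++)
		;

	return i;
}

int
main(void)
{
	const char *s = "मनीष";
	size_t len = strlen(s), off, n;

	for (off = 0; off < len; off += n) {
		n = grapheme_len(s + off, len - off);
		/* hand the n bytes at s + off to the text renderer
		 * as one indivisible unit */
		printf("unit of %zu bytes: %.*s\n", n, (int)n, s + off);
	}

	return 0;
}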
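And here is a sketch of what the byte-by-byte comparison functions
proposed above could look like, assuming strcmp(3)/strncmp(3) return
semantics and a size_t that counts bytes. Note that no normalization
is attempted, so distinct byte sequences that render identically
compare as unequal, which is exactly the trade-off from 1):

#include <stddef.h>

int
grapheme_cmp(const char *a, const char *b)
{
	/* plain byte-by-byte comparison; given that the byte order
	 * of UTF-8 matches the codepoint order, this also yields a
	 * codepoint-wise ordering */
	for (; *a && *a == *b; a++, b++)
		;

	return (int)(unsigned char)*a - (int)(unsigned char)*b;
}

int
grapheme_ncmp(const char *a, const char *b, size_t n)
{
	/* here the size_t counts bytes; counting grapheme clusters
	 * instead would require the length-measuring state machine
	 * from the first sketch */
	for (; n && *a && *a == *b; a++, b++, n--)
		;

	return n ? (int)(unsigned char)*a - (int)(unsigned char)*b : 0;
}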
I think it won't get any worse than that and the API will work for
future versions of Unicode as well, but string comparison and other
things will take more consideration.

With best regards

Laslo