On Thu, 27 Sep 2018 19:40:06 +0000 sylvain.bertr...@gmail.com wrote:

Dear Sylvain,
> I did dive a bit deeper into the latest unicode, and it's even worse
> than what I thought.
> To deal with real unicode input/output and to split it into
> "extended grapheme clusters" (a unicode "char"), you need a finite
> state machine (I guess that's what Laslo was referring to). And it's
> the same for the "line returns" handling.

It depends on how you implement it. The way I did it was to offer a
function

	int bound(uint32_t a, uint32_t b)

which returns 1 if a and b form a grapheme cluster boundary and 0 if
they do not. In a stream-based setting you would have the following
layers, starting with the raw byte input:

   1) UTF-8 decoding (into uint32_t code points)
   2) grapheme cluster detection using bound() on the uint32_t code
      points

The function bound() just operates on relatively small LUTs and is
pretty efficient; a rough sketch of this layering follows below.

If we implement a font drawing library in some way, we will have to
think about how to do this special handling right. Extended grapheme
clusters fortunately really stand for themselves and can be a good
"atom" to base font rendering on. No matter how we draw it in the
"raster" at the end, it would already be a big step for st to have an
"idea" of what the raw input really "means" in the drawn state.

> Additionally, unicode NFC normalization is kind of useless (the one
> chosen for the web), since they have forbidden pre-combined glyphs
> for a long time; you end up implementing NFD stuff anyway (that move
> was obviously malicious).

Yes, NFD is the only "sane" choice (see the small example below).

> So, the real culprits are actually written languages: they suck.
> Namely, you cannot write suckless code for tons of written
> languages, and on top of that, since the handling of simple written
> languages is generalized together with that of some of the most
> complex written languages, handling those simple written languages
> properly will use the same complex/generalized definitions and
> mechanisms.

It's the complexity of the real world. We should not deny it, and it's
actually a monstrous task the Unicode consortium has undertaken; I
respect them for that, even though many of their solutions seem too
complicated. They also should not bend to the emoji crowd so easily.
Unicode is the "standard" trying to encompass human writing systems. I
don't really want to think about what people 5 generations from now
might think about the poop emoji.

> On the rendering side, those complex mechanisms allow font designers
> to spare a good chunk of work: the one required for pre-combined
> glyphs. Expect fewer and fewer pre-combined glyphs in fonts, with
> unique unicode points mapping to them, even for simple written
> languages. And expect lighter font files.

This is an interesting point.

> It means there is no good real middle ground (a good middle ground
> on the web would be basic XHTML without javascript).

Javascript has its purposes if applied lightly and always as an
afterthought (i.e. the page works 100% without Javascript).

> And st in all that?
> Do like linux line discipline drivers? Namely, handle only
> utf8-encoded unicode code points (no extended grapheme clusters),
> and actually work on ascii?

As I said earlier, the terminal emulation itself is unaffected,
because it is more or less "blind" to the higher levels of Unicode and
even UTF-8. The control sequences are ASCII, and the code as is works
and does not need to be changed. What it's all about is the rendering
part, and this is an area where applications of course have a big say.
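To make the layering above concrete, here is a minimal sketch of the
two stages. It is an illustration under stated assumptions, not the
real implementation: the hypothetical utf8_decode() helper skips
validation of the continuation bytes, and the bound() stand-in only
covers the single "do not break before a combining diacritical mark"
rule, whereas the real function consults the LUTs derived from the
Unicode grapheme break properties.

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* assumed helper: decode one UTF-8 sequence, return bytes consumed
 * (0 on end of input or error); validation is omitted for brevity */
static size_t
utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
	if (n >= 1 && s[0] < 0x80) {
		*cp = s[0];
		return 1;
	} else if (n >= 2 && (s[0] & 0xE0) == 0xC0) {
		*cp = ((uint32_t)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
		return 2;
	} else if (n >= 3 && (s[0] & 0xF0) == 0xE0) {
		*cp = ((uint32_t)(s[0] & 0x0F) << 12) |
		      ((uint32_t)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
		return 3;
	} else if (n >= 4 && (s[0] & 0xF8) == 0xF0) {
		*cp = ((uint32_t)(s[0] & 0x07) << 18) |
		      ((uint32_t)(s[1] & 0x3F) << 12) |
		      ((uint32_t)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
		return 4;
	}
	return 0;
}

/* stand-in for the LUT-based bound(): 1 if a cluster boundary lies
 * between a and b; here only combining diacritics extend a cluster */
static int
bound(uint32_t a, uint32_t b)
{
	(void)a;
	return !(b >= 0x0300 && b <= 0x036F);
}

int
main(void)
{
	const unsigned char in[] = "he\xCC\x81y"; /* "héy" in NFD */
	uint32_t prev = 0, cp;
	size_t i, len;

	for (i = 0; (len = utf8_decode(in + i, sizeof(in) - 1 - i,
	                               &cp)) > 0; i += len) {
		if (i > 0 && bound(prev, cp))
			putchar('\n');            /* cluster finished */
		printf("U+%04X ", (unsigned)cp);  /* extend cluster */
		prev = cp;
	}
	putchar('\n');
	return 0;
}

This prints one extended grapheme cluster per line, grouping U+0065
and U+0301 together. Note that the loop only keeps the previous and
the current code point around; an application collecting the cluster
itself would need a buffer of bounded size, for the reason given
below.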
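As for the normalization forms, a small self-contained example (the
byte strings are the UTF-8 encodings, chosen here purely for
illustration) shows why NFC buys little: the same user-perceived
character "é" exists precomposed as U+00E9 and decomposed as
U+0065 U+0301, and any code that compares or segments strings has to
handle the decomposed form anyway.

#include <stdio.h>

int
main(void)
{
	const char nfc[] = "\xC3\xA9";  /* NFC: U+00E9 LATIN SMALL
	                                   LETTER E WITH ACUTE */
	const char nfd[] = "e\xCC\x81"; /* NFD: U+0065 followed by
	                                   U+0301 COMBINING ACUTE ACCENT */

	/* both render as one grapheme cluster, yet differ byte-wise */
	printf("NFC: %s (%zu bytes)\n", nfc, sizeof(nfc) - 1);
	printf("NFD: %s (%zu bytes)\n", nfd, sizeof(nfd) - 1);
	return 0;
}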
Only a tiny fraction of applications really "respects" extended
grapheme clusters; most, at best, still assume code point == grapheme
cluster, sbase/ubase included. This is not meant as bashing, but
really just due to the fact that all this processing on higher layers
is a question of efficiency, especially when e.g. the UNIX system
tools are used with plain ASCII data 99% of the time, not requiring
all the UTF-8 processing.

> For suckless, as a consistent whole, it means:
> - It becomes an ascii-only framework (Anselm seems to like this),
>   and will be kind of useless for any text-interacting application
>   going beyond ascii (i.e. no more mutt with non-ascii email, no
>   more lynx with non-ascii-only web pages...). A zero-i18n
>   framework. In the case of wayland st: own ascii bitmap fonts and
>   own font renderer.

I would not favor such a solution, but this is just my opinion.

> - suckless gets its own unicode handling code
>   (libicu/freetype+harfbuzz look-alike implementation).

This is the other extreme. If I found the time, I'd spend more of it
on the library I've been working on, which is more or less optimized
for stream processing, an area where in my opinion many of the other
Unicode libraries are lacking. I have not yet dared to touch NFD or
normalization and string comparison in general, but for simple
stream-based operations, and to get a grasp of a stream and of where
the bounds of extended grapheme clusters are, you only need to know
the current and the previous code point (by definition of bound(),
cf. the sketch above) to know when a "drawn character" is finished.

Still, even there we would need limits, as Unicode sets no limit on
the size of an extended grapheme cluster. But this is a "problem" of
the implementing application itself and not of the library, which I
strive to keep free of memory allocations altogether.

With best regards

Laslo

-- 
Laslo Hunhold <d...@frign.de>