On Tue, 25 Sep 2018 21:25:12 +0000 sylvain.bertr...@gmail.com wrote:

Dear Sylvain,
> A unicode string has 4 canonical normalizations, but only one (NFD)
> seems to be future-proof regarding what features will be supported by
> font files (opentype (microsoft tm)/open font format).

this is true, as only the full decomposition is even remotely
"mendable". All the other canonical forms are a nightmare. The upside
is that these normalizations are only really relevant when you do text
operations like comparisons. The way codepoints are laid out says
nothing immediate about the drawn size of a glyph, which requires other
means. st does not, and arguably should not, mess with normalization,
and Unicode fortunately defines glyph boundaries relatively simply. I
have a very minimalistic library on my hard drive that has been idling
for almost 3 years now (Roberto and I hacked on it in Budapest after
slcon2 in 2015).

> Ofc, this is the one canonical normalization which hard-depends on
> harfbuzz shaping in freetype. For instance the glyph 'é' won't be
> 1 glyph (a "pre-combined" glyph) in the font file anymore, but will
> be the combined rendering of 'e' + 'combining accent' glyphs, which
> only harfbuzz understands and not freetype alone. Font designers are
> pushed to avoid making "pre-combined" glyphs: pre-combined glyphs are
> not allowed in unicode anymore (actually, this has been the case for
> quite some time). And that's the simple case of combined glyphs...

This is the true issue, yes, and this whole concept is way ahead of
the technological ecosystem. Only now have people begun to respect
codepoints and not just the code units themselves. However, what
Unicode preaches is that a glyph is free to be composed of arbitrarily
many codepoints, which complicates the whole matter a lot.

> Additionally, xml smil/svg vector rendering was introduced in the
> otf/ttf font format with animated color emojis: a future "clean"
> pure xml font format is lurking on the horizon (open type 2?).

We should ignore this nonsense.
> The unicode canonical normalization also affects input: the
> application won't receive 1 unicode code point for a "pre-combined"
> symbol 'é' anymore, but 2 unicode code points 'e' + 'combining
> accent'.

This is not a problem with st, but a general "issue" in text
processing. I researched this heavily a while back, and if we e.g.
went 100% with that in sbase, we would always have to have a
normalizer running in the background. I wrote a simple parser in
awk(1) that takes the Unicode data and turns it into a LUT for NFD
processing, but it all complicates things a lot. I understand why they
did it like this, but this is UTF-16 all over again, where people,
given the lack of surrogate characters in common input data, made the
mistake of always assuming a code point to consist of only 16 bits,
while a code point can actually be composed of two 16-bit units (a
surrogate pair).

> st is surrounded.

I wouldn't overdramatize this. The terminal emulation backend couldn't
care less about Unicode, and the robustness of UTF-8 allows us to just
carry on. The real problem is "judging" how much space the given data
is going to take in the drawing step, not to forget the huge problem
with the font drawing library. As you already mentioned, having all
this NFD combining mess definitely complicates font drawing compared
to just having these characters already "ready" for use.

> The suckless future-proof solution: it is over, st goes 7-bit ascii
> only with its own bitmap fonts... non-english-only terminal users
> will just trash it.
>
> ... or a suckless future-proof unicode/font stack will have to be
> coded:
> - unicode normalizer (NFD) (like ICU)

ICU is a dead end, as it loads localized data on the fly. The
normalizer, if implemented, would only use the "global" tables. Such a
normalizer would not be necessary for st, though; we would only need a
tool to count glyphs, which I've already done.
> - a full xml smil/svg vector renderer (like librsvg/expat for
>   the svg part)

No, forget about SVG fonts. Nobody sane would think about implementing
this while keeping simplicity and security in mind.

> - a ttf/otf -> xml svg translator (in freetype).

There's no need to translate to SVG. TTF/OTF is actually a quite
convenient vector format, and if one were to develop a font rendering
library, one would want to split the task into three steps:

1) Parsing TTF/OTF files
2) Assembling vector drawing instructions (hardest part)
3) Rasterization (watch out for patents here)

> ... or st becomes like surf: an app which is a thin suckless wrapper
> around a huge pile of ... You know what: st would be better off
> being a thin wrapper around libvte then, because it would be even
> thinner.

We shouldn't throw the baby out with the bathwater, in my opinion.
There is a lot of pent-up frustration out there about
freetype/fontconfig, and there are relatively simple solutions that
could be a starting point for a solid homegrown solution. I hope this
does not sound like NIH syndrome, but the madness needs to stop, and
freetype/fontconfig is a horrible security hole.

The only thing you really need for a font database is a list of fonts
in descending order (i.e. a fallback array). The API for such a
library, let's call it sfl (suckless font library), would be very
simple:

	struct sfl { ... };

	sfl_init(struct sfl *s, char **files, size_t nfiles);
	sfl_draw(...);
	sfl_free(struct sfl *s);

Some functionality, like getting the "length" of the drawn string,
could be realized by e.g. passing NULL for the drawing surface in
sfl_draw(), no matter how we implement it in detail.

But this is just theory. I didn't have time to study the TTF/OTF
formats, but I am sure we should not just give up on this topic. It
just doesn't sound right to recommend people use UTF-8 while
disregarding 25 years of this development and non-English languages.

With best regards

Laslo

--
Laslo Hunhold <d...@frign.de>