Hi Robin,

At 2024-11-10T06:13:32+0300, Robin Haberkorn via GNU roff typesetting system discussion wrote:
> can anybody explain why in Groff 1.23.0:

Lest anyone get apprehensive about disruptive changes in groff 1.23.0, this isn't a change in groff 1.23.0.

(respelling your exhibit a bit for brevity)

$ groff=~/groff-HEAD/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.23.0.2360-cf04f
Cu0438_0306
$ groff=~/groff-stable/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.23.0
Cu0438_0306
$ groff=/usr/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.22.4
Cu0438_0306
$ groff=~/groff-1.22.3/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.22.3
Cu0438_0306

That takes us back 10 years.

Something different happens if we try the PostScript output device.

$ groff=~/groff-HEAD/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tps | grep '^C'
GNU groff version 1.23.0.2360-cf04f
troff:<standard input>:1: warning: special character 'u0438_0306' not defined

The reason for this will, I hope, be clear by the end of this message.

> In other words, while preconv gave the expected U+0439, Groff
> transforms this into a combining character. This is then converted
> back into U+0439 by grotty:
>
> # echo -n 'й' | preconv -eutf-8 | groff -wall -Z -Tutf8 | grotty | hexdump -C
> 00000000  d0 b9 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
> 00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
> *
> 00000040  0a 0a 0a 0a                                       |....|
> 00000044
>
> I am writing my own Groff postprocessor [1] and this gives me
> headaches. Is there any algorithm to convert the combining characters
> back to single codepoints or am I supposed to use large translation
> tables for that?
> Somehow grotty is obviously doing it, but I haven't
> yet read the source code. There appears to be a Unicode composition
> algorithm in iconv(). glib wraps this to g_unichar_compose().

You're not looking in the right places, unfortunately.

1. groff does not use glib in any way.
2. Within groff, only preconv(1) uses iconv(3).
3. grotty(1) isn't doing it either.

What you're probably looking for is this:

https://git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/uniuni.cpp?h=1.23.0

You did guess correctly (or nearly) that groff has a large translation table for converting between Unicode precomposed and decomposed forms (the file linked above). It also has similar tables for converting from groff's built-in special character identifiers (like `\[Eu]`) to Unicode ("glyphuni.cpp") and back ("uniglyph.cpp").

I haven't developed confidence in my command of this aspect of groff's design yet; when I do, I intend to heavily revise the corresponding sections of our Texinfo manual. "Using Symbols" is one of the more glaring areas of our documentation that hasn't yet felt my loving touch.

However, what I think is going on here is the consequence of a few factors.

A. Some output devices aren't capable of constructive overstriking. ECMA-48-ish terminals (mainly emulators thereof) are the leading examples here. groff's "html" output device is another.

B. Without sophisticated fonts that know, for example, how to stack diacritical marks, constructive overstriking can be ugly and you don't want to do it anyway, at least not if the font offers appropriate precomposed glyphs that look nice.

C. Some fonts simply lack coverage for the part of Unicode of interest. Historically, the faces standardized by Adobe PostScript did not cover the Cyrillic script. groff can and does support fonts with such coverage, but one has to generate descriptions for them and put them where groff can find them; see below.

Given those constraints, it is necessary for GNU troff(1), the formatter that produces device-independent output, to know some properties of the output device and the fonts in use. As I put it in groff(1):

    GNU troff generates output in a device‐independent, but not
    device‐agnostic, page description language detailed in groff_out(5).

We tell GNU troff that a font has coverage for a code point or (decomposed) code point sequence by supplying an entry for it in a _font description file_; see groff_font(5).

We tell GNU troff that a font (effectively) doesn't support constructive overstriking with directives in the _device description file_ "DESC" called "use_charnames_in_special" and "unicode"; see groff_font(5) again.

I am not convinced that these directives are well-named, or that they isolate and orthogonalize the properties of interest. However, as I said, my understanding of the implementation remains inadequate; once it isn't, I aim to document what the truth is, after refactoring and/or revising features if necessary.

> It appears, I would have to wrap this in my programming language
> (SciTECO) as well, if I'd like to support all of the glyphs with
> diacritics it in my postprocessor.

My guess is that you will need to write a device description file for your output device. This should not be a major task; the files tend to be brief and while groff_out(5) still badly needs heavy editorial revision, I've done some sanding and polishing on groff_font(5), and it should be clear enough for you to undertake writing a "DESC" file for a "font/devsciteco" directory you create. If anything in that man page is unclear, I definitely want your feedback on the subject.

> IMHO groff shouldn't decompose characters that haven't been decomposed
> in its input.

I think your opinion may be valid for the properties of the output device of interest to you, but these properties do not necessarily hold for all of the output devices groff already supports.

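On the narrower question of turning a decomposed glyph name such as "u0438_0306" back into a single code point inside a postprocessor, here is a minimal sketch. It is written in Python purely for illustration and does not reflect how groff itself does this: it swaps in the Unicode NFC normalization from Python's standard library in place of groff's own table in "uniuni.cpp". The function name compose_glyph_name is hypothetical, and the sketch assumes glyph names of the form "u" followed by uppercase hexadecimal code points joined with underscores, as seen in the "C" commands above.

```python
import re
import unicodedata

# Assumed name shape: "u" plus 4-6 uppercase hex digits, optionally followed
# by more "_XXXX" components, e.g. "u0438_0306".
_UNICODE_NAME = re.compile(r"u[0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*")


def compose_glyph_name(name):
    """Return the NFC-normalized text for a name like 'u0438_0306'.

    Returns None when the name is not of the assumed Unicode form
    (for example a groff special character identifier such as 'Eu').
    Hypothetical helper; it is not part of groff.
    """
    if not _UNICODE_NAME.fullmatch(name):
        return None
    try:
        # Drop the leading "u", then turn each hex component into a character.
        decomposed = "".join(chr(int(part, 16)) for part in name[1:].split("_"))
    except ValueError:
        # A component beyond U+10FFFF is not a valid code point.
        return None
    # NFC composes base + combining mark wherever Unicode defines a
    # precomposed form; otherwise the sequence comes back unchanged.
    return unicodedata.normalize("NFC", decomposed)
```

A postprocessor reading the device-independent output would pass the text following a "C" command to such a function and fall back to its own handling of named special characters when it gets None back. Whether that is preferable to declaring the device's properties in a "DESC" file depends on the output device, for the reasons given above.
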
If Werner sees this and has time, he can likely shed more light on these topics and/or correct any misstatements of mine.

Regards,
Branden