Re: Why does Groff decompose Unicode glyphs in intermediate output?

G. Branden Robinson Sun, 10 Nov 2024 01:47:08 -0800

Hi Robin,

At 2024-11-10T06:13:32+0300, Robin Haberkorn via GNU roff typesetting system 
discussion wrote:
> can anybody explain why in Groff 1.23.0:


Lest anyone get apprehensive about disruptive changes in groff 1.23.0,
this isn't a change in groff 1.23.0.

(respelling your exhibit a bit for brevity)

$ groff=~/groff-HEAD/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.23.0.2360-cf04f
Cu0438_0306

$ groff=~/groff-stable/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | $groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.23.0
Cu0438_0306

$ groff=/usr/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | $groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.22.4
Cu0438_0306

$ groff=~/groff-1.22.3/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | $groff -Kutf8 -ww -Z -Tutf8 | grep '^C'
GNU groff version 1.22.3
Cu0438_0306

That takes us back 10 years.

Something different happens if we try the PostScript output device.

$ groff=~/groff-HEAD/bin/groff; $groff --version | head -n 1; \
  echo -n 'й' | ~/groff-HEAD/bin/groff -Kutf8 -ww -Z -Tps | grep '^C'
GNU groff version 1.23.0.2360-cf04f
troff:<standard input>:1: warning: special character 'u0438_0306' not defined

The reason for this will, I hope, be clear by the end of this message.

> In other words, while preconv gave the expected U+0439, Groff
> transforms this into a combining character. This is then converted
> back into U+0439 by grotty:
> 
> # echo -n 'й' | preconv -eutf-8 | groff -wall -Z -Tutf8 | grotty | hexdump -C
> 00000000  d0 b9 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
> 00000010  0a 0a 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
> *
> 00000040  0a 0a 0a 0a                                        |....|
> 00000044
> 
> I am writing my own Groff postprocessor [1] and this gives me
> headaches. Is there any algorithm to convert the combining characters
> back to single codepoints or am I supposed to use large translation
> tables for that?  Somehow grotty is obviously doing it, but I haven't
> yet read the source code.  There appears to be a Unicode composition
> algorithm in iconv(). glib wraps this to g_unichar_compose().

You're not looking in the right places, unfortunately.

1.  groff does not use glib in any way.
2.  Within groff, only preconv(1) uses iconv(3).

3.  grotty(1) isn't doing it either.

What you're probably looking for is this:

https://git.savannah.gnu.org/cgit/groff.git/tree/src/libs/libgroff/uniuni.cpp?h=1.23.0

You did guess correctly (or nearly) that groff has a large translation
table for converting between Unicode precomposed and decomposed forms
(the file linked above).  It also has similar tables for converting from
groff's built-in special character identifiers (like `\[Eu]`) to Unicode
("glyphuni.cpp") and back ("uniglyph.cpp").
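For a postprocessor that cannot consult groff's own tables, the reverse
mapping might be sketched roughly as follows.  This is Python, not
anything groff ships, and it rests on an assumption: that Unicode NFC
composition from the standard library's unicodedata module is an
acceptable stand-in for inverting the table in uniuni.cpp.  That will
not hold for every character, because NFC deliberately leaves some
sequences decomposed.  The function names are invented for this example.

```python
import re
import unicodedata

# Matches groff's Unicode glyph names such as "u0438_0306": "u" plus
# 4-6 uppercase hex digits, with further components joined by "_".
_UNICODE_NAME = re.compile(r"u[0-9A-F]{4,6}(?:_[0-9A-F]{4,6})*")


def glyph_name_to_text(name):
    """Return the code point sequence spelled by a uXXXX[_XXXX...] name,
    or None if the name is not of that form (e.g. "Eu")."""
    if not _UNICODE_NAME.fullmatch(name):
        return None
    return "".join(chr(int(part, 16)) for part in name[1:].split("_"))


def compose(name):
    """Recombine a decomposed glyph name into NFC text.

    Assumption: NFC stands in for the inverse of groff's decomposition
    table; this is not how groff itself does it.
    """
    text = glyph_name_to_text(name)
    return None if text is None else unicodedata.normalize("NFC", text)


def glyphs_from_output(lines):
    """Yield (name, composed) for each line that starts with "C",
    mirroring the grep '^C' used in the exhibits."""
    for line in lines:
        if line.startswith("C"):
            name = line[1:].strip()
            yield name, compose(name)
```

Given the line "Cu0438_0306" from the exhibits, compose() recombines
U+0438 and U+0306 into the single code point U+0439.  Names that are not
of the uXXXX form, such as "Eu", come back as None and would still need
a table lookup of the glyphuni.cpp kind.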

I haven't developed confidence in my command of this aspect of groff's
design yet--when I do, I intend to heavily revise the corresponding
sections of our Texinfo manual; "Using Symbols" is one of the more
glaring areas of our documentation that hasn't yet felt my loving touch.

However, what I think is going on here is the consequence of a few
factors.

A.  Some output devices aren't capable of constructive overstriking.
    ECMA-48-ish terminals (mainly emulators thereof) are the leading
    examples here.  groff's "html" output device is another.
B.  Without sophisticated fonts that know, for example, how to stack
    diacritical marks, constructive overstriking can be ugly and you
    don't want to do it anyway--not if the font offers appropriate
    precomposed glyphs that look nice.
C.  Some fonts simply lack coverage for the part of Unicode of interest.
    Historically, the faces standardized by Adobe PostScript did not
    cover the Cyrillic script.  groff can and does support fonts with
    such coverage, but one has to generate descriptions for them and put
    them where groff can find them; see below.

Given those constraints, it is necessary for GNU troff(1), the formatter
that produces device-independent output, to know some properties of the
output device and the fonts in use.

As I put it in groff(1):
     GNU troff generates output in a device‐independent, but not device‐
     agnostic, page description language detailed in groff_out(5).

We tell GNU troff that a font has coverage for a code point or
(decomposed) code point sequence by supplying an entry for it in a _font
description file_; see groff_font(5).

We tell GNU troff that a font (effectively) doesn't support constructive
overstriking with directives in the _device description file_ "DESC"
called "use_charnames_in_special" and "unicode"; see groff_font(5)
again.  I am not convinced that these directives are well-named, or that
they isolate and orthogonalize the properties of interest.  However, as
I said, my understanding of the implementation remains inadequate; once
it isn't, I aim to document what the truth is, after refactoring and/or
revising features if necessary.
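To see which of these directives a given device sets, something like
the following sketch could be used.  It assumes only the basic layout
groff_font(5) describes for "DESC" (a keyword followed by arguments, "#"
starting a comment, and "charset" ending the part of the file that
matters).  The sample text and the function name are hypothetical; they
are not copied from any device groff ships.

```python
# Hypothetical DESC contents, for illustration only.
SAMPLE = """\
# illustrative values for a terminal-like device
res 240
hor 24
vert 40
unitwidth 10
sizes 10 0
fonts 4 R I B BI
unicode
"""


def desc_directives(text):
    """Map each directive keyword in DESC-style text to its arguments."""
    directives = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        keyword, *args = line.split()
        if keyword == "charset":
            break  # the rest of a DESC file is ignored
        directives[keyword] = args
    return directives


if __name__ == "__main__":
    found = desc_directives(SAMPLE)
    for name in ("unicode", "use_charnames_in_special"):
        print(name, "present" if name in found else "absent")
```

Pointing desc_directives() at the contents of a real "DESC" file, in
place of SAMPLE, reports the same two directives for that device.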

> It appears, I would have to wrap this in my programming language
> (SciTECO) as well, if I'd like to support all of the glyphs with
> diacritics in my postprocessor.

My guess is that you will need to write a device description file for
your output device.  This should not be a major task; the files tend to
be brief and while groff_out(5) still badly needs heavy editorial
revision, I've done some sanding and polishing on groff_font(5), and it
should be clear enough for you to undertake writing a "DESC" file for a
"font/devsciteco" directory you create.  If anything in that man page is
unclear, I definitely want your feedback on the subject.

> IMHO groff shouldn't decompose characters that haven't been decomposed
> in its input.

I think your opinion may be valid for the properties of the output
device of interest to you, but these properties do not necessarily hold
for all of the output devices groff already supports.

If Werner sees this and has time, he can likely shed more light on these
topics and/or correct any misstatements of mine.

Regards,
Branden
