Hi Deri,

At 2024-08-31T17:07:28+0100, Deri wrote:
> On Saturday, 31 August 2024 00:07:57 BST G. Branden Robinson wrote:

[fixing two of my own typos and one wordo in the following]

> > It would be cleaner and simpler to provide a mechanism for
> > processing a string directly, discarding escape sequences (like
> > vertical motions or break points [with or without hyphenation]).
> > This point is even more emphatic because of the heavy
> > representation of special characters in known use cases.  That is,
> > to "sanitize" (or "pdfclean") such strings by round-tripping them
> > through a process that converts a sequence of easily handled bytes
> > like "\['a]" or "\[u0411]" into a special character node and then
> > back again seems wasteful and fragile to me.
>
> This would be great, but I see some problems with the current code.
> Doing this:-
>
> [derij@pip build (master)]$ echo ".device \[u012F]"|./test-groff -Tpdf -Z |
> grep "^x X"
> x X \[u012F]
> [derij@pip build (master)]$ echo "\X'\[u012F]'"|test-groff -Tpdf -Z |
> grep "^x X"
> x X \[u0069_032]
>
> Shows that the \[u012F] has been decomposed (wrongly!) by \X.
You're raising two issues here.

The decomposed Unicode sequence should be:

  u0069_0328

not

  u0069_032

I 100% agree that that's a bug--thank you for finding it.  I'll fix
it.

But, is doing the decomposition wrong?  I think it's intended.  Here's
what our documentation says.

groff_char(7):

    Unicode code points can be composed as well; when they are, GNU
    troff requires NFD (Normalization Form D), where all Unicode
    glyphs are maximally decomposed.  (Exception: precomposed
    characters in the Latin‐1 supplement described above are also
    accepted.  Do not count on this exception remaining in a future
    GNU troff that accepts UTF‐8 input directly.)  Thus, GNU troff
    accepts “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as ways
    to input “café”.  (Due to its legacy 8‐bit encoding compatibility,
    at present it also accepts “caf\[u00E9]” on ISO Latin‐1 systems.)

Here are the matches in the source, excluding some false positives.

$ git grep 012F
contrib/rfc1345/rfc1345.tmac:.char \[i;] \[u012F] \" LATIN SMALL LETTER I WITH OGONEK
font/devhtml/R.proto:u0069_0328 24 0 0x012F
font/devlj4/generate/text.map:433 012F u0069_0328
font/devutf8/R.proto:u0069_0328 24 0 0x012F
src/libs/libgroff/uniuni.cpp: { "012F", "20069_0328" },
src/utils/afmtodit/afmtodit.tables: "012F", "0069_0328",
src/utils/afmtodit/afmtodit.tables: "iogonek", "012F",
src/utils/hpftodit/hpuni.cpp: { "433", "012F", }, // Lowercase I Ogonek

The file "uniuni.cpp" is the one of relevance here.  It stores a large
decomposition table that is directly derived from the Unicode
Consortium's UnicodeData.txt file.  (In fact, I just updated that file
for Unicode 15.1.)

> Whilst this might make sense for the text stream since afmtodit keys
> the glyphs on the decomposed unicode.

Having one canonical decomposition in GNU troff makes _lots_ of things
easier, I'm sure.

> I would love to know why we decompose,

I don't know.  Maybe Werner can speak to the issue: he introduced the
"uniuni.cpp" file in 2003 and then, in 2005, the "make-uniuni" script
for regenerating it.

> since none of our fonts include combining diacritical mark glyphs so
> neither grops nor gropdf have a chance to synthesise the glyphs from
> the constituent parts if it is not present in the font!

It seems like a good thing to hold onto for the misty future when we
get TTF/OTF font support.

> Given that the purpose of \X is to pass meta-data to output drivers,

...which _should_ be able to handle NFD if they handle Unicode at all,
right?

> which probably will convert it to utf-8 or utf16, it seems odd to
> decompose the output from preconv (utf16) before passing to the
> output driver,

It doesn't seem odd to me.  The fact that Unicode has supported at
least four different normalization forms since, as I recall, Unicode
3.0 (or earlier?) suggests to me that there's no one obvious answer to
this problem of representation.

> .device does not.

The reason is that the request reads its argument in copy mode, and
`\X` does not.  And, uh, well, you can plan on that going away.  Or,
at least, for the `device` request to align with whatever `\X` does.

> The correct decompose for 012F is 0069_0328, so it is just a string
> truncation bug.

Yes, I'm happy to fix that.  (The expected form is easy to
double-check; see the short Perl check after the footnotes at the end
of this message.)

> Just like you I would like to avoid "round-tripping", utf16 (preconv)
> -> decomposed (troff) -> utf16 (gropdf).

That's not a good example of a round trip, since there is no path back
from "grout" (device-independent output) to a GNU troff node list or
token sequence.

> This does not currently affect grops which does not support anything
> beyond 8bit ascii.
I'll be tackling that soonish.

https://savannah.gnu.org/bugs/?62830

> Do you agree it makes more sense for \X to pass \[u012F] rather than
> \[u0069_0328]?

Not really.  As far as I can tell there's no straightforward way to do
anything different.  GNU troff _unconditionally_ runs all simple
(non-composite) special characters through the `decompose_unicode()`
function defined in "uniuni.cpp".  The place it does this is in
`token::next()`, a deeply core function that handles pretty much every
escape sequence in the language.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n2295

To do what you suggest would mean I'd have to add some kind of state
to the formatter that alters how that function behaves, and I'd have
to do it in six places, and make sure I unwound it correctly in each
case (in error pathways as well as valid ones).

Why six?  Because all of

  \X
  .device
  \!
  .output
  .cf
  .trf

can inject stuff into "grout".

That seems like a perilous path to me.  I appreciate that the
alternative is to hand the output drivers a problem labeled "composite
Unicode character sequences".

I can try my hand at writing a patch for gropdf(1) if you like.  It
feels like it should be easier than doing so in C++ (which I'll also
have to do).  At least if the problem is as straightforward as I think
it is:

Upon encountering a Unicode-style escape sequence, meaning a byte
sequence starting `\[u`: [1]

0. Assert that the next character on the input stream is an uppercase
   hexadecimal digit.
1. Read a hexadecimal value until a non-hexadecimal character is
   found.  Convert that value to whatever encoding the target device
   requires.
2. If the next character is `_`, go to 1.
3. If the next character is `]`, stop.

(A rough Perl sketch of this loop appears after the footnotes below.)

Would you like me to give this a shot?  A PDF reader expects UTF-16LE,
right?

Regards,
Branden

[1] The rules are a little more complicated for GNU troff itself due
    to support for the escape sequences `\[ul]`, `\[ua]`, and `\[uA]`.
    But as presently implemented, and per my intention, these will
    never appear in "grout"--only Unicode code point identifiers.[2]

[2] And `\[ul]` won't appear even in disguise because it maps to no
    defined Unicode character.  But you don't get a diagnostic about
    it because the formatter turns it into a drawing command.

    $ printf '\\[ul]\n' | ./build/test-groff -T pdf -ww -Z | grep '^D'
    DFd Dl 5000 0
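Here's a quick way to double-check the expected NFD form, using Perl's
core Unicode::Normalize module (just a sketch to run at a prompt,
nothing from the groff tree):

#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Maximally decompose U+012F (LATIN SMALL LETTER I WITH OGONEK) and
# print the result groff-style.  This prints "u0069_0328", matching
# uniuni.cpp and confirming that the "u0069_032" above lost its final
# digit.
my $nfd = NFD("\x{012F}");
printf "u%s\n", join '_', map { sprintf '%04X', ord } split //, $nfd;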
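And here's roughly the shape I'd expect the loop above to take in
gropdf's Perl.  This is a sketch only: the subroutine name is my
invention, not anything in gropdf today, and note that the PDF spec
actually calls for UTF-16BE with a leading byte order mark in text
strings, not UTF-16LE.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch, not gropdf's real code: convert a grout "x X" argument
# containing \[uXXXX] and \[uXXXX_YYYY...] escape sequences into a
# PDF text string (UTF-16BE with a byte order mark).
sub grout_to_utf16 {
    my ($arg) = @_;
    my $out = "\xFE\xFF";    # byte order mark
    while (length $arg) {
        # Steps 0-3 above: '\[u', then hexadecimal values joined by
        # '_', terminated by ']'.
        if ($arg =~ s/^\\\[u([0-9A-F]+(?:_[0-9A-F]+)*)\]//) {
            for my $cp (map { hex } split /_/, $1) {
                if ($cp > 0xFFFF) {
                    # Code points beyond the BMP need a surrogate
                    # pair.
                    my $v = $cp - 0x10000;
                    $out .= pack 'n2', 0xD800 | ($v >> 10),
                                       0xDC00 | ($v & 0x3FF);
                }
                else {
                    $out .= pack 'n', $cp;
                }
            }
        }
        else {
            # Anything else in the argument is an ordinary ASCII
            # byte; just widen it to one UTF-16 code unit.
            $out .= pack 'n', ord substr $arg, 0, 1, '';
        }
    }
    return $out;
}

# The NFD input form from groff_char(7) quoted above:
my $pdf_string = grout_to_utf16('caf\[u0065_0301]');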