On Sunday, 1 September 2024 06:09:17 BST G. Branden Robinson wrote:
> Hi Deri,
>
> At 2024-08-31T17:07:28+0100, Deri wrote:
> > On Saturday, 31 August 2024 00:07:57 BST G. Branden Robinson wrote:
> [fixing two of my own typos and one wordo in the following]
>
> > > It would be cleaner and simpler to provide a mechanism for
> > > processing a string directly, discarding escape sequences (like
> > > vertical motions or break points [with or without hyphenation]).
> > > This point is even more emphatic because of the heavy representation
> > > of special characters in known use cases.  That is, to "sanitize"
> > > (or "pdfclean") such strings by round-tripping them through a
> > > process that converts a sequence of easily handled bytes like
> > > "\ [ ' a ]" or "\ [ u 0 4 1 1 ]" into a special character node and
> > > then back again seems wasteful and fragile to me.
>
> > This would be great, but I see some problems with the current code.
> > Doing this:-
> >
> > [derij@pip build (master)]$ echo ".device \[u012F]"|./test-groff -Tpdf -Z | grep "^x X"
> > x X \[u012F]
> > [derij@pip build (master)]$ echo "\X'\[u012F]'"|test-groff -Tpdf -Z | grep "^x X"
> > x X \[u0069_032]
> >
> > Shows that the \[u012F] has been decomposed (wrongly!) by \X.
>
> You're raising two issues:
>
> The decomposed Unicode sequence should be:
>
>   u0069_0328
>
> not
>
>   u0069_032
>
> I 100% agree that that's a bug--thank you for finding it.  I'll fix it.
>
> But, is doing the decomposition wrong?  I think it's intended.
>
> Here's what our documentation says.
>
> groff_char(7):
>
>     Unicode code points can be composed as well; when they are, GNU
>     troff requires NFD (Normalization Form D), where all Unicode glyphs
>     are maximally decomposed.  (Exception: precomposed characters in
>     the Latin‐1 supplement described above are also accepted.  Do not
>     count on this exception remaining in a future GNU troff that
>     accepts UTF‐8 input directly.)  Thus, GNU troff accepts “caf\['e]”,
>     “caf\[e aa]”, and “caf\[u0065_0301]”, as ways to input “café”.
>     (Due to its legacy 8‐bit encoding compatibility, at present it also
>     accepts “caf\[u00E9]” on ISO Latin‐1 systems.)
Exactly: it says it "can" be composed, not that it must be (this text was
added by you post-1.22.4); in fact most \[uxxxx] input to groff is not
composed (it comes from preconv). Troff then performs NFD conversion (so
that it matches a named glyph in the font). Conceptually this is a stream
of named glyphs; there is a second stream of device control text, which has
nothing to do with fonts or glyph names. Device controls are passed by
.device (and friends). \[u0069_0328] is a named glyph in a font; \[u012F]
is a 7-bit ASCII representation (provided by preconv) of the unicode code
point.

The groff_char(7) text you quote is simply saying that input to groff can
be composite or not. How has that any bearing on how troff talks to its
drivers? If a user actually wants to use a composite character, it is
saying you can enter \[u0069_0328], or you can leave it to preconv to use
\[u012F]. Unfortunately, with the way you intend to change groff, document
text will always use the single glyph (if available) and meta-data will
always use a composite glyph, so there is no real choice for the user.

User-facing programs use NFD, since it makes it easier to sort and search
the glyph stream. Neither grops nor gropdf is "user facing": they are
generators of documents which require a viewer or printer to render them;
the only user-facing driver is possibly X11. There is a visible difference
between using NFD and using the actual unicode text character when
specifying PDF bookmarks. The attached PDF has screenshots of the bookmark
panel, using \[u0069_0328] (NFD) and \[u012F] (NFC). The example using
\[u012F] is superior (in my opinion) because it uses the single glyph the
font designer intended for that character, rather than combining two glyphs
that don't marry up too well.

> Here are the matches in the source, excluding some false positives.
>
> $ git grep 012F
> contrib/rfc1345/rfc1345.tmac:.char \[i;] \[u012F] \" LATIN SMALL LETTER I WITH OGONEK
> font/devhtml/R.proto:u0069_0328 24 0 0x012F
> font/devlj4/generate/text.map:433 012F u0069_0328
> font/devutf8/R.proto:u0069_0328 24 0 0x012F
> src/libs/libgroff/uniuni.cpp:  { "012F", "20069_0328" },
> src/utils/afmtodit/afmtodit.tables:  "012F", "0069_0328",
> src/utils/afmtodit/afmtodit.tables:  "iogonek", "012F",
> src/utils/hpftodit/hpuni.cpp:  { "433", "012F", },  // Lowercase I Ogonek
>
> The file "uniuni.cpp" is what's of relevance here.  It stores a large
> decomposition table that is directly derived from the Unicode
> Consortium's UnicodeData.txt file.  (In fact, I just updated that file
> for Unicode 15.1.)

This has no bearing on whether it is sensible to use NFD to send text to
output drivers rather than the actual unicode value of the character.

> > Whilst this might make sense for the text stream since afmtodit keys
> > the glyphs on the decomposed unicode.
>
> Having one canonical decomposition in GNU troff makes _lots_ of things
> easier, I'm sure.
>
> > I would love to know why we decompose,
>
> I don't know.  Maybe Werner can speak to the issue: he introduced the
> "uniuni.cpp" file in 2003 and then, in 2005, the "make-uniuni" script
> for regenerating it.
>
> > since none of our fonts include combining diacritical mark glyphs so
> > neither grops nor gropdf have a chance to synthesise the glyphs from
> > the constituent parts if it is not present in the font!
>
> It seems like a good thing to hold onto for the misty future when we get
> TTF/OTF font support.

So, it does not make sense now, but might in the future.
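(Incidentally, the truncation bug above is easy to confirm outside groff.
Here is a rough check using Perl's core Unicode::Normalize module - it has
nothing to do with troff's uniuni.cpp table, this is just a sketch to show
what the NFD and NFC forms of this character are:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD NFC);

    # U+012F LATIN SMALL LETTER I WITH OGONEK
    my $nfd = NFD("\x{012F}");
    printf "NFD: %s\n", join '_', map { sprintf '%04X', ord } split //, $nfd;
    # NFD: 0069_0328   (so the "u0069_032" seen in grout is a truncation)

    my $nfc = NFC("\x{0069}\x{0328}");
    printf "NFC: %04X\n", ord $nfc;
    # NFC: 012F        (recomposing gives back the single code point preconv emits)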
I would concede here if composite glyphs were as good as the single glyph
provided in the font, but the attached PDF shows this is not always true.
Also, in the TTF/OTF fonts I've examined, if the font contains combining
diacritics it also contains glyphs for all the base characters which can
take a diacritic, since these are just calls to subroutines with any
necessary repositioning. If you know of any fonts which include combining
diacritics but don't provide single glyphs with the base character and the
diacritic combined, please correct me.

> > Given that the purpose of \X is to pass meta-data to output drivers,
>
> ...which _should_ be able to handle NFD if they handle Unicode at all,
> right?

Of course; NFD is much better for any kind of sorting, which grops/gropdf
do not do, and if they did, they would of course change the given text to
NFD prior to sorting. As regards searching, it's a bit of a two-edged
sword.

For example, if the word "ocksŮ" in a utf8 document is used as a text
heading and a bookmark entry (think .SH "ocksŮ"), preconv converts the "Ů"
to \[u016E], and troff then applies NFD to match a glyph name in the U-TR
font - \[u0055_030A]. When .device and .output used "copy in" mode, the
original unicode code point \[u016E] was passed to the device, but if the
recent changes to \X ("new mode 3") are rolled out to the other 7(?)
commands which communicate text to the device drivers, they will receive
\[u0055_030A] instead. If this composite code (4 bytes in UTF16) is used as
the bookmark text, we have seen it can produce less than optimal results in
the bookmark pane, but it can also screw up searching in the PDF viewer.

Okular (a PDF viewer) has two search boxes: one for the text, where
entering "ocksŮ" will find the heading; the second is for the bookmarks,
and entering "ocksŮ" there will fail to find the bookmark, since the final
character is in fact two characters. This result may surprise users:
entering exactly the same keystrokes as they used when writing the document
finds the text in the document, but fails to find the bookmark.

Then why does it work in the text search, you may ask, since both have been
passed an NFD composite code? The answer is that in the grout passed to the
driver it becomes "Cu0055_030A", and although this looks like unicode it is
just the name of a glyph in the font, just as "Caq" in grout will find the
"quotesingle" glyph. The font header in the PDF identifies the PostScript
name of each glyph used for the document text, and the PDF viewer has a
lookup table which converts PostScript name "Uring" to U+016E "Ů" (back
where we started).

> > which probably will convert it to utf-8 or utf16, it seems odd to
> > decompose the output from preconv (utf16) before passing to the output
> > driver,
>
> It doesn't seem odd to me.  The fact that Unicode has supported at least
> four different normalization forms since, as I recall, Unicode 3.0 (or
> earlier?) suggests to me that there's no one obvious answer to this
> problem of representation.

As I've shown, the NFD used in grout (Cuxxxx_xxxx) is simply a key to a
font glyph; the information that this glyph is a composite is entirely
unnecessary for device control text. I need to know the unicode code point
delivered by preconv, so I can deliver that single character back as UTF16
text.

> > .device does not.
>
> The reason is that the request reads its argument in copy mode, and `\X`
> does not.
>
> And, uh, well, you can plan on that going away.  Or, at least, for the
> `device` request to align with whatever `\X` does.
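(Going back to the searching point above for a moment: the mismatch is
plain if you dump the bytes that would end up in the bookmark string. A
rough illustration only, not gropdf code - I've used UTF-16BE here because
that is what PDF text strings carry:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize qw(NFD);
    use Encode qw(encode);

    my $typed = "ocksŮ";       # what the user types into the search box (NFC)
    my $nfd   = NFD($typed);   # what the bookmark becomes if grout hands us u0055_030A

    printf "NFC: %s\n", uc unpack 'H*', encode('UTF-16BE', $typed);
    printf "NFD: %s\n", uc unpack 'H*', encode('UTF-16BE', $nfd);
    # NFC: 006F0063006B0073016E
    # NFD: 006F0063006B00730055030A
    # Different byte strings, so a literal bookmark search for "ocksŮ" cannot match.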
> > The correct decompose for 012F is 0069_0328, so it is just a string
> > truncation bug.
>
> Yes, I'm happy to fix that.
>
> > Just like you I would like to avoid "round-tripping", utf16 (preconv)
> > -> decomposed (troff) -> utf16 (gropdf).
>
> That's not a good example of a round trip since there is no path back
> from "grout" (device-independent output) to a GNU troff node list or
> token sequence.

I used round-tripping in the general sense that, after processing, you end
up back where you started (the same sense as you used it). Why does groff
have to be involved for something to be considered a round-trip?

> > This does not currently affect grops which does not support anything
> > beyond 8bit ascii.
>
> I'll be tackling that soonish.
>
> https://savannah.gnu.org/bugs/?62830
>
> > Do you agree it makes more sense for \X to pass \[u012F] rather than
> > \[u0069_0328]?
>
> Not really.  As far as I can tell there's no straightforward way to do
> anything different.  GNU troff _unconditionally_ runs all simple
> (non-composite) special characters through the `decompose_unicode()`
> function defined in "uniuni.cpp".

Ok, if it can't be done, just leave what you have changed in \X, but leave
.device and .output (plus friends) in the current copy-in mode, which seem
to be working fine as they are now, unless you have an example which
demonstrates a problem which your code solves. The only example you gave of
what you are "fixing", the .AUTHOR line in a mom example doc, actually
works fine, so it is probably not a good example to justify your changes.

> The place it does this is in `token::next()`, a deeply core function
> that handles pretty much every escape sequence in the language.
>
> https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n2295
>
> To do what you suggest would mean I'd have to add some kind of state to
> the formatter that alters how that function behaves, and I'd have to do
> it in six places, and make sure I unwound it correctly in each case (in
> error pathways as well as valid ones).

Why six?

> Because all of
>
> \X
> .device
> \!
> .output
> .cf
> .trf

Why are two missing?

> can inject stuff into "grout".
>
> That seems like a perilous path to me.

Not if you restrict the changes to \X only, and document the difference in
behaviour from the other 7 methods.

> I appreciate that the alternative is to hand the output drivers a
> problem labeled "composite Unicode character sequences".  I can try my
> hand at trying to write a patch for gropdf(1) if you like.  It feels
> like it should be easier than doing so in C++ (which I'll also have to
> do).

It is not a problem: I can certainly embed a composite glyph as part of a
bookmark; the problem is that it does not always look very good (see PDF)
and messes up searching for bookmarks.

> At least if the problem is as straightforward as I think it is:
>
> Upon encountering a Unicode-style escape sequence, meaning a byte
> sequence starting `\[u`: [1]
>
> 0. Assert that the next character on the input stream is an uppercase
>    hexadecimal digit.
>
> 1. Read a hexadecimal value until a non-hexadecimal character is found.
>    Convert that value to whatever encoding the target device requires.
>
> 2. If the next character is `_`, go to 1.
>
> 3. If the next character is `]`, stop.
>
> Would you like me to give this a shot?  A PDF reader expects UTF-16LE,
> right?

Have a go if you want, I've got it down to 10 extra lines, but the results
may be depressing (see PDF).
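For what it's worth, the shape of those few extra lines is roughly this. A
sketch only, not the actual gropdf patch (the helper name is made up), and
note that as far as I can tell PDF text strings want UTF-16BE with a
leading BOM, not UTF-16LE:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Hypothetical helper, not the real gropdf code: turn a grout token such
    # as "u012F" or "u0069_0328" into a PDF text string (BOM + UTF-16BE).
    sub groff_u_to_pdf_text {
        my ($tok) = @_;
        return undef unless $tok =~ /^u([0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*)$/;
        my $str = join '', map { chr hex } split /_/, $1;  # code points -> characters
        return "\xFE\xFF" . encode('UTF-16BE', $str);      # BOM + UTF-16BE bytes
    }

Either form coming out of grout produces a usable string this way; the
difference is only that the NFC form round-trips to the user's original
keystroke, and the NFD form does not.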
> Regards,
> Branden
>
> [1] The rules are little more complicated for GNU troff itself due to
>     support for the escape sequences `\[ul]`, `\[ua]`, and `\[uA]`.  But
>     as presently implemented, and per my intention, these will never
>     appear in "grout"--only Unicode code point identifiers.[2]
>
> [2] And `\[ul]` won't appear even in disguise because it maps to no
>     defined Unicode character.  But you don't get a diagnostic about it
>     because the formatter turns it into a drawing command.
>
>     $ printf '\\[ul]\n' | ./build/test-groff -T pdf -ww -Z | grep '^D'
>     DFd
>     Dl 5000 0
NCDvCopyIn.pdf
Description: Adobe PDF document