On Sunday, 1 September 2024 06:09:17 BST G. Branden Robinson wrote:
> Hi Deri,
>
> At 2024-08-31T17:07:28+0100, Deri wrote:
> > On Saturday, 31 August 2024 00:07:57 BST G. Branden Robinson wrote:
> [fixing two of my own typos and one wordo in the following]
>
> > > It would be cleaner and simpler to provide a mechanism for
> > > processing a string directly, discarding escape sequences (like
> > > vertical motions or break points [with or without hyphenation]).
> > > This point is even more emphatic because of the heavy representation
> > > of special characters in known use cases.  That is, to "sanitize"
> > > (or "pdfclean") such strings by round-tripping them through a
> > > process that converts a sequence of easily handled bytes like
> > > "\ [ ' a ]" or "\ [ u 0 4 1 1 ]" into a special character node and
> > > then back again seems wasteful and fragile to me.
>
> > This would be great, but I see some problems with the current code.
> > Doing this:-
> >
> > [derij@pip build (master)]$ echo ".device \[u012F]"|./test-groff -Tpdf -Z | grep "^x X"
> > x X \[u012F]
> > [derij@pip build (master)]$ echo "\X'\[u012F]'"|test-groff -Tpdf -Z | grep "^x X"
> > x X \[u0069_032]
> >
> > Shows that the \[u012F] has been decomposed (wrongly!) by \X.
>
> You're raising two issues:
>
> The decomposed Unicode sequence should be:
>
>   u0069_0328
>
> not
>
>   u0069_032
>
> I 100% agree that that's a bug--thank you for finding it.  I'll fix it.
>
> But, is doing the decomposition wrong?  I think it's intended.
>
> Here's what our documentation says.
>
> groff_char(7):
>
>     Unicode code points can be composed as well; when they are, GNU
>     troff requires NFD (Normalization Form D), where all Unicode glyphs
>     are maximally decomposed.  (Exception: precomposed characters in
>     the Latin‐1 supplement described above are also accepted.  Do not
>     count on this exception remaining in a future GNU troff that
>     accepts UTF‐8 input directly.)  Thus, GNU troff accepts “caf\['e]”,
>     “caf\[e aa]”, and “caf\[u0065_0301]”, as ways to input “café”.
>     (Due to its legacy 8‐bit encoding compatibility, at present it also
>     accepts “caf\[u00E9]” on ISO Latin‐1 systems.)
Exactly: it says it "can" be composed, not that it must be (this text was
added by you post-1.22.4); in fact most \[uxxxx] input to groff is not
composed (it comes from preconv). Troff then performs NFD conversion (so
that it matches a named glyph in the font). Conceptually this is a stream
of named glyphs; there is a second stream of device control text, which has
nothing to do with fonts or glyph names. Device controls are passed by
.device (and friends). \[u0069_0328] is a named glyph in a font; \[u012F]
is a 7-bit ASCII representation (provided by preconv) of the unicode code
point.

The groff_char(7) text you quote is simply saying that input to groff can
be composite or not. How has that any bearing on how troff talks to its
drivers? If a user actually wants to use a composite character, it is
saying you can enter \[u0069_0328], or you can leave it to preconv to use
\[u012F]. Unfortunately, with the way you intend to change groff, document
text will always use the single glyph (if available) and meta-data will
always use a composite glyph, so there is no real choice for the user.

User-facing programs use NFD, since it makes it easier to sort and search
the glyph stream. Neither grops nor gropdf is "user facing": they are
generators of documents which require a viewer or printer to render them;
the only user-facing driver is possibly X11. There is a visible difference
between using NFD and using the actual unicode text character when
specifying PDF bookmarks. The attached PDF has screenshots of the bookmark
panel, using \[u0069_0328] (NFD) and \[u012F] (NFC). The example using
\[u012F] is superior (in my opinion) because it uses the single glyph the
font designer intended for that character, rather than combining two glyphs
that don't marry up too well.

> Here are the matches in the source, excluding some false positives.
>
> $ git grep 012F
> contrib/rfc1345/rfc1345.tmac:.char \[i;] \[u012F] \" LATIN SMALL LETTER I WITH OGONEK
> font/devhtml/R.proto:u0069_0328 24 0 0x012F
> font/devlj4/generate/text.map:433 012F u0069_0328
> font/devutf8/R.proto:u0069_0328 24 0 0x012F
> src/libs/libgroff/uniuni.cpp:  { "012F", "20069_0328" },
> src/utils/afmtodit/afmtodit.tables:  "012F", "0069_0328",
> src/utils/afmtodit/afmtodit.tables:  "iogonek", "012F",
> src/utils/hpftodit/hpuni.cpp:  { "433", "012F", },  // Lowercase I Ogonek
>
> The file "uniuni.cpp" is what's of relevance here.  It stores a large
> decomposition table that is directly derived from the Unicode
> Consortium's UnicodeData.txt file.  (In fact, I just updated that file
> for Unicode 15.1.)

This has no bearing on whether it is sensible to use NFD to send text to
output drivers rather than the actual unicode value of the character.

> > Whilst this might make sense for the text stream since afmtodit keys
> > the glyphs on the decomposed unicode.
>
> Having one canonical decomposition in GNU troff makes _lots_ of things
> easier, I'm sure.
>
> > I would love to know why we decompose,
>
> I don't know.  Maybe Werner can speak to the issue: he introduced the
> "uniuni.cpp" file in 2003 and then, in 2005, the "make-uniuni" script
> for regenerating it.
>
> > since none of our fonts include combining diacritical mark glyphs so
> > neither grops nor gropdf have a chance to synthesise the glyphs from
> > the constituent parts if it is not present in the font!
>
> It seems like a good thing to hold onto for the misty future when we get
> TTF/OTF font support.

So, it does not make sense now, but might in the future.
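(Incidentally, the truncation bug above is easy to confirm outside groff.
Here is a rough check using Perl's core Unicode::Normalize module - it has
nothing to do with troff's uniuni.cpp table, this is just a sketch to show
what the NFD and NFC forms of this character are:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Unicode::Normalize qw(NFD NFC);

    # U+012F LATIN SMALL LETTER I WITH OGONEK
    my $nfd = NFD("\x{012F}");
    printf "NFD: %s\n", join '_', map { sprintf '%04X', ord } split //, $nfd;
    # NFD: 0069_0328   (so the "u0069_032" seen in grout is a truncation)

    my $nfc = NFC("\x{0069}\x{0328}");
    printf "NFC: %04X\n", ord $nfc;
    # NFC: 012F        (recomposing gives back the single code point preconv emits)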
I would concede here if composite glyphs were as good as the single glyph
provided in the font, but the attached PDF shows this is not always true.
Also, in the TTF/OTF fonts I've examined, if the font contains combining
diacritics it also contains glyphs for all the base characters which can
take a diacritic, since these are just calls to subroutines with any
necessary repositioning. If you know of any fonts which include combining
diacritics but don't provide single glyphs with the base character and the
diacritic combined, please correct me.

> > Given that the purpose of \X is to pass meta-data to output drivers,
>
> ...which _should_ be able to handle NFD if they handle Unicode at all,
> right?

Of course; NFD is much better for any kind of sorting, which grops/gropdf
do not do, and if they did, they would of course change the given text to
NFD prior to sorting. As regards searching, it's a bit of a two-edged
sword.

For example, if the word "ocksŮ" in a utf8 document is used as a text
heading and a bookmark entry (think .SH "ocksŮ"), preconv converts the "Ů"
to \[u016E], and troff then applies NFD to match a glyph name in the U-TR
font - \[u0055_030A]. When .device and .output used "copy in" mode, the
original unicode code point \[u016E] was passed to the device, but if the
recent changes to \X ("new mode 3") are rolled out to the other 7(?)
commands which communicate text to the device drivers, they will receive
\[u0055_030A] instead. If this composite code (4 bytes in UTF16) is used as
the bookmark text, we have seen it can produce less than optimal results in
the bookmark pane, but it can also screw up searching in the PDF viewer.

Okular (a PDF viewer) has two search boxes: one for the text, where
entering "ocksŮ" will find the heading; the second is for the bookmarks,
and entering "ocksŮ" there will fail to find the bookmark, since the final
character is in fact two characters. This result may surprise users:
entering exactly the same keystrokes as they used when writing the document
finds the text in the document, but fails to find the bookmark.

Then why does it work in the text search, you may ask, since both have been
passed an NFD composite code? The answer is that in the grout passed to the
driver it becomes "Cu0055_030A", and although this looks like unicode it is
just the name of a glyph in the font, just as "Caq" in grout will find the
"quotesingle" glyph. The font header in the PDF identifies the PostScript
name of each glyph used for the document text, and the PDF viewer has a
lookup table which converts PostScript name "Uring" to U+016E "Ů" (back
where we started).

> > which probably will convert it to utf-8 or utf16, it seems odd to
> > decompose the output from preconv (utf16) before passing to the output
> > driver,
>
> It doesn't seem odd to me.  The fact that Unicode has supported at least
> four different normalization forms since, as I recall, Unicode 3.0 (or
> earlier?) suggests to me that there's no one obvious answer to this
> problem of representation.

As I've shown, the NFD used in grout (Cuxxxx_xxxx) is simply a key to a
font glyph; the information that this glyph is a composite is entirely
unnecessary for device control text. I need to know the unicode code point
delivered by preconv, so I can deliver that single character back as UTF16
text.

> > .device does not.
>
> The reason is that the request reads its argument in copy mode, and `\X`
> does not.
>
> And, uh, well, you can plan on that going away.  Or, at least, for the
> `device` request to align with whatever `\X` does.
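(Going back to the searching point above for a moment: the mismatch is
plain if you dump the bytes that would end up in the bookmark string. A
rough illustration only, not gropdf code - I've used UTF-16BE here because
that is what PDF text strings carry:)

    #!/usr/bin/perl
    use strict;
    use warnings;
    use utf8;
    use Unicode::Normalize qw(NFD);
    use Encode qw(encode);

    my $typed = "ocksŮ";       # what the user types into the search box (NFC)
    my $nfd   = NFD($typed);   # what the bookmark becomes if grout hands us u0055_030A

    printf "NFC: %s\n", uc unpack 'H*', encode('UTF-16BE', $typed);
    printf "NFD: %s\n", uc unpack 'H*', encode('UTF-16BE', $nfd);
    # NFC: 006F0063006B0073016E
    # NFD: 006F0063006B00730055030A
    # Different byte strings, so a literal bookmark search for "ocksŮ" cannot match.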
> > The correct decompose for 012F is 0069_0328, so it is just a string
> > truncation bug.
>
> Yes, I'm happy to fix that.
>
> > Just like you I would like to avoid "round-tripping", utf16 (preconv)
> > -> decomposed (troff) -> utf16 (gropdf).
>
> That's not a good example of a round trip since there is no path back
> from "grout" (device-independent output) to a GNU troff node list or
> token sequence.

I used round-tripping in the general sense that, after processing, you end
up back where you started (the same sense as you used it). Why does groff
have to be involved for something to be considered a round-trip?

> > This does not currently affect grops which does not support anything
> > beyond 8bit ascii.
>
> I'll be tackling that soonish.
>
> https://savannah.gnu.org/bugs/?62830
>
> > Do you agree it makes more sense for \X to pass \[u012F] rather than
> > \[u0069_0328]?
>
> Not really.  As far as I can tell there's no straightforward way to do
> anything different.  GNU troff _unconditionally_ runs all simple
> (non-composite) special characters through the `decompose_unicode()`
> function defined in "uniuni.cpp".

Ok, if it can't be done, just leave what you have changed in \X, but leave
.device and .output (plus friends) in the current copy-in mode, which seem
to be working fine as they are now, unless you have an example which
demonstrates a problem which your code solves. The only example you gave of
what you are "fixing", the .AUTHOR line in a mom example doc, actually
works fine, so it is probably not a good example to justify your changes.

> The place it does this is in `token::next()`, a deeply core function
> that handles pretty much every escape sequence in the language.
>
> https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n2295
>
> To do what you suggest would mean I'd have to add some kind of state to
> the formatter that alters how that function behaves, and I'd have to do
> it in six places, and make sure I unwound it correctly in each case (in
> error pathways as well as valid ones).

Why six?

> Because all of
>
> \X
> .device
> \!
> .output
> .cf
> .trf

Why are two missing?

> can inject stuff into "grout".
>
> That seems like a perilous path to me.

Not if you restrict the changes to \X only, and document the difference in
behaviour from the other 7 methods.

> I appreciate that the alternative is to hand the output drivers a
> problem labeled "composite Unicode character sequences".  I can try my
> hand at trying to write a patch for gropdf(1) if you like.  It feels
> like it should be easier than doing so in C++ (which I'll also have to
> do).

It is not a problem: I can certainly embed a composite glyph as part of a
bookmark; the problem is that it does not always look very good (see PDF)
and messes up searching for bookmarks.

> At least if the problem is as straightforward as I think it is:
>
> Upon encountering a Unicode-style escape sequence, meaning a byte
> sequence starting `\[u`: [1]
>
> 0. Assert that the next character on the input stream is an uppercase
>    hexadecimal digit.
>
> 1. Read a hexadecimal value until a non-hexadecimal character is found.
>    Convert that value to whatever encoding the target device requires.
>
> 2. If the next character is `_`, go to 1.
>
> 3. If the next character is `]`, stop.
>
> Would you like me to give this a shot?  A PDF reader expects UTF-16LE,
> right?

Have a go if you want, I've got it down to 10 extra lines, but the results
may be depressing (see PDF).
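For what it's worth, the shape of those few extra lines is roughly this. A
sketch only, not the actual gropdf patch (the helper name is made up), and
note that as far as I can tell PDF text strings want UTF-16BE with a
leading BOM, not UTF-16LE:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Encode qw(encode);

    # Hypothetical helper, not the real gropdf code: turn a grout token such
    # as "u012F" or "u0069_0328" into a PDF text string (BOM + UTF-16BE).
    sub groff_u_to_pdf_text {
        my ($tok) = @_;
        return undef unless $tok =~ /^u([0-9A-Fa-f]+(?:_[0-9A-Fa-f]+)*)$/;
        my $str = join '', map { chr hex } split /_/, $1;  # code points -> characters
        return "\xFE\xFF" . encode('UTF-16BE', $str);      # BOM + UTF-16BE bytes
    }

Either form coming out of grout produces a usable string this way; the
difference is only that the NFC form round-trips to the user's original
keystroke, and the NFD form does not.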
> Regards,
> Branden
>
> [1] The rules are little more complicated for GNU troff itself due to
>     support for the escape sequences `\[ul]`, `\[ua]`, and `\[uA]`.  But
>     as presently implemented, and per my intention, these will never
>     appear in "grout"--only Unicode code point identifiers.[2]
>
> [2] And `\[ul]` won't appear even in disguise because it maps to no
>     defined Unicode character.  But you don't get a diagnostic about it
>     because the formatter turns it into a drawing command.
>
>     $ printf '\\[ul]\n' | ./build/test-groff -T pdf -ww -Z | grep '^D'
>     DFd
>     Dl 5000 0
NCDvCopyIn.pdf
Description: Adobe PDF document