Hi Deri,

At 2024-08-31T17:07:28+0100, Deri wrote:
> On Saturday, 31 August 2024 00:07:57 BST G. Branden Robinson wrote:

[fixing two of my own typos and one wordo in the following]

> > It would be cleaner and simpler to provide a mechanism for
> > processing a string directly, discarding escape sequences (like
> > vertical motions or break points [with or without hyphenation]).
> > This point is even more emphatic because of the heavy
> > representation of special characters in known use cases.  That is,
> > to "sanitize" (or "pdfclean") such strings by round-tripping them
> > through a process that converts a sequence of easily handled bytes
> > like "\['a]" or "\[u0411]" into a special character node and then
> > back again seems wasteful and fragile to me.
>
> This would be great, but I see some problems with the current code.
> Doing this:-
>
> [derij@pip build (master)]$ echo ".device \[u012F]"|./test-groff -Tpdf -Z |
> grep "^x X"
> x X \[u012F]
> [derij@pip build (master)]$ echo "\X'\[u012F]'"|test-groff -Tpdf -Z |
> grep "^x X"
> x X \[u0069_032]
>
> Shows that the \[u012F] has been decomposed (wrongly!) by \X.
You're raising two issues here.

The decomposed Unicode sequence should be:

  u0069_0328

not

  u0069_032

I 100% agree that that's a bug--thank you for finding it.  I'll fix
it.

But, is doing the decomposition wrong?  I think it's intended.  Here's
what our documentation says.

groff_char(7):

    Unicode code points can be composed as well; when they are, GNU
    troff requires NFD (Normalization Form D), where all Unicode
    glyphs are maximally decomposed.  (Exception: precomposed
    characters in the Latin‐1 supplement described above are also
    accepted.  Do not count on this exception remaining in a future
    GNU troff that accepts UTF‐8 input directly.)  Thus, GNU troff
    accepts “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as ways
    to input “café”.  (Due to its legacy 8‐bit encoding compatibility,
    at present it also accepts “caf\[u00E9]” on ISO Latin‐1 systems.)

Here are the matches in the source, excluding some false positives.

$ git grep 012F
contrib/rfc1345/rfc1345.tmac:.char \[i;] \[u012F] \" LATIN SMALL LETTER I WITH OGONEK
font/devhtml/R.proto:u0069_0328 24 0 0x012F
font/devlj4/generate/text.map:433 012F u0069_0328
font/devutf8/R.proto:u0069_0328 24 0 0x012F
src/libs/libgroff/uniuni.cpp: { "012F", "20069_0328" },
src/utils/afmtodit/afmtodit.tables: "012F", "0069_0328",
src/utils/afmtodit/afmtodit.tables: "iogonek", "012F",
src/utils/hpftodit/hpuni.cpp: { "433", "012F", }, // Lowercase I Ogonek

The file "uniuni.cpp" is the one of relevance here.  It stores a large
decomposition table that is directly derived from the Unicode
Consortium's UnicodeData.txt file.  (In fact, I just updated that file
for Unicode 15.1.)

> Whilst this might make sense for the text stream since afmtodit keys
> the glyphs on the decomposed unicode.

Having one canonical decomposition in GNU troff makes _lots_ of things
easier, I'm sure.

> I would love to know why we decompose,

I don't know.  Maybe Werner can speak to the issue: he introduced the
"uniuni.cpp" file in 2003 and then, in 2005, the "make-uniuni" script
for regenerating it.

> since none of our fonts include combining diacritical mark glyphs so
> neither grops nor gropdf have a chance to synthesise the glyphs from
> the constituent parts if it is not present in the font!

It seems like a good thing to hold onto for the misty future when we
get TTF/OTF font support.

> Given that the purpose of \X is to pass meta-data to output drivers,

...which _should_ be able to handle NFD if they handle Unicode at all,
right?

> which probably will convert it to utf-8 or utf16, it seems odd to
> decompose the output from preconv (utf16) before passing to the
> output driver,

It doesn't seem odd to me.  The fact that Unicode has supported at
least four different normalization forms since, as I recall, Unicode
3.0 (or earlier?) suggests to me that there's no one obvious answer to
this problem of representation.

> .device does not.

The reason is that the request reads its argument in copy mode, and
`\X` does not.  And, uh, well, you can plan on that going away.  Or,
at least, for the `device` request to align with whatever `\X` does.

> The correct decompose for 012F is 0069_0328, so it is just a string
> truncation bug.

Yes, I'm happy to fix that.  (The expected form is easy to
double-check; see the short Perl check after the footnotes at the end
of this message.)

> Just like you I would like to avoid "round-tripping", utf16 (preconv)
> -> decomposed (troff) -> utf16 (gropdf).

That's not a good example of a round trip, since there is no path back
from "grout" (device-independent output) to a GNU troff node list or
token sequence.

> This does not currently affect grops which does not support anything
> beyond 8bit ascii.
I'll be tackling that soonish.

https://savannah.gnu.org/bugs/?62830

> Do you agree it makes more sense for \X to pass \[u012F] rather than
> \[u0069_0328]?

Not really.  As far as I can tell there's no straightforward way to do
anything different.  GNU troff _unconditionally_ runs all simple
(non-composite) special characters through the `decompose_unicode()`
function defined in "uniuni.cpp".  The place it does this is in
`token::next()`, a deeply core function that handles pretty much every
escape sequence in the language.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?h=1.23.0#n2295

To do what you suggest would mean I'd have to add some kind of state
to the formatter that alters how that function behaves, and I'd have
to do it in six places, and make sure I unwound it correctly in each
case (in error pathways as well as valid ones).

Why six?  Because all of

  \X
  .device
  \!
  .output
  .cf
  .trf

can inject stuff into "grout".

That seems like a perilous path to me.  I appreciate that the
alternative is to hand the output drivers a problem labeled "composite
Unicode character sequences".

I can try my hand at writing a patch for gropdf(1) if you like.  It
feels like it should be easier than doing so in C++ (which I'll also
have to do).  At least if the problem is as straightforward as I think
it is:

Upon encountering a Unicode-style escape sequence, meaning a byte
sequence starting `\[u`: [1]

0. Assert that the next character on the input stream is an uppercase
   hexadecimal digit.
1. Read a hexadecimal value until a non-hexadecimal character is
   found.  Convert that value to whatever encoding the target device
   requires.
2. If the next character is `_`, go to 1.
3. If the next character is `]`, stop.

(A rough Perl sketch of this loop appears after the footnotes below.)

Would you like me to give this a shot?  A PDF reader expects UTF-16LE,
right?

Regards,
Branden

[1] The rules are a little more complicated for GNU troff itself due
    to support for the escape sequences `\[ul]`, `\[ua]`, and `\[uA]`.
    But as presently implemented, and per my intention, these will
    never appear in "grout"--only Unicode code point identifiers.[2]

[2] And `\[ul]` won't appear even in disguise because it maps to no
    defined Unicode character.  But you don't get a diagnostic about
    it because the formatter turns it into a drawing command.

    $ printf '\\[ul]\n' | ./build/test-groff -T pdf -ww -Z | grep '^D'
    DFd Dl 5000 0
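Here's a quick way to double-check the expected NFD form, using Perl's
core Unicode::Normalize module (just a sketch to run at a prompt,
nothing from the groff tree):

#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Maximally decompose U+012F (LATIN SMALL LETTER I WITH OGONEK) and
# print the result groff-style.  This prints "u0069_0328", matching
# uniuni.cpp and confirming that the "u0069_032" above lost its final
# digit.
my $nfd = NFD("\x{012F}");
printf "u%s\n", join '_', map { sprintf '%04X', ord } split //, $nfd;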
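And here's roughly the shape I'd expect the loop above to take in
gropdf's Perl.  This is a sketch only: the subroutine name is my
invention, not anything in gropdf today, and note that the PDF spec
actually calls for UTF-16BE with a leading byte order mark in text
strings, not UTF-16LE.

#!/usr/bin/perl
use strict;
use warnings;

# Sketch, not gropdf's real code: convert a grout "x X" argument
# containing \[uXXXX] and \[uXXXX_YYYY...] escape sequences into a
# PDF text string (UTF-16BE with a byte order mark).
sub grout_to_utf16 {
    my ($arg) = @_;
    my $out = "\xFE\xFF";    # byte order mark
    while (length $arg) {
        # Steps 0-3 above: '\[u', then hexadecimal values joined by
        # '_', terminated by ']'.
        if ($arg =~ s/^\\\[u([0-9A-F]+(?:_[0-9A-F]+)*)\]//) {
            for my $cp (map { hex } split /_/, $1) {
                if ($cp > 0xFFFF) {
                    # Code points beyond the BMP need a surrogate
                    # pair.
                    my $v = $cp - 0x10000;
                    $out .= pack 'n2', 0xD800 | ($v >> 10),
                                       0xDC00 | ($v & 0x3FF);
                }
                else {
                    $out .= pack 'n', $cp;
                }
            }
        }
        else {
            # Anything else in the argument is an ordinary ASCII
            # byte; just widen it to one UTF-16 code unit.
            $out .= pack 'n', ord substr $arg, 0, 1, '';
        }
    }
    return $out;
}

# The NFD input form from groff_char(7) quoted above:
my $pdf_string = grout_to_utf16('caf\[u0065_0301]');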