On Saturday, 31 August 2024 00:07:57 BST G. Branden Robinson wrote:
> It would be cleaner and simpler to provide a mechanism for processing
> a string directly, discarding escape sequences (like vertical motions
> or break points [with or without hyphenation]). This point is even
> more emphatic because of the heavy representation of special
> characters in known use cases. That is, to "sanitize" (or "pdfclean")
> such strings by round-tripping them through a process that converts a
> sequence of easily handled bytes like "\['a]" or "\[u0411]" into a
> special character node and then back again seems wasteful and fragile
> to me.
Hi Branden,

This would be great, but I see some problems with the current code.
Doing this:-

[derij@pip build (master)]$ echo ".device \[u012F]"|./test-groff -Tpdf -Z | grep "^x X"
x X \[u012F]
[derij@pip build (master)]$ echo "\X'\[u012F]'"|test-groff -Tpdf -Z | grep "^x X"
x X \[u0069_032]

shows that \[u012F] has been decomposed (wrongly!) by \X, while .device
passes it through untouched. Decomposition might make sense for the
text stream, since afmtodit keys glyphs on the decomposed Unicode code
points, but I would love to know why we decompose at all: none of our
fonts include combining diacritical mark glyphs, so neither grops nor
gropdf has any chance of synthesising a glyph from its constituent
parts if it is not present in the font!

Given that the purpose of \X is to pass metadata to output drivers,
which will probably convert it to UTF-8 or UTF-16, it seems odd to
decompose the output from preconv (UTF-16) before passing it to the
output driver; .device does not. Also, the correct decomposition of
u012F is u0069_0328, so the output above shows a string truncation bug
on top of the unwanted decomposition.

Just like you, I would like to avoid "round-tripping": UTF-16 (preconv)
-> decomposed (troff) -> UTF-16 (gropdf). This does not currently
affect grops, which supports nothing beyond 8-bit ASCII.

Do you agree it makes more sense for \X to pass \[u012F] rather than
\[u0069_0328]?

Cheers

Deri
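P.S. For anyone who wants to double-check the decomposition claim: the
canonical (NFD) decomposition of U+012F really is U+0069 U+0328, which
a Perl one-liner can confirm (Unicode::Normalize is a core module, and
gropdf is Perl anyway):

$ perl -MUnicode::Normalize -e 'printf "%04X ", ord for split //, NFD(chr 0x012F)'
0069 0328

So the "x X \[u0069_032]" output above has indeed lost its final digit.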
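P.P.S. To make the round-trip concrete: as long as \X hands the driver
decomposed escapes, gropdf has to do something like the sketch below
before it can build the UTF-16BE strings that PDF metadata requires.
This is NOT gropdf's actual code, just an illustration using core
modules, and the recompose() helper name is mine:

#!/usr/bin/perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);
use Encode qw(encode);

# Turn each \[uXXXX] or \[uXXXX_YYYY...] escape back into the single
# composed character it came from.
sub recompose {
    my ($s) = @_;
    $s =~ s{\\\[u([0-9A-Fa-f]{4,6}(?:_[0-9A-Fa-f]{4,6})*)\]}{
        NFC(join '', map { chr hex } split /_/, $1)
    }ge;
    return $s;
}

my $meta  = "x X \\[u0069_0328]";        # what \X currently emits
my $text  = recompose($meta);            # back to a single U+012F
my $utf16 = encode('UTF-16BE', $text);   # what a PDF string needs
printf "%04X\n", ord(substr($text, -1)); # prints 012F

If \X passed \[u012F] through untouched, the NFC step (and the scope
for truncation bugs like the one above) would simply disappear.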