Follow-up Comment #12, bug #63074 (project groff):

Hi Deri,
Thank you very much for the illuminating response.

[comment #11:]

> The messages which started this bug: "special characters are not
> defined", have very little to do with the message you recently
> suppressed.

I followed most of the rest of this message, but I didn't see where
anyone suppressed anything!  Can you clarify?

[output comparison operator example snipped]

I love a good empirical test!  :D

> Now to deal with why Cyrillic glyphs do not appear in the bookmark
> panel, but do appear in the text of the document.  The text is using
> the embedded fonts, which contain the Cyrillic glyphs mapped to
> appropriate code points.  The bookmark panel is using whatever system
> font you have configured for window text.  The system font will have
> Cyrillic glyphs, but they will be using UTF code points, not the
> 8-bit codes available to a Type 1 PostScript font.

Ouch!  This almost seems like a lack of foresight or i18n on the part
of PDF viewer programs...but perhaps not, as you address below.

> The PDF standard allows two encodings for strings in PDFs.  We are
> using PDFDocEncoding, which is a superset of ISO Latin-1 and does not
> include Cyrillic.  The alternative is UTF-16 (UTF-8 is not
> supported),

The Adobe/Microsoft axis will bedevil us forever if they get their
way.

> the string must start with a BOM character, and this would allow any
> UTF glyph to appear in bookmarks.  The reason I used the 8-bit
> encoding is because the groff .asciify command converts the \[UXXXX]
> back to ASCII for me and as a bonus drops all other escapes from the
> string which could not be represented as ASCII.  So a string such as
> "\fB\s'+2p'foo\s'-2p'\fP" would be converted to "foo".  The only
> niggle was the warning message (now suppressed) each time it dropped
> a node such as "\fB".

Yes.  I don't know if we want to change the "asciify" request or add a
"sanitize" one, but either way there should be some means for the user
to ask troff to extract only the glyphs from a
string/macro/diversion, i.e., with _deliberate_ discard of everything
else.  That is the behavior we need for applications like this.

I regret the name "asciify"; it implies too much, namely that you'll
get only ASCII _output_, as opposed to merely groff's representations
of characters.

> If I dropped the .asciify from pdf.tmac it would mean all the
> \[uXXXX] strings would reach the postprocessor gropdf,

As-is?  Meaning they'd appear like 'x X blah\[uXXXX]blah'?  That's
excellent, and what I already proposed!

> which could then assemble a UTF-16 string from the hex numbers.  As a
> proof of concept I made some changes to pdf.tmac and gropdf, and
> "pdfmom -k -f U-T mom-ru.mom" produced the attached PDF.  Still a
> fair bit to do; the biggest job is to sanitise the string to remove
> unwanted escapes, convert any glyph-producing escapes such as \C and
> \N back to a UTF-16 character, and convert basic Latin characters to
> UTF-16.  I suspect a deep dive into the asciify routine in groff will
> be helpful.

People really shouldn't have to do this in macro packages.  String
processing in the roff language is incredibly tedious.  Saith I, after
my experiences with an*abbreviate-page-title,
an*abbreviate-inner-footer, and an*scan-for-backslash.

https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/an.tmac

I've been contemplating adding a `sanitize` request for some time.

Regards,
Branden
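
P.S.  To make the "deliberate discard" behavior concrete for the
archive, here is a minimal sketch of the diversion-plus-asciify
technique (the diversion name "go:div" is invented, and this is not
the actual pdf.tmac code; the sample input string is the one from your
message):

.nf
.di go:div
\fB\s'+2p'foo\s'-2p'\fP
.di
.asciify go:div
.fi

After the "asciify", rereading the diversion with ".go:div" produces
just "foo"; the font and size change nodes are dropped (formerly with
a warning for each one).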
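
Without the asciify step, the device control command reaching gropdf
in the intermediate output would carry the \[uXXXX] escapes intact,
i.e., something like this (the subcommand spelling is illustrative
only; the sample text is Cyrillic "privet"):

x X pdf: bookmark 1 \[u041F]\[u0440]\[u0438]\[u0432]\[u0435]\[u0442]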
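
And at the PDF level, the two bookmark string encodings you describe
would look roughly like this (bytes hand-assembled for the same sample
word: the BOM is FE FF, followed by UTF-16BE code units written as
octal escapes):

/Title (foo)
  % PDFDocEncoding: adequate for Latin-1-ish text only
/Title (\376\377\004\037\004\100\004\070\004\062\004\065\004\102)
  % UTF-16BE with BOM: any Unicode character, Cyrillic included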