Follow-up Comment #12, bug #63074 (project groff):

Hi Deri,
Thank you very much for the illuminating response.

[comment #11:]

> The messages which started this bug: "special characters are not
> defined", have very little to do with the message you recently
> suppressed.

I followed most of the rest of this message, but I didn't see where
anyone suppressed anything!  Can you clarify?

[output comparison operator example snipped]

I love a good empirical test!  :D

> Now to deal with why Cyrillic glyphs do not appear in the bookmark
> panel, but do appear in the text of the document.  The text is using
> the embedded fonts, which contain the Cyrillic glyphs mapped to
> appropriate code points.  The bookmark panel is using whatever system
> font you have configured for window text.  The system font will have
> Cyrillic glyphs, but they will be using UTF code points, not the
> 8-bit codes available to a Type 1 PostScript font.

Ouch!  This almost seems like a lack of foresight or i18n on the part
of PDF viewer programs...but perhaps not, as you address below.

> The PDF standard allows two encodings for strings in PDFs.  We are
> using PDFDocEncoding, which is a superset of ISO Latin-1 and does not
> include Cyrillic.  The alternative is UTF-16 (UTF-8 is not
> supported),

The Adobe/Microsoft axis will bedevil us forever if they get their
way.

> the string must start with a BOM character, and this would allow any
> UTF glyph to appear in bookmarks.  The reason I used the 8-bit
> encoding is because the groff .asciify command converts the \[UXXXX]
> back to ASCII for me and as a bonus drops all other escapes from the
> string which could not be represented as ASCII.  So a string such as
> "\fB\s'+2p'foo\s'-2p'\fP" would be converted to "foo".  The only
> niggle was the warning message (now suppressed) each time it dropped
> a node such as "\fB".

Yes.  I don't know if we want to change the "asciify" request or add a
"sanitize" one, but either way there should be some means for the user
to ask troff to extract only the glyphs from a
string/macro/diversion, i.e., with _deliberate_ discard of everything
else.  That is the behavior we need for applications like this.

I regret the name "asciify"; it implies too much, namely that you'll
get only ASCII _output_, as opposed to merely groff's representations
of characters.

> If I dropped the .asciify from pdf.tmac it would mean all the
> \[uXXXX] strings would reach the postprocessor gropdf,

As-is?  Meaning they'd appear like 'x X blah\[uXXXX]blah'?  That's
excellent, and what I already proposed!

> which could then assemble a UTF-16 string from the hex numbers.  As a
> proof of concept I made some changes to pdf.tmac and gropdf, and
> "pdfmom -k -f U-T mom-ru.mom" produced the attached PDF.  Still a
> fair bit to do; the biggest job is to sanitise the string to remove
> unwanted escapes, convert any glyph-producing escapes such as \C and
> \N back to a UTF-16 character, and convert basic Latin characters to
> UTF-16.  I suspect a deep dive into the asciify routine in groff will
> be helpful.

People really shouldn't have to do this in macro packages.  String
processing in the roff language is incredibly tedious.  Saith I, after
my experiences with an*abbreviate-page-title,
an*abbreviate-inner-footer, and an*scan-for-backslash.

https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/an.tmac

I've been contemplating adding a `sanitize` request for some time.

Regards,
Branden
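
P.S.  To make the "deliberate discard" behavior concrete for the
archive, here is a minimal sketch of the diversion-plus-asciify
technique (the diversion name "go:div" is invented, and this is not
the actual pdf.tmac code; the sample input string is the one from your
message):

.nf
.di go:div
\fB\s'+2p'foo\s'-2p'\fP
.di
.asciify go:div
.fi

After the "asciify", rereading the diversion with ".go:div" produces
just "foo"; the font and size change nodes are dropped (formerly with
a warning for each one).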
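
Without the asciify step, the device control command reaching gropdf
in the intermediate output would carry the \[uXXXX] escapes intact,
i.e., something like this (the subcommand spelling is illustrative
only; the sample text is Cyrillic "privet"):

x X pdf: bookmark 1 \[u041F]\[u0440]\[u0438]\[u0432]\[u0435]\[u0442]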
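
And at the PDF level, the two bookmark string encodings you describe
would look roughly like this (bytes hand-assembled for the same sample
word: the BOM is FE FF, followed by UTF-16BE code units written as
octal escapes):

/Title (foo)
  % PDFDocEncoding: adequate for Latin-1-ish text only
/Title (\376\377\004\037\004\100\004\070\004\062\004\065\004\102)
  % UTF-16BE with BOM: any Unicode character, Cyrillic included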