[bug #63074] [troff] support construction of arbitrary byte sequences in device control commands

G. Branden Robinson Sun, 14 Jan 2024 14:06:12 -0800

Follow-up Comment #25, bug#63074 (group groff):

[comment #24:]
> Bug #64484 is marked as fixed.


Right, but I believe there was a relationship nevertheless.

> I already have a reliable way to pass byte sequences in device control
> commands, .stringhex.

Okay.  But it didn't do anything about this failing test case (which
admittedly didn't exist until I started to research this issue).

https://git.savannah.gnu.org/cgit/groff.git/diff/src/roff/groff/tests/device-control-special-character-handling.sh?id=974c063f0a9e1ef6c0d2cac4755a3b9d6e925b0d

Of which the salient part is the actual test input:


input='.nf
\X#bogus1: esc \%man-beast\[u1F63C]\\[u1F00] -\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]#
.device bogus1: req \%man-beast\[u1F63C]\\[u1F00]
-\[aq]\[dq]\[ga]\[ha]\[rs]\[ti]
.ec @
@X#bogus2: esc @%man-beast@[u1F63C]@@[u1F00] -@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]#
.device bogus2: req @%man-beast@[u1F63C]@@[u1F00]
-@[aq]@[dq]@[ga]@[ha]@[rs]@[ti]'


...which looks pretty noisy but tests several things.

1.  Use of \X escape sequences versus `device` requests.
2.  Use of \% escape sequences in device control commands (do they get
removed?).
3.  Use of ordinary hyphens in device control commands (do they get converted
to some crazy Unicode thing?).
4.  Use of special character escape sequences to represent ASCII characters in
device control commands and which should therefore be passed through as
ASCII.
5.  Robustness in the face of a changed roff escape character.  This did *not*
work prior to the bug #64484 fix.

> This bug was previously named "warning messages when using special
> characters in TITLE or AUTHOR" and the attached cyrillic.pdf shows both the
> pdf title and author shown with cyrillics and no warnings. So I would say this
> one is dependent on bug #65098, i.e. merge the rest of my branch.

I hear your expression of urgency but I don't think "stringhex" is a good
long-term solution to what ails us.  You are correct in comment #22 that I did
not correctly apprehend at first what it was for.  I thought you developed it
because we had no way to reliably transmit arbitrary byte sequences to device
control commands.  But we did, sort of--it just needed to be made consistent
and reliable.  That it wasn't is what my test case attempts to illustrate and
what the fix to bug #64484 attempts to prove.

No, I accept your premise that the main driver behind "stringhex" was this:

> The problem lies in the original pdfmark API, if you look at the pdfmark.pdf
> you will see that in the sections describing .pdfhref M and .pdfhref L which
> both refer to a "dest-name" and "descriptive text", it says that if a
> dest-name is not given the first word in the description is used as the
> dest-name.

I appreciate your explanation.  If the problem was with the pdfmark API, then
let's fix the pdfmark API.

In particular, this:

> if a dest-name is not given the first word in the description is used as the
> dest-name

...strikes me as short-sighted, especially without any validation going on.  A
textual description of a hyperlink/bookmark might contain all sorts of crazy
stuff.  (Like Cyrillic or CJK characters or, worse, motion or type-size or
font-selection escape sequences.)  Assuming that it was going to be a
well-behaved sequence of ASCII bytes or even that one could "sanitize" or
"cln" one's way through was a hopeless notion.  That won't be practical until
we have a string iterator and more conditional expressions that enable the
user of an iterator to identify the type of each item in an iterated
string/macro/diversion.  But if I understand you correctly, we don't need that
fancy new stuff to solve the present problem, with stringhex or without.

It would probably benefit me to look up Peter's documentation on _mom_'s
"HEADING" macro.  It is a bit baffling to me that one has to repeat arguments
like this:


.HEADING 1 NAMED Гуляйпольщина "Гуляйпольщина"
...
.PDF_LINK Гуляйпольщина PREFIX ( SUFFIX ) "see: +"


> Where the "+" is replaced by the contents of the string register
> pdf:look(Гуляйпольщина), which would actually be a string of
> \[uXXXX] nodes, so would generate an error. This is what stringhex is for, to
> hide the contents so that groff does not see it as a sequence of nodes. The
> ideal solution would be to allow string registers to have an attribute (say
> "glass") which signals that groff should never try to interpret its contents,
> i.e. operate as if the escape mechanism was turned off just for the contents
> of that register, and have a way of turning that attribute on/off or an escape
> which sets the attribute for the enclosed string.
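
As a rough illustration of the "hiding" idea in the quoted text, and not of
how groff's stringhex is actually implemented, the shell sketch below
hex-encodes the bytes of a string so that the result contains only the
characters 0-9 and A-F, none of which a roff parser treats specially.  The
function name hex_hide is hypothetical, and encoding UTF-8 bytes (rather than
code points or some other unit) is an assumption made only for this example.

```shell
# hex_hide: hypothetical helper, not part of groff.
# Reads bytes on stdin and writes them as uppercase hexadecimal digits,
# two per byte, with no separators, followed by a newline.
hex_hide() {
  od -An -v -tx1 | tr -d ' \n' | tr 'a-f' 'A-F'
  echo
}

# Encode the heading text from the example above (assumes a UTF-8 source).
printf '%s' 'Гуляйпольщина' | hex_hide
```

A consumer of such a string would have to decode it again, which is the cost
of this approach compared with passing the text through unchanged.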

Right now I don't understand why we would need to elaborate a fairly
fundamental *roff language data type (the string) with a "glass" attribute
when, if you have a list indexed by a number or a _valid_ identifier, you can
simply define a string using a list item's index as a prefix.


.nr refno 1
.de DEFREF
.  nr refno +1
.  \" \\n defers interpolation of refno until the macro is called.
.  ds ref*id!\\n[refno]!tag \\$1
.  ds ref*id!\\n[refno]!author \\$2
.  ds ref*id!\\n[refno]!desc \\$3
.  ds ref*id!\\n[refno]!year \\$4
..
.DEFREF story "Dupr\[e aa]" "Best \%Story\%Book Ever" 1989


That's a simplified example of how macro packages have been implementing
arrays of data structures for decades, complete with idioms for "*" and "!",
which are not imposed by the language in any way.  Maybe I'm missing
something.

As it happens, this bug is probably fixed, too--I simply need to come up with
a convincing acceptance criterion for it.  A bit tough without adding a
feature to an existing output driver.  I trust it's obvious that, with
appropriate escaping, one can transmit "\000\001..\377" or "\x00..\xff" or
"\[u0000]..\[u00FF]".
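
To make two of the notations above concrete, here is a minimal shell sketch
that rewrites arbitrary bytes in the octal and \[u00XX] forms.  The helper
names to_octal and to_uescapes are hypothetical.  Whether a given output
driver decodes either form is not something this comment establishes, and
inside roff input the backslashes would still need whatever escaping the
context requires.

```shell
# to_uescapes: hypothetical helper, not part of groff.
# Reads bytes on stdin; writes each one as a \[u00XX] escape sequence.
to_uescapes() {
  od -An -v -tx1 |
    awk '{ for (i = 1; i <= NF; i++) printf "\\[u00%s]", toupper($i) }
         END { printf "\n" }'
}

# to_octal: hypothetical helper, not part of groff.
# Reads bytes on stdin; writes each one as a three-digit \NNN octal escape.
to_octal() {
  od -An -v -to1 |
    awk '{ for (i = 1; i <= NF; i++) printf "\\%s", $i }
         END { printf "\n" }'
}

printf 'A\001\377' | to_uescapes   # prints \[u0041]\[u0001]\[u00FF]
printf 'A\001\377' | to_octal      # prints \101\001\377
```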

I will try to make some time to reply to comment #22 more thoughtfully soon. 
Leaving in "Need Info" status and assigned to myself for that reason.


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?63074>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
