Follow-up Comment #5, bug #66675 (group groff):

At 2025-01-16T21:43:06-0500, G. Branden Robinson wrote:
> What you link to is some new/refactored code associated with my
> titanic struggle to land Unicode-rich PDF bookmarks (and device
> extension command arguments generally).
>
> At the point `valid_unicode_code_sequence()` is called, the caller
> _knows_ -- or is supposed to -- that it is expecting a Unicode code
> sequence.  Possibly even a composite one.
[...]
> The problem, I suspect, is that I neglected to add logic to the
> character definition request handlers.  They should call that function
> only if they truly require a valid Unicode code point sequence, and, of
> course, they don't.  You can `.char \[unhappy] :-(` all day long as far
> as they're concerned.
>
> The action item for this ticket is to check all of
> `valid_unicode_code_sequence()`'s call sites.  Ensure that all are
> necessary and/or guarded by appropriate conditionals.
I looked into this, and the root cause is both better and worse than I expected.

The good news is that I didn't miss any places where I should have called `valid_unicode_code_sequence()` and didn't, or call it where I shouldn't have.

The bad news is that the existing logic for interpreting all four character definition requests (`char`, `fchar`, `schar`, and `fschar`) has a common core that does something sneaky and lazy.  It also explains something I had long wondered about: why do groff's character definition requests use the interpolation syntax for the object they're defining?  This is unique in groff, or at least inconsistent with macro, string, and register definition syntax, all of which will be more familiar to most any user at any level of expertise.

The sneaky and lazy thing that `do_define_character()` does is read its first argument in interpretation mode, not copy mode.  So the same logic that is used to interpret a _defined_ character is _also_ used to set up one that _isn't yet_ defined.  To certain minds that may sound great.  But it has weird consequences.  This bug is one.  Here's another.  I instrumented the formatter to reveal how it's interpreting its input.

  .char a A
  troff:<standard input>:1: debug: GBR: (un)defined character 'char97'
  .char \[foo] bar
  troff:<standard input>:2: debug: GBR: (un)defined character 'foo'
  .char \[foo\[bar]] baz
  troff:<standard input>:3: error: ignoring invalid character definition;
    expected one ordinary or special character to define, got character ']'
  .char \[foo\|bar] baz
  troff:<standard input>:4: error: '\|' is not allowed in an escape
    sequence argument
  troff:<standard input>:4: error: ignoring invalid character definition;
    expected one ordinary or special character to define, got character 'a'

("(un)defined" refers to the fact that you can undefine a character with these requests simply by omitting the second argument.  There's not a separate control flow path for that, which is again clever, this time in a good way.)
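For comparison, here's a quick sketch of the inconsistency I mean; the names are invented for illustration.  The other definition requests take a plain identifier, while the character definition requests take the interpolation (escape sequence) syntax.

```roff
.\" Hypothetical names, for illustration only.
.ds mystr hello       \" string: plain identifier
.nr myreg 42          \" register: plain identifier
.de mymac             \" macro: plain identifier
..
.char \[mychar] X     \" character: interpolation syntax
```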
This stuff is kind of hard to introspect; I haven't yet written the `pchar` request I've been contemplating.  (My goal is to disclose the resolution process of the specified character's name and, if it's a user-defined character, to dump its contents using the same mechanism that I have planned for strings and macros, which I mean to add to `pm`, and which is observable in rudimentary form in the existing [but pretty new] `pline` request.)

I have a short-term plan and a longer-term proposal.

Short-term
----------

Rewrite (what is named in my working copy) `define_character()` to read the next argument (the first to `char` and friends) in copy mode.  That means we have to parse it ourselves.  Sigh.  But it also means that nodes can't sneak into the identifier unless we let them.  (I guess that explains why the `b` in `baz` seems to vanish in the final example above; the presence of the thin space node in a character identifier deranges the formatter's internal state.  That's a guess.)

A big, big problem with interpretation mode, one that has caused grief on multiple fronts, is that there's only one place it happens--the gigantic `token::next()` state machine.

People will still be able to do clever things, like defining a special character named `foobar` like this:

  .ds f foo
  .ds b bar
  .char \*f\*b baz

...because `\*` is interpreted even in copy mode.  groff(7):

  Escape sequence short reference
    The escape sequences \", \#, \$, \*, \?, \a, \e, \n, \t, \g, \V,
    and \newline are interpreted even in copy mode.

With this proposed change, I predict that the character definition of `\[unhappy]` will work once again.  It's an acceptance criterion.

Long-term
---------

Yup--I want to reform the language again!  As noted above, the reuse of interpolation syntax for definition syntax is anomalous in groff.
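As a concrete acceptance test for the short-term change, the definition from the original report should be accepted again; a sketch of the expected-to-work input follows (the sample sentence is my own).

```roff
.\" A character name that is not a valid Unicode code point
.\" sequence.  After the short-term fix, defining and using it
.\" should provoke no diagnostic.
.char \[unhappy] :-(
I am \[unhappy] about this bug.
```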
Sure, it's clever if you love that feature of C (about which, as I recall, Kernighan and Ritchie both expressed some doubts), but it's inconsistent both with other definition operations in *roff and with how special character identifiers are represented in font description files.  In the latter, the special character `\[foo]` is spelled `\foo`.  And that's a good representation form for the planned `pchar` request, too.  A thing to note is that the `\` in `\foo` is not the *roff escape character; it's just a backslash.

Both syntaxes will have to be accepted for a while, of course.  I wonder how many people actually do their own character definitions.

_______________________________________________________
Reply to this item at:

  <https://savannah.gnu.org/bugs/?66675>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/