Follow-up Comment #5, bug #66675 (group groff):

At 2025-01-16T21:43:06-0500, G. Branden Robinson wrote:
> What you link to is some new/refactored code associated with my
> titanic struggle to land Unicode-rich PDF bookmarks (and device
> extension command arguments generally).
>
> At the point `valid_unicode_code_sequence()` is called, the caller
> _knows_ -- or is supposed to -- that it is expecting a Unicode code
> sequence.  Possibly even a composite one.
[...]
> The problem, I suspect, is that I neglected to add logic to the
> character definition request handlers.  They should call that function
> only if they truly require a valid Unicode code point sequence, and, of
> course, they don't.  You can `.char \[unhappy] :-(` all day long as far
> as they're concerned.
>
> The action item for this ticket is to check all of
> `valid_unicode_code_sequence()`'s call sites.  Ensure that all are
> necessary and/or guarded by appropriate conditionals.
I looked into this, and the root cause is both better and worse than I expected.

The good news is that I didn't miss any places where I should have called `valid_unicode_code_sequence()` and didn't, or call it where I shouldn't have.

The bad news is that the existing logic for interpreting all four character definition requests (`char`, `fchar`, `schar`, and `fschar`) has a common core that does something sneaky and lazy.  It also explains something I had long wondered about: why do groff's character definition requests use the interpolation syntax for the object they're defining?  This is unique in groff, or at least inconsistent with macro, string, and register definition syntax, all of which will be more familiar to most any user at any level of expertise.

The sneaky and lazy thing that `do_define_character()` does is read its first argument in interpretation mode, not copy mode.  So the same logic that is used to interpret a _defined_ character is _also_ used to set up one that _isn't yet_ defined.  To certain minds that may sound great.  But it has weird consequences.  This bug is one.  Here's another.  I instrumented the formatter to reveal how it's interpreting its input.

  .char a A
  troff:<standard input>:1: debug: GBR: (un)defined character 'char97'
  .char \[foo] bar
  troff:<standard input>:2: debug: GBR: (un)defined character 'foo'
  .char \[foo\[bar]] baz
  troff:<standard input>:3: error: ignoring invalid character definition;
    expected one ordinary or special character to define, got character ']'
  .char \[foo\|bar] baz
  troff:<standard input>:4: error: '\|' is not allowed in an escape
    sequence argument
  troff:<standard input>:4: error: ignoring invalid character definition;
    expected one ordinary or special character to define, got character 'a'

("(un)defined" refers to the fact that you can undefine a character with these requests simply by omitting the second argument.  There's not a separate control flow path for that, which is again clever, this time in a good way.)
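For comparison, here's a quick sketch of the inconsistency I mean; the names are invented for illustration.  The other definition requests take a plain identifier, while the character definition requests take the interpolation (escape sequence) syntax.

```roff
.\" Hypothetical names, for illustration only.
.ds mystr hello       \" string: plain identifier
.nr myreg 42          \" register: plain identifier
.de mymac             \" macro: plain identifier
..
.char \[mychar] X     \" character: interpolation syntax
```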
This stuff is kind of hard to introspect; I haven't yet written the `pchar` request I've been contemplating.  (My goal is to disclose the resolution process of the specified character's name and, if it's a user-defined character, to dump its contents using the same mechanism that I have planned for strings and macros, which I mean to add to `pm`, and which is observable in rudimentary form in the existing [but pretty new] `pline` request.)

I have a short-term plan and a longer-term proposal.

Short-term
----------

Rewrite (what is named in my working copy) `define_character()` to read the next argument (the first to `char` and friends) in copy mode.  That means we have to parse it ourselves.  Sigh.  But it also means that nodes can't sneak into the identifier unless we let them.  (I guess that explains why the `b` in `baz` seems to vanish in the final example above; the presence of the thin space node in a character identifier deranges the formatter's internal state.  That's a guess.)

A big, big problem with interpretation mode, one that has caused grief on multiple fronts, is that there's only one place it happens--the gigantic `token::next()` state machine.

People will still be able to do clever things, like defining a special character named `foobar` like this:

  .ds f foo
  .ds b bar
  .char \*f\*b baz

...because `\*` is interpreted even in copy mode.  groff(7):

  Escape sequence short reference
    The escape sequences \", \#, \$, \*, \?, \a, \e, \n, \t, \g, \V,
    and \newline are interpreted even in copy mode.

With this proposed change, I predict that the character definition of `\[unhappy]` will work once again.  It's an acceptance criterion.

Long-term
---------

Yup--I want to reform the language again!  As noted above, the reuse of interpolation syntax for definition syntax is anomalous in groff.
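As a concrete acceptance test for the short-term change, the definition from the original report should be accepted again; a sketch of the expected-to-work input follows (the sample sentence is my own).

```roff
.\" A character name that is not a valid Unicode code point
.\" sequence.  After the short-term fix, defining and using it
.\" should provoke no diagnostic.
.char \[unhappy] :-(
I am \[unhappy] about this bug.
```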
Sure, it's clever if you love that feature of C (about which, as I recall, Kernighan and Ritchie both expressed some doubts), but it's inconsistent both with other definition operations in *roff and with how special character identifiers are represented in font description files.  In the latter, the special character `\[foo]` is spelled `\foo`.  And that's a good representation form for the planned `pchar` request, too.  A thing to note is that the `\` in `\foo` is not the *roff escape character; it's just a backslash.

Both syntaxes will have to be accepted for a while, of course.  I wonder how many people actually do their own character definitions.

_______________________________________________________
Reply to this item at:

  <https://savannah.gnu.org/bugs/?66675>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/