Follow-up Comment #4, bug #66675 (group groff):

At 2025-01-16T21:05:03-0500, Dave wrote:
> Follow-up Comment #2, bug #66675 (group groff):
>
> [comment #1 comment #1:]
>> \[u...] means an unicode character.
>
> It was permitted to mean other things for the past at least 20 years
> and I bet a lot longer.
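For concreteness, here is the shape of input at issue -- a sketch only,
restating the pattern from the transcript quoted further down, using a
deliberately non-Unicode bracketed name:

  .\" Define a special character whose name, "unhappy", is not a
  .\" Unicode code point sequence.  Historically, .char accepted this.
  .char \[unhappy] :-(
  I feel \[unhappy] today.

Piped through nroff, input like this has long rendered as
"I feel :-( today."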
I've always wondered why they weren't spelled exactly as Unicode does,
like `\[U+002D]`, but that's water under the bridge.  groff's naming
scheme for Unicode code points indeed has a long history.

> $ nroff --version
> GNU nroff (groff) version 1.19.2
> $ printf '.char \[unhappy] :-(\nI feel \[unhappy] today.\n' | nroff | cat -s
> I feel :‐( today.
>
> If this is an intentionally breaking change (a la bug #66673), it
> needs to be documented as such.

Nope, not intentional.

Getting back to the original report...

At 2025-01-16T16:54:20-0500, Dave wrote:
> Date: Thu 16 Jan 2025 03:54:17 PM CST  By: Dave <barx>
>
> From at least groff 1.19.2 through 1.23, this command produced nothing
> on stdout or stderr:
>
>   echo '.char \[unhappy] :-(' | groff
>
> This is as it should be: the code merely defines a perfectly
> legitimate character and does nothing with it.

Agreed.

> The latest groff build produces two diagnostics on stderr:
>
>   troff:<standard input>:1: error: special character 'unhappy' is
>   invalid: Unicode special character sequence has non-hexadecimal
>   digit 'n'
>
>   troff:<standard input>:1: error: bad character definition

Uh-oh.  Mea culpa.

> This erroneous error was introduced sometime after August 11.  I blame
> an overzealous
> [http://git.savannah.gnu.org/cgit/groff.git/commit/?id=d29abf70a
> commit d29abf70a].

That's near to, but not exactly, where I'd place the blame.  What you
link to is some new/refactored code associated with my titanic struggle
to land Unicode-rich PDF bookmarks (and device extension command
arguments generally).

At the point `valid_unicode_code_sequence()` is called, the caller
_knows_ -- or is supposed to know -- that it is expecting a Unicode
code sequence.  Possibly even a composite one.

groff_char(7):

  Unicode code points can be composed as well; when they are, GNU troff
  requires NFD (Normalization Form D), where all Unicode glyphs are
  maximally decomposed.
  (Exception: precomposed characters in the Latin‐1 supplement
  described above are also accepted.  Do not count on this exception
  remaining in a future GNU troff that accepts UTF‐8 input directly.)
  Thus, GNU troff accepts "caf\['e]", "caf\[e aa]", and
  "caf\[u0065_0301]" as ways to input "café".  (Due to its legacy
  8‐bit encoding compatibility, at present it also accepts
  "caf\[u00E9]" on ISO Latin‐1 systems.)

  \[ubase‐char[_combining‐component]...] constructs a composite glyph
  from Unicode numeric special character escape sequences.  The code
  points of the base glyph and the combining components are each
  expressed in hexadecimal, with an underscore (_) separating each
  component.  Thus, \[u006E_0303] produces "ñ".

The problem, I suspect, is that I neglected to add logic to the
character definition request handlers.  They should call that function
only if they truly require a valid Unicode code point sequence, and,
of course, they don't.  You can `.char \[unhappy] :-(` all day long as
far as they're concerned.

The action item for this ticket is to check all of
`valid_unicode_code_sequence()`'s call sites, and ensure that all are
necessary and/or guarded by appropriate conditionals.

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66675>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/