Follow-up Comment #4, bug #66675 (group groff):

At 2025-01-16T21:05:03-0500, Dave wrote:
> Follow-up Comment #2, bug #66675 (group groff):
>
> [comment #1 comment #1:]
>> \[u...] means a Unicode character.
>
> It was permitted to mean other things for the past at least 20 years
> and I bet a lot longer.

I've always wondered why they weren't spelled exactly as Unicode does,
like `\[U+002D]`, but that's water under the bridge.  groff's naming
scheme for Unicode code points indeed has a long history.
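As an illustration of the difference in spelling only (this is a hypothetical helper, not groff source): groff writes a code point as 'u' plus at least four uppercase hex digits, so Unicode's "U+002D" becomes `\[u002D]`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical illustration (not groff code): format a Unicode scalar
// value in groff's special character naming scheme, padding to at
// least four uppercase hex digits.
std::string groff_unicode_name(unsigned int code_point) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "u%04X", code_point);
    return std::string(buf);
}
```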

> $ nroff --version
> GNU nroff (groff) version 1.19.2
> $ printf '.char \[unhappy] :-(\nI feel \[unhappy] today.\n' | nroff | cat -s
> I feel :‐( today.
>
> If this is an intentionally breaking change (a la bug #66673), it
> needs to be documented as such.

Nope, not intentional.  Getting back to the original report...

At 2025-01-16T16:54:20-0500, Dave wrote:
> Date: Thu 16 Jan 2025 03:54:17 PM CST By: Dave <barx>
> From at least groff 1.19.2 through 1.23, this command produced nothing
> on stdout or stderr:
>
> echo '.char \[unhappy] :-(' | groff
>
> This is as it should be: the code merely defines a perfectly
> legitimate character and does nothing with it.

Agreed.

> The latest groff build produces two diagnostics on stderr:
>
> troff:<standard input>:1: error: special character 'unhappy' is
> invalid: Unicode special character sequence has non-hexadecimal digit
> 'n'
>
> troff:<standard input>:1: error: bad character definition

Uh-oh.  Mea culpa.

> This erroneous error was introduced sometime after August 11.  I blame an
> overzealous [http://git.savannah.gnu.org/cgit/groff.git/commit/?id=d29abf70a
> commit d29abf70a].

That's close to, but not exactly, where I'd place the blame.

What you link to is some new/refactored code associated with my titanic
struggle to land Unicode-rich PDF bookmarks (and device extension
command arguments generally).

At the point `valid_unicode_code_sequence()` is called, the caller
_knows_ -- or is supposed to know -- that it is expecting a Unicode code
sequence, possibly even a composite one.

groff_char(7):
       Unicode code points can be composed as well; when they are, GNU
       troff requires NFD (Normalization Form D), where all Unicode
       glyphs are maximally decomposed.  (Exception: precomposed
       characters in the Latin‐1 supplement described above are also
       accepted.  Do not count on this exception remaining in a future
       GNU troff that accepts UTF‐8 input directly.)  Thus, GNU troff
       accepts “caf\['e]”, “caf\[e aa]”, and “caf\[u0065_0301]”, as
       ways to input “café”.  (Due to its legacy 8‐bit encoding
       compatibility, at present it also accepts “caf\[u00E9]” on ISO
       Latin‐1 systems.)

       \[ubase‐char[_combining‐component]...]
              constructs a composite glyph from Unicode numeric special
              character escape sequences.  The code points of the base
              glyph and the combining components are each expressed in
              hexadecimal, with an underscore (_) separating each
              component.  Thus, \[u006E_0303] produces “ñ”.

The problem, I suspect, is that I neglected to add logic to the
character-definition request handlers.  They should call that function
only if they truly require a valid Unicode code point sequence, and, of
course, they don't: you can `.char \[unhappy] :-(` all day long as far
as they're concerned.

The action item for this ticket is to check all of
`valid_unicode_code_sequence()`'s call sites and ensure that each is
necessary and/or guarded by an appropriate conditional.
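A minimal sketch of the kind of guard those call sites presumably need (hypothetical names, not groff's real internals): hand a name to strict Unicode validation only if it even claims to be a Unicode sequence, i.e. a 'u' followed by hex digits, with '_' separating composite components.

```cpp
#include <cctype>
#include <string>

// Hypothetical guard (names invented for illustration): "unhappy"
// fails at 'n' and is treated as an ordinary special character name,
// so it never reaches Unicode-sequence validation at all.
bool looks_like_unicode_sequence(const std::string &name) {
    if (name.size() < 5 || name[0] != 'u')  // at least "uXXXX"
        return false;
    for (std::string::size_type i = 1; i < name.size(); i++) {
        unsigned char c = name[i];
        if (c != '_' && !std::isxdigit(c))
            return false;
    }
    return true;
}
```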



    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66675>
