[bug #66919] [troff] behavior change in some .hcode calls when a special character is the first argument

G. Branden Robinson Wed, 26 Mar 2025 02:23:13 -0700

Update of bug #66919 (group groff):

                  Status:                    None => Need Info
             Assigned to:                    None => barx


    _______________________________________________________

Follow-up Comment #24:

[comment #23:]
> Now we get to where my conceptual groundwork of comment #20 starts
> interacting with concrete examples.

Good, 'cause I have some too!  :D

First, let me start with this illustration.


$ printf '.ll 3n\ndomain\n' | groff -a -Wbreak
<beginning of page>
do<hy>
main


GNU _troff_ will break at the same point any word that has a letter equivalent
to "o" in that position.


$ printf '.ll 3n\nd\[`o]main\n' | groff -a -Wbreak
<beginning of page>
d<`o><hy>
main


Recalling from our discussion in bug #66112, and my selection of your first
suggestion over your second, o-with-tilde-accent is _not_ equivalent to "o" in
English, so it shouldn't break...


$ printf '.ll 3n\nd\[~o]main\n' | groff -a -Wbreak
<beginning of page>
d<~o>main


...and indeed it doesn't.

That established...

> That's a fair statement.  But even though I'm running groff with its default
> startup (English) files, the behavior I'm talking about in this ticket is in
> the formatter, not in any startup files.  What I'm talking about has nothing
> to do with the input _language_ and everything to do with input _encoding_.

I agree!

> (You'll notice that I'm not providing any sample input with any English
> words.  The two words I've used, lanteronial, and lanterõnial--and then only
> to work around the lack of .pchar in older groffs--aren't part of any
> language that I'm aware of.  So I'm talking about general formatter behavior,
> independent of any language setting.)

But you're not talking about _general_ formatter behavior, you're talking
about formatter behavior **after the "latin1.tmac" file is loaded**.

Observe.


$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main


Let's try it with the "raw" o with tilde accent character, Latin-1 245 decimal
(365 octal).


$ printf '.ll 3n\nd\365main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-HEAD/bin/troff:<standard input>:2: warning: character with input code 245 not defined
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-1.23.0/bin/troff:<standard input>:2: warning: character with input code 245 not defined
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-1.22.4/bin/troff: <standard input>:2: warning: can't find character with input code 245
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
<standard input>:2: warning: can't find character with input code 245
dmain


This makes sense, because in all released versions of _groff_, the formatter
doesn't yet know, before loading startup files, whether it's going to be
operating in a Latin-1 or EBCDIC (code page 1047) environment.  (Well,
technically it *can* know just by checking the character code of, say, "a",
but it stays as agnostic as it can and lets macro files do most of the
lifting.)
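
As a rough illustration of that parenthetical (and emphatically not groff's
actual code), a check on the character code of "a" could look like the sketch
below.  The names `Charset`, `classify_by_letter_a`, and `host_charset` are
made up for this example; the only facts it relies on are that "a" is 0x61 in
ASCII and Latin-1 and 0x81 in EBCDIC code page 1047.

```cpp
#include <cassert>

// Hypothetical names for illustration; nothing here is taken from groff.
enum class Charset { AsciiFamily, Ebcdic, Unknown };

// Classify a character set from the code it assigns to the letter 'a':
// 0x61 in ASCII and its supersets (Latin-1 included), 0x81 in EBCDIC
// code page 1047.
constexpr Charset classify_by_letter_a(unsigned char code_of_a)
{
  return code_of_a == 0x61 ? Charset::AsciiFamily
       : code_of_a == 0x81 ? Charset::Ebcdic
       : Charset::Unknown;
}

// On the build host, the character literal itself carries the answer.
constexpr Charset host_charset()
{
  return classify_by_letter_a(static_cast<unsigned char>('a'));
}

// Small self-check of the two code points named above.
inline void check_known_code_points()
{
  assert(classify_by_letter_a(0x61) == Charset::AsciiFamily);
  assert(classify_by_letter_a(0x81) == Charset::Ebcdic);
}
```

A test like this settles only the basic encoding family; it says nothing about
which characters above 127 are defined, which is the part the macro files
supply.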

Let's macro-load "latin1.tmac" in our examples and see if that changes
anything.


$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main


The character code is now recognized, and translated on input (`trin`) to the
special character `~o`.  But it still doesn't hyphenate.

For completeness, let's see if explicitly specifying the special character
changes behavior.


$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main


Still no.  Finally let's load "en.tmac", which didn't exist prior to 1.23.0.


$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
/home/branden/groff-1.22.4/bin/troff: <standard input>:1: warning: can't find macro file 'en.tmac'
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<standard input>:1: warning: can't find macro file `en.tmac'
<beginning of page>
d<~o>main


So here are a bunch more cases where formatter behavior doesn't change, all
using the same special character you've chosen.


$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/groff -a -Wbreak
<beginning of page>
d<~o>main


Why does "lanteronial" (not an English word) hyphenate differently from
"domain" (definitely an English word)?  To answer that requires a source dive,
which is coming shortly.  But first, I must ask:

So the hyphenation of a non-English word using a letter that doesn't exist in
the English alphabet has changed from _groff_ 1.23.0 to (what will become)
1.24.0.  Is it fair to call that a regression?

I think you've identified a relatively dusty crevice in a corner case, and
that it arises solely due to a presumption that was being made in
`set_hyphenation_codes()` for many years.

So why did commit a52141ac46eef95dd1f85e4c2e0a336affa9bcc9 change things?

Let's look at the diff again.


diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
index cc7d9dd71..946b93570 100644
--- a/src/roff/troff/input.cpp
+++ b/src/roff/troff/input.cpp
@@ -7309,25 +7309,26 @@ static void set_hyphenation_codes()
       error("cannot use the hyphenation code of a numeral");
       break;
     }
-    unsigned char new_code = 0; // TODO: int
+    unsigned char new_code = 0;
     charinfo *cisrc = tok.get_char();
-    if (csrc != 0)
-      new_code = csrc;
-    else {
+    if (cisrc != 0 /* nullptr */)
+      // Common case: assign destination character the hyphenation code
+      // of the source character.
+      new_code = cisrc->get_hyphenation_code();
+    if (0 == csrc) {
       if (0 /* nullptr */ == cisrc) {
        error("expected ordinary or special character, got %1",
              tok.description());
        break;
       }
-      // source character is special
-      if (0 == cisrc->get_hyphenation_code()) {
-       error("second member of hyphenation code pair must be an"
-             " ordinary character, or a special character already"
-             " assigned a hyphenation code");
-       break;
-      }
       new_code = cisrc->get_hyphenation_code();
     }
+    else {
+      // If assigning a ordinary character's hyphenation code to itself,
+      // use its character code point as the value.
+      if (csrc == cdst)
+       new_code = tok.ch();
+    }
     cidst->set_hyphenation_code(new_code);
     if (cidst->get_translation()
        && cidst->get_translation()->get_translation_input())


...and at your test case (the UTF-8 version for readability in Savannah,
**not** bug-reproducibility).


$ cat EXPERIMENTS/lanteronial-utf8.groff 
.ll 1n
lanteronial
lanter\[~o]nial
.hcode \[~o] õ
lanter\[~o]nial


You've only got the one `hcode` invocation, so that's good.

What was its path through the old code?

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=89623d044a207c1321bdf106d5f8d5d9e59b7ca1#n7278

Well, we have a bunch of validity checking/error handling first.

Eventually, if we've got two (mostly) valid arguments, we end up on line
7312.

If `csrc` is not zero, the source character is "ordinary".  (If it _is_ zero,
it could be anything, like a horizontal motion escape sequence.  But in valid
cases, if it's zero it's a special or indexed character.)  And so that branch
should be taken for the "lanteronial" file.  `new_code` becomes its value
(7315) and we skip to 7331, where the `charinfo` of the destination character
is set to that value.

We then worry about whether the destination character is "translated" (which I
**think** refers to `tr` translation but I haven't ruled out `trin` or `trnt`
translations instead, because it seems that no good item of terminology should
be permitted to apply to only one concept in a program).  If it is, that new
code is immediately superseded by that of its translation (7334).

Then the function ends.
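
Stripped of error reporting and of the translation step, the old path can be
modeled roughly as below.  This is a simplified sketch, not the code in
"input.cpp": `CharInfo`, `old_hcode`, and `lanteronial_case_old` are
hypothetical stand-ins, and the real function works on tokens inside a loop
rather than on plain arguments.

```cpp
#include <cassert>

// Simplified stand-in for troff's per-character record; only the one
// field this discussion needs.  All names here are illustrative.
struct CharInfo {
  unsigned char hyphenation_code = 0;
};

// Model of the logic before commit a52141ac.  `csrc` is the source
// argument's character code, or 0 if it is not an ordinary character;
// `cisrc` is its record, or nullptr if it is not a character at all.
// Returns false where the real code reports an error and gives up.
bool old_hcode(CharInfo &dst, unsigned char csrc, const CharInfo *cisrc)
{
  unsigned char new_code = 0;
  if (csrc != 0)
    new_code = csrc;  // presumptive: code point doubles as hyphenation code
  else {
    if (cisrc == nullptr)
      return false;   // neither an ordinary nor a special character
    if (cisrc->hyphenation_code == 0)
      return false;   // special character with no hyphenation code yet
    new_code = cisrc->hyphenation_code;
  }
  dst.hyphenation_code = new_code;
  return true;
}

// The ticket's case: hcode with \[~o] as destination and the ordinary
// Latin-1 character 245 (o-tilde) as source.
inline void lanteronial_case_old()
{
  CharInfo tilde_o;      // destination: the special character
  CharInfo raw_o_tilde;  // source: ordinary character, hcode still 0
  assert(old_hcode(tilde_o, 245, &raw_o_tilde));
  assert(tilde_o.hyphenation_code == 245);
}
```

In this model the destination ends up with code 245 even though the source
character's own hyphenation code was never set.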

Okay, what about _after_ the "bad commit"?

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=a52141ac46eef95dd1f85e4c2e0a336affa9bcc9#n7278

We start off again at line 7312.

We don't make a decision about `csrc` right away.  Instead we gather the
source character's hyphenation code immediately, if it has one (7314-7317),
then if the source character is special, we proceed as before (7319-7324).
But in this case, the source character is ordinary, so we check to see if the
character is being assigned to itself, and if so apply this "reflexive case"
(7329-7330).  But we won't take that branch either because the test on line
7329 will fail: `csrc` is 245 decimal, but `cdst` is 0 because it's a special
character.  We then hit line 7332 where we assign `new_code` to `cidst`.  But
remember line 7317.  `cisrc`'s hyphenation code would be zero, because
that's the value it has when the formatter starts up ("troff -R"), and neither
"en.tmac" nor "latin1.tmac" ever assigned it a hyphenation code.

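The post-commit path can be modeled in the same simplified way.  Again, this
is a sketch under stated assumptions rather than the code in "input.cpp":
`CharInfo`, `new_hcode`, and `lanteronial_case_new` are hypothetical names,
error reporting and the translation step are left out, and `csrc` stands in
for the `tok.ch()` call, which I assume yields the same value in the
reflexive case.

```cpp
#include <cassert>

// Simplified stand-in for troff's per-character record; only the one
// field this discussion needs.  All names here are illustrative.
struct CharInfo {
  unsigned char hyphenation_code = 0;
};

// Model of the logic after commit a52141ac.  `cdst` and `csrc` are the
// destination and source character codes (0 if not ordinary); `cisrc`
// is the source's record, or nullptr if it is not a character at all.
// Returns false where the real code reports an error and gives up.
bool new_hcode(CharInfo &dst, unsigned char cdst, unsigned char csrc,
               const CharInfo *cisrc)
{
  unsigned char new_code = 0;
  if (cisrc != nullptr)
    new_code = cisrc->hyphenation_code;  // common case: copy source's code
  if (csrc == 0) {
    if (cisrc == nullptr)
      return false;                      // not a character at all
    new_code = cisrc->hyphenation_code;  // source is special
  }
  else if (csrc == cdst)
    new_code = csrc;                     // reflexive case: use code point
  dst.hyphenation_code = new_code;
  return true;
}

// The ticket's case: hcode with \[~o] as destination and the ordinary
// Latin-1 character 245 (o-tilde) as source.
inline void lanteronial_case_new()
{
  CharInfo tilde_o;      // destination: special, so its cdst is 0
  CharInfo raw_o_tilde;  // source: ordinary character, hcode still 0
  assert(new_hcode(tilde_o, 0, 245, &raw_o_tilde));
  assert(tilde_o.hyphenation_code == 0);
}
```

Here the destination keeps a hyphenation code of 0, because the source never
had one to copy and the reflexive test (245 against 0) fails.
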
The bottom line is that there _is_ a logic change.  Before "bad commit",
`new_code` got populated presumptively with the character code of the source
character, **if the character was ordinary**.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=89623d044a207c1321bdf106d5f8d5d9e59b7ca1#n7315

In the new logic, it doesn't.  It didn't occur to me that that assumption was
warranted.  The character code might not be meaningful as a hyphenation code
in the language.  `set_hyphenation_codes()` has, for many years, been
aggressively assuming that it was, if you had the audacity to use an ordinary
character as the source character (second argument) in an `hcode` request.

I'd say the "bad commit" is a bug fix.

So we might retitle this ticket "[troff] behavior change in some .hcode calls
when an ordinary character is the second argument", and you can guess what my
proposed resolution is.

But I want to hear your take.


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/
