Follow-up Comment #19, bug #66919 (group groff):

At 2025-03-22T14:01:14-0400, Dave wrote:
> [comment #16 comment #16:]
>> Follow-up Comment #14, bug #66919 (group groff):
>>> But it seems that this mechanism to clear a code needs to carve
>>> out an exception for what I'm terming "reflexive hcode," right?
>>
>> I think it has one.
>
> Call me old-fashioned, but I prefer to go by how the code behaves
> rather than by what it says.
They're the same thing--given sufficient understanding. ;-) (Which
I'm sometimes lacking...)

> And the results of running the hcode_test in the [comment #0 original
> submission] tell us that "reflexive hcode" assigns a new, unique
> hyphenation code (what it has always done) when the first parameter is
> Latin-1, but not when the first parameter is a special character--even
> the special character representing the same character that works in
> its Latin-1 form.
>
>>> Because the purpose of [reflexive hcode] is to generate a new
>>> hyphenation code for a character that may never have had one.
>>
>> ...or reset one to its virgin state after meddling.
>
> Sorry, I'm not understanding how ".hcode x x" can ever reset a
> hyphenation code.

I missed out the correct context, which you helpfully clarified with
brackets.

> It can either do what it has always done, which is to generate a new,
> unique hyphenation code;

I quibble with the word "generate" here.

Presently, there are 256 possible hyphenation codes, 0-255, because
they are `unsigned char`s. Hyphenation code assignment, reflexive or
not, updates the surjective relation of character codes to hyphenation
codes.

Since "surjective" is heavy math jargon, let me explain. The character
codes and hyphenation codes are both drawn from the exact same set of
possible values, 0-255. Hyphenation code assignment creates a
relation, or map, from one to the other. We therefore say that the
"domain" (of character codes) and "codomain" (of hyphenation codes)
have the same elements. However, while all character codes in the
domain are in use (GNU troff deals with them one way or another, and
has semantics for nearly all), in a typical run of the formatter the
same is _not_ true of the codomain. Most possible values of the
hyphenation code are not in use. Most character codes map to
hyphenation codes of zero.

GNU troff's convention is to map the hyphenation codes of uppercase
letters to their lowercase counterparts. Observe:

$ echo '.pchar aA' | groff 2>&1 | grep -E '(char|hyph)'
character 'a'
hyphenation code: 97
character 'A'
hyphenation code: 97

The _character_ code of 'A' is 65. I can prove it!

$ printf '.hcode A A\n.pchar aA\n' | groff 2>&1 | grep -E '(char|hyph)'
character 'a'
hyphenation code: 97
character 'A'
hyphenation code: 65

Thus, given a hyphenation code value, we don't necessarily know which
_character_ code its character bears. Equivalently, the relation
between character and hyphenation codes is _not_ one-to-one. The
relation is _surjective_. Some people use the term "onto" for this
kind of relation. (I think the codomain has to also be a [not
necessarily proper] subset of the domain for this usage to be
correct.)

> or it can "reset" the hyphenation code of character x to... the value
> it already has.

Here's where I disagree with your model. I'd have to dig deeper into
the startup code to be sure, but _conceptually_, when GNU troff
initializes hyphenation codes, they're a perfect image of the
character code values. It then goes through each of them and "fixes
them up". Most hyphenation codes get set to zero, the uppercase
letters get set to that of their lowercase counterparts, and the
lowercase letters are left as-is. Later, when troffrc is read,
hyphenation codes may (typically do) get updated.
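To make that conceptual model concrete, here is a toy C++ sketch. It
is not groff's actual source: the names `hcode_table` and
`init_hyphenation_codes` are my own, and for brevity it covers only
the ASCII letters. It models a 256-entry `unsigned char` table
initialized as a perfect image of the character codes and then fixed
up as described above.

#include <cstdio>

static unsigned char hcode_table[256];

static void init_hyphenation_codes(void)
{
  // Start with a perfect image of the character code values...
  for (int i = 0; i < 256; i++)
    hcode_table[i] = static_cast<unsigned char>(i);
  // ...then fix them up: anything that isn't a letter drops to zero...
  for (int i = 0; i < 256; i++)
    if (!((i >= 'a' && i <= 'z') || (i >= 'A' && i <= 'Z')))
      hcode_table[i] = 0;
  // ...and each uppercase letter inherits its lowercase counterpart's
  // code, so 'A' (65) ends up mapped to 97, just like 'a'.
  for (int i = 'A'; i <= 'Z'; i++)
    hcode_table[i] = hcode_table[i - 'A' + 'a'];
}

int main(void)
{
  init_hyphenation_codes();
  std::printf("hcode of 'a': %d\n", hcode_table['a']); // 97
  std::printf("hcode of 'A': %d\n", hcode_table['A']); // 97
  // A "reflexive" assignment like ".hcode A A" just writes the
  // character code back into the table, knocking 97 down to 65.
  hcode_table['A'] = 'A';
  std::printf("hcode of 'A': %d\n", hcode_table['A']); // 65
  return 0;
}

It prints 97, 97, 65, matching the `pchar` transcripts above: a
"reflexive" assignment is just another write into a many-to-one
table, not the generation of a brand-new code.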
So I would say that this:

$ printf '.hcode A A\n.pchar aA\n' | groff 2>&1 | grep -E '(char|hyph)'
character 'a'
hyphenation code: 97
character 'A'
hyphenation code: 65

...illustrates neither "generation" of a (presumptively new)
hyphenation code _nor_ "reset" of "A"'s hyphenation code to "the value
it already has". It _was_ 97, and we knocked it back to 65.

It would be fair to complain that I might have infected your
imagination with the notion of generated hyphenation codes, because I
did so to myself in bug #66051. "We have a lot more integers available
to us than just 0-255," I thought. "Wouldn't it be neat if we could
generate a new one for any special character that wanted one, starting
at 256, say?"

That idea and plan hit the wall, hard. See comment #7 of bug #66051.
We are confined to the `unsigned char` straitjacket for now; a tiny
sketch further below shows the wall we hit.

> OK, but punting the question implies restoring the .hcode behavior to
> its 1.23 state.

Not if we can argue that the aspect of the behavior you observe is
undefined or otherwise undocumented.

> That would be the more conservative approach,

In a sense. But generally in language development, if you break
someone's reliance on undefined behavior, you don't owe them an
apology and possibly not even notice. (In this case, I don't object
to "notice" unless it makes the "NEWS" file "too long"--an assessment
I personally am unlikely to make.)

> but it's overkill for resolving this ticket, which is concerned only
> with the combination of reflexive hcode, and the first character being
> a special character.

I concur.

> I'd change the Summary to specify this ticket is limited only to
> reflexive hcode calls if that were an accepted term rather than one I
> invented in the course of talking about this issue.

I have trouble interpreting

.hcode \[~o] õ

as "reflexive" because in the part of GNU troff we're talking about,
the source and destination characters are distinguishable. We can
observe that fact by inspecting how `pchar` talks about them, even
ignoring the reported hyphenation code values.

I concede that other aspects of GNU troff's input processing, namely
the canned set of special character definitions that completely covers
the ISO Latin-1 character set, make the issue muddier.

Stepping outside of ISO Latin-1 might clarify.

.hcode \[b=] б

Absent knowledge that English speakers find specialized, you might not
assume that the foregoing is a "reflexive" use of `hcode`. In fact GNU
troff can, for 1.24.0, just about get away with this, thanks to the
new "tmac/koi8-r.tmac" file.

https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/koi8-r.tmac

But someone unconcerned with Cyrillic, and not using it in their
document, might want to define a special character named `b=` and to
feel free either to assume that its hyphenation code will default to
zero, or that they can assign it at liberty. Maybe they want this:

.char \[b=] \o'b='\" I want a special B for names in my fantasy novel...
.hcode \[b=] b\" ...but it's still a kind of "b".

...are they wrong?

That's why I say that, for English, you cannot assume that `\[~o]` is
going to behave just like `õ`. The latter is _not defined in the
English alphabet_, so you can't rely on its hyphenation code having
any particular value.
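About that straitjacket: a few lines of standalone C++ (again a toy of
my own, not groff code, with invented variable names) show what
happens, on the usual platforms where `char` is 8 bits wide, when you
try to store a brand-new code of 256.

#include <cstdio>

int main(void)
{
  // Hypothetical scheme from bug #66051: hand out fresh hyphenation
  // codes starting at 256 for special characters.
  int shiny_new_code = 256;
  // But hyphenation codes live in unsigned chars, so the assignment
  // reduces the value modulo 256 on an 8-bit-char system.
  unsigned char hyphenation_code = shiny_new_code;
  std::printf("stored hyphenation code: %d\n", hyphenation_code); // 0
  return 0;
}

The value wraps to zero on assignment, so "new" codes above 255 simply
cannot be represented in the current storage type.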
Running _groff_ in a Latin-1 terminal, I get this:

$ groff
.pchar õ
character code 245 (U+00F5)
is translated
does not have a macro
special translation: 0
hyphenation code: 0
flags: 0
ASCII code: 245
asciify code: 0
is found
is transparently translatable
is not translatable as input
mode: normal

Running _groff_ in a KOI8-R terminal, I get this...

(...er, uh, one trip through "dpkg-reconfigure locales" later...)

I can't even input the character, because õ doesn't exist in that
character encoding. But I can stab at it anyway.

$ printf '.pchar \365\n' | groff
character code 245 (U+00F5)
is translated
does not have a macro
special translation: 0
hyphenation code: 0
flags: 0
ASCII code: 245
asciify code: 0
is found
is transparently translatable
is not translatable as input
mode: normal

"They're the same picture." -- Pam from _The Office_ (U.S.)

Yes. And no. But one thing's for sure, in the KOI8-R locale:

.hcode \[~o] õ

...is **not** reflexive, because in that locale it really looks like
this:

.hcode \[~o] У \" CYRILLIC CAPITAL LETTER U

>> The question I have for you right now is whether _groff_ master is
>> working as you expect and desire specifically for Latin-1 characters
>> whose hyphenation codes we configure in "en.tmac",
>
> Yes. I have no quarrel with this, and if I did, it would be out of
> scope for this ticket.

Okay.

>> I also want to know whether spelling them as 8-bit ordinary
>> characters or special characters works as you expect.
>
> Ordinary characters, yes. Special characters, in reflexive context,
> no.

I claim that, in _groff_, special characters _cannot_, and have never
been able to, participate in reflexive hyphenation code assignments.

Have I persuaded you?

_______________________________________________________
Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/