Follow-up Comment #26, bug #66919 (group groff): Okay, I told you to hold your horses in comment #21, but like me, you have a loose grip on the reins when galloping through Esoteric Argument Pass.
[comment #23 comment #23:]
> Now we get to where my conceptual groundwork of comment #20 starts
> interacting with concrete examples.
>
> [comment #19 comment #19:]
>> So I would say that this:
>>
>> $ printf '.hcode A A\n.pchar aA\n' | groff 2>&1 | grep -E '(char|hyph)'
>> [snip]
>>
>> ...illustrates neither "generation" of a (presumptively new) hyphenation
>> code
>
> No character previously had a hyphenation code of 65, so groff has,
> conceptually, generated one.  Your input above tells groff, "I want the
> character 'A' to be considered a potential hyphenation point, and to be
> considered not equivalent to any existing characters with hyphenation
> codes."

No, that's not what that means.  If I have this sequence:

.hcode A A
.hcode B A
.hcode A a
.hcode A A

...then my second invocation of `hcode A A` *does* in fact make its
hyphenation code equivalent to that of an existing character: "B".  This
possibility is one reason I indulged the seemingly tedious digression into
surjective relations.

We're not dealing with pointers, references, or other forms of indirection
here, as popular as they are in programming.  What we have is a translation
table from character codes (in the case of ordinary characters) to
hyphenation codes.  `tr` works similarly.  When we update the map of "from"
to "to" in a pairing, the request pays no heed to the existing mapping of
"from".  If I have:

.tr aZbZcZ

...no _*roff_ forgets how to distinguish "a", "b", and "c" from each other.
It knows, and we can therefore undo the equivalence relation.

.tr aabbcc

Other operations, like request deletion and special-built-in-register
deletion, _do_ work ultimately with references of a kind, and are therefore
irreversible.  When the only reference to a thing is discarded, it is gone
forever.  (Until a new GNU _troff_ process is constructed afresh.)

> This matches the typical user's (i.e., one who isn't peering into the
> implementation) understanding of what it means to generate a new
> hyphenation code.
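The translation-table point can be illustrated outside roff.  Here is a
minimal Python sketch (the dict and the `tr` helper are illustrative
inventions, not groff internals): because the table is keyed by the distinct
source characters, updating a pairing never destroys information, and a
many-to-one equivalence can always be undone.

```python
# Hypothetical model of a tr-style translation table (NOT groff's code):
# each source character maps to one destination; many sources may share
# a destination, but the sources themselves remain distinct keys.

table = {}  # character -> translation (identity assumed when absent)

def tr(pairs):
    """Apply .tr-style pairs: 'aZbZcZ' maps a->Z, b->Z, c->Z."""
    for src, dst in zip(pairs[::2], pairs[1::2]):
        table[src] = dst  # overwrites any prior mapping of src

tr("aZbZcZ")
assert [table[c] for c in "abc"] == ["Z", "Z", "Z"]

# The table still distinguishes "a", "b", and "c" as keys, so the
# equivalence relation is reversible:
tr("aabbcc")
assert [table[c] for c in "abc"] == ["a", "b", "c"]
```

A reference-based operation (like request deletion) has no such key to fall
back on, which is why it is irreversible in a way this table is not.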
> Your own words, in fact, betray your view being colored by knowledge of
> formatter internals:
>
>> It _was_ 97, and we knocked it back to 65.
>
> It was, in no sense meaningful to a user, knocked _back_ to anything, as
> it never had 65 as a hyphenation code.  You happen to know James Clark's
> initialization algorithm, but even that is arbitrary: Clark could just as
> easily have initialized "A" and "a" to the same value without going
> through his set-it-then-change-it routine, at which point the "back" in
> your sentence becomes an actual and not merely conceptual misstatement.

That's true, but he didn't.  And I plead guilty to coloration by knowledge
of GNU _troff_ internals.  It's been a lot of effort to acquire what I
have, which isn't enough to suit me.

>> I say that, for English, you cannot assume that `\[o~]` is going to
>> behave just like `õ`.  The latter is _not defined in the English
>> alphabet_, so you can't rely on its hyphenation code having any
>> particular value.
>
> That's a fair statement.  But even though I'm running groff with its
> default startup (English) files, the behavior I'm talking about in this
> ticket is in the formatter, not in any startup files.  What I'm talking
> about has nothing to do with the input _language_ and everything to do
> with input _encoding_.  (You'll notice that I'm not providing any sample
> input with any English words.  The two words I've used, lanteronial and
> lanterõnial--and the latter only to work around the lack of .pchar in
> older groffs--aren't part of any language that I'm aware of.  So I'm
> talking about general formatter behavior, independent of any language
> setting.)

Now that I believe I've root-caused this, I can pretty much agree with
that.

> In all other respects, groff treats \[o~] and the Latin-1 õ as the same
> character.

That, I think, is too bold a claim.  The only reason _groff_ does *that*,
ever, is because "latin1.tmac" sets it up.
https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/latin1.tmac?h=1.23.0#n96

This hasn't changed in HEAD.  And of course in other encodings, "char245"
is the same as something else.

https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/latin2.tmac?h=1.23.0#n224

> If you want to make .hcode treat them as different, the documentation
> should clearly highlight this difference.

I won't rule out that we can improve our documentation of hyphenation
codes, but we should reach a meeting of the minds on the topic of this
ticket first.

> But I suspect that, once you start trying to explain to readers why
> ".hcode \[o~] õ" and ".hcode õ õ" behave differently in Latin-1 input,
> you might start to think that's not actually such a wise thing for groff
> to do.

I don't agree, because whether they _should_ be equivalent depends on
whether you're writing in English.  Although, if you are, even _groff_
HEAD treats them the same for hyphenation purposes: they both get a big
fat zero.

>> But one thing's for sure, in the KOI8-R locale:
>>
>> .hcode \[~o] õ
>>
>> ...is **not** reflexive, because in that locale it really looks like
>> this:
>>
>> .hcode \[~o] Т \" CYRILLIC CAPITAL LETTER TE
>
> Right, the meaning of any character with the 8th bit set depends on the
> input encoding.  That's true whether we're talking about .hcode or any
> other part of groff input.  So this point is not really relevant to
> .hcode itself.

But it is!  From _groff_ 1.23.0 back to the Dawn of Man, employing `hcode`
with an ordinary character having its 8th bit set as the source could
create a "new" hyphenation code!

> My examples are all Latin-1 input, which I've tried to be clear about (in
> the comment #0 example, by running "file" on the input; in the comment #4
> one, by stating the encoding before presenting the input file).  If groff
> offered a special character for CYRILLIC CAPITAL LETTER TE, I'm certain
> that using that with the above .hcode example in KOI8-R encoding would
> reveal the same behavior.
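The encoding dependence is easy to demonstrate outside groff.  A small
Python sketch (illustrative only; groff itself does not decode input this
way) shows that the very same input byte names different characters under
Latin-1 and KOI8-R, which is why an 8-bit ordinary character on the right
side of `hcode` cannot be encoding-neutral:

```python
# Byte 0xF5 is Latin-1 for LATIN SMALL LETTER O WITH TILDE (õ), but the
# same byte denotes a Cyrillic capital letter under KOI8-R.
b = bytes([0xF5])

latin1_char = b.decode("latin-1")
koi8r_char = b.decode("koi8_r")

assert latin1_char == "\u00f5"               # õ in Latin-1
assert latin1_char != koi8r_char             # same byte, different character
assert 0x0400 <= ord(koi8r_char) <= 0x04FF   # KOI8-R lands in the Cyrillic block
```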
You've been crystal clear about which encoding you're using.  I confess
I'm being a stickler that Latin-1 doesn't necessarily mean English, and,
per our discussion in bug #66112, there are accented Latin letters that do
not play a role in English orthography and therefore should not be
assigned default hyphenation codes by _groff_.

Moreover, if the English localization file doesn't assign a hyphenation
code to a character, `hcode` should not be able to magic one up out of
thin air--unless the user _explicitly_ asks for reflexive code assignment.
The user can then knock themselves out setting up special character
synonyms--by hand, by loading "rfc1345.tmac", or what-have-you.

>> I claim that, in _groff_, special characters _cannot_, and have never
>> been able to, participate in reflexive hyphenation code assignments.
>
> I crafted the example in comment #4 precisely to show the change between
> older groffs and the current one when using a special character in a
> reflexive hcode.

Here's another clash of terminology we have.  I assert that:

.hcode \[~o] õ

...*is not a reflexive hyphenation code assignment*.  It can't be, because
the meaning of "õ" depends on the character encoding in a way that "\[~o]"
does not.

I agree with you that:

.hcode õ õ

...where õ is _any ordinary character_, is a reflexive hyphenation code
assignment.  But I claim that the source and destination have to be
input-character-identical for that to be the case.

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/