[bug #66919] [troff] behavior change in some .hcode calls when a special character is the first argument

G. Branden Robinson Tue, 22 Apr 2025 20:34:12 -0700

Follow-up Comment #32, bug #66919 (group groff):

At 2025-04-16T17:45:11-0400, Dave wrote:
> Follow-up Comment #31, bug #66919 (group groff):
> 
> [comment #29:]
>> [comment #27:]
>>> Clearly, "can't" overstates this, because it _was_ regarded as one
>>> in past groffs.
>> 
>> ...unless you were on an EBCDIC system, or loaded "latin2.tmac".
>> 
>>> Additionally, in my example input, at the time that .hcode is run,
>>> the character encoding is known,
>> 
>> Not to the formatter.
> 
> Both these statements can't be true.  If it _was_ regarded as one in
> past groffs
[snip]
> Determining _which_ one is true is, once again, thwarted by past
> releases lacking a way to query hyphenation codes, and my attempts to
> deduce them through observations of hyphenation behavior have given me
> seemingly contradictory results.


Well, I went to the trouble of preparing a patch to add the erstwhile
`phcode` request to groff 1.23.0 (attached, or at least I'll try, and
hope Savannah's mail gateway gets it into the ticket tracker).

I quickly found out that we didn't set up hyphenation codes for most
"extended ASCII" character codes in 1.23.0, _except_ in the language
localization files!

Go ahead, look for 'em!  (We're gonna see an ocean of Unicode
replacement characters here, because the files of interest are not
encoded in UTF-8, but in a variety of other encodings.)


$ git describe
1.23.0
$ grep -a hcode tmac/*
tmac/LOCALIZATION:        for the locale are set with the .hcode request.
tmac/X.tmac:.  hcode \\$1\\$4
tmac/X.tmac:.hcode \[S ,]s
tmac/X.tmac:.hcode \[s ,]s
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/cs.tmac:.hcode � �  � �
tmac/de.tmac:.hcode � �  � �  � �  � �  � �  � �  � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � �  � �  � �  � �
tmac/de.tmac:.hcode � �  � �  � �  � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � �  � �  � �  � �  � �
tmac/de.tmac:.hcode � �  � �  � �
tmac/de.tmac:.hcode � �  � �  � �  � �  � �  � �  � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � �  � �  � �  � �
tmac/de.tmac:.hcode � �  � �  � �  � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � �  � �  � �  � �  � �
tmac/de.tmac:.hcode � �  � �  � �
tmac/de.tmac:.hcode � �
tmac/dvi.tmac:.  hcode \\$1\\$4
tmac/dvi.tmac:.hcode \[,C]c
tmac/dvi.tmac:.hcode \[,c]c
tmac/dvi.tmac:.hcode \[S ,]s
tmac/dvi.tmac:.hcode \[s ,]s
tmac/ec.tmac:.\" hcode values are not handled.
tmac/fallbacks.tmac:.\"fchar \[u2011] -\" non-breaking hyphen (won't break w/o .hcode or \:)
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/fr.tmac:.hcode � �  � �
tmac/lbp.tmac:.  hcode \\$1\\$4
tmac/lbp.tmac:.hcode \[S ,]s
tmac/lbp.tmac:.hcode \[s ,]s
grep: tmac/mdoc: Is a directory
tmac/ps.tmac:.  hcode \\$1\\$4
tmac/ps.tmac:.hcode \[S ,]s
tmac/ps.tmac:.hcode \[s ,]s
tmac/psold.tmac:.ie '\\$3'\(.i' .hcode \\$1i
tmac/psold.tmac:.el .hcode \\$1\\$3
tmac/sv.tmac:.hcode � �  � �
tmac/sv.tmac:.hcode � �  � �
tmac/sv.tmac:.hcode � �  � �
tmac/sv.tmac:.hcode � �  � �
grep: tmac/tests: Is a directory
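
The replacement characters above can be avoided by decoding each file
before searching it.  The following is a minimal sketch, not part of
the original report: it builds a small stand-in file containing
Latin-1 bytes (the temporary file and its contents are hypothetical),
then converts it with iconv(1) so that grep prints readable text.  For
the real files one would assume ISO-8859-2 for cs.tmac and ISO-8859-1
for the others; that mapping is an assumption and should be checked
against each file.

```shell
# Create a hypothetical stand-in for a Latin-1 encoded macro file.
tmp=$(mktemp)
printf '.hcode \351 \351  \311 \351\n' > "$tmp"   # bytes 0xE9 and 0xC9

# Decode to UTF-8 first, then search; -a keeps grep in text mode.
decoded=$(iconv -f ISO-8859-1 -t UTF-8 "$tmp" | grep -a hcode)
printf '%s\n' "$decoded"

rm -f "$tmp"
```

Applied to the tree, the same idea would read something like
"iconv -f ISO-8859-2 -t UTF-8 tmac/cs.tmac | grep hcode".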


> in latin1 but not in latin2, then the formatter did somehow take the
> input encoding into account.  Conversely, if the formatter knew
> nothing about the encoding, then past groffs must have treated the
> request the same regardless of whether latin1 or latin2 was in effect.

I can't easily simulate a CCSID 1047 environment (EBCDIC), but I can
look at English, the default language, which uses Latin-1 (and has a õ
character), and Czech, which uses Latin-2 (and lacks a õ character).

Here are the results.


$ printf '.phcode ab\\[:e]\\[~o]\n' | ./build/test-groff
a       97
b       98
\[:e]   0
\[~o]   0
$ printf '.phcode ab\\[:e]\\[~o]\n' | ./build/test-groff -men
a       97
b       98
\[:e]   0
\[~o]   0
$ printf '.phcode ab\\[:e]\\[~o]\n' | ./build/test-groff -mcs
a       97
b       98
\[:e]   0
\[~o]   0


> I was trying to reconcile these when my attention got diverted to
> other (non-groff) matters, where it may remain a while yet.  So this
> question remains unresolved unless you care to tackle it before I'm
> able to return to it.
> 
> But the answer would seem to determine what direction this ticket
> should take.  The byte 0xF5 represents LATIN SMALL LETTER O WITH TILDE
> in Latin-1 and LATIN SMALL LETTER O WITH DOUBLE ACUTE in Latin-2.  So,
> in older groffs with Latin-2 loaded, for the input ".hcode \[~o]"
> followed by the byte 0xF5:
> * If the formatter treated this as reflexive, then this was a bug,
> which goes a long way toward quashing objections to the behavior
> change.

It treated that input as assigning a hyphenation code to a character
that didn't have one before.  Observe.


$ printf '.phcode \\[~o]\n.hcode \\[~o] \365\n.phcode \\[~o]\n' |
./build/test-groff
\[~o]   0
\[~o]   245
$ printf '.phcode \\[~o]\n.hcode \\[~o] \365\n.phcode \\[~o]\n' |
./build/test-groff -men
\[~o]   0
\[~o]   245
$ printf '.phcode \\[~o]\n.hcode \\[~o] \365\n.phcode \\[~o]\n' |
./build/test-groff -mcs
\[~o]   0
\[~o]   245
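
The value 245 reported above matches the numeric value of the input
byte written as \365 (octal) in the printf commands, which is 0xF5.
The sketch below is an illustration only and assumes iconv(1) and
od(1) are available.  It checks that arithmetic and shows how the same
byte decodes under the two encodings discussed in the quoted comment.

```shell
# Octal 365 and hexadecimal F5 are the same number.
dec=$(printf '%d' 0365)
printf 'octal 365 = decimal %s\n' "$dec"

# Decode the single byte 0xF5 under each encoding and dump the
# resulting UTF-8 bytes as hex.
latin1=$(printf '\365' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
latin2=$(printf '\365' | iconv -f ISO-8859-2 -t UTF-8 | od -An -tx1 | tr -d ' \n')

printf 'as Latin-1: %s\n' "$latin1"   # UTF-8 for U+00F5, o with tilde
printf 'as Latin-2: %s\n' "$latin2"   # UTF-8 for U+0151, o with double acute
```

The byte is identical in both cases; only the interpretation applied
by the decoder differs.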


> * If the formatter didn't treat this as reflexive, then it somehow
> knew the encoding, undermining much of the justification for the
> behavior change.

Again I have lost track of your definition of reflexivity here, so I
remain uncertain as to what you think the above illustrates.

Once reminded where we _weren't_ setting up hyphenation codes as
recently as the 1.23.0 release, I saw nothing in the foregoing that
surprised me.


(file #57163)

    _______________________________________________________

Additional Item Attachment:

File name: phcode-for-groff-1.23.0.diff   Size: 2KiB

<https://file.savannah.gnu.org/file/phcode-for-groff-1.23.0.diff?file_id=57163>


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>
