Follow-up Comment #32, bug #66919 (group groff):

At 2025-04-16T17:45:11-0400, Dave wrote:
> Follow-up Comment #31, bug #66919 (group groff):
>
> [comment #29 comment #29:]
>> [comment #27 comment #27:]
>>> Clearly, "can't" overstates this, because it _was_ regarded as one
>>> in past groffs.
>>
>> ...unless you were on an EBCDIC system, or loaded "latin2.tmac".
>>
>>> Additionally, in my example input, at the time that .hcode is run,
>>> the character encoding is known,
>>
>> Not to the formatter.
>
> Both these statements can't be true.  If it _was_ regarded as one in
> past groffs
[snip]
> Determining _which_ one is true is, once again, thwarted by past
> releases lacking a way to query hyphenation codes, and my attempts to
> deduce them through observations of hyphenation behavior have given me
> seemingly contradictory results.

Well, I went to the trouble of preparing a patch to add the erstwhile
`phcode` request to groff 1.23.0 (attached, or at least I'll try, and
hope Savannah's mail gateway gets it into the ticket tracker).

I quickly found out that we didn't set up hyphenation codes for most
"extended ASCII" character codes in 1.23.0, _except_ in the language
localization files!

Go ahead, look for 'em!  (We're gonna see an ocean of Unicode
replacement characters here, because the files of interest are not
encoded in UTF-8, but in a variety of other encodings).

$ git describe
1.23.0
$ grep -a hcode tmac/*
tmac/LOCALIZATION: for the locale are set with the .hcode request.
tmac/X.tmac:. hcode \\$1\\$4
tmac/X.tmac:.hcode \[S ,]s
tmac/X.tmac:.hcode \[s ,]s
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/cs.tmac:.hcode � � � �
tmac/de.tmac:.hcode � � � � � � � � � � � � � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � � � � � � � �
tmac/de.tmac:.hcode � � � � � � � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � � � � � � � � � �
tmac/de.tmac:.hcode � � � � � �
tmac/de.tmac:.hcode � � � � � � � � � � � � � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � � � � � � � �
tmac/de.tmac:.hcode � � � � � � � �
tmac/de.tmac:.hcode � �
tmac/de.tmac:.hcode � � � � � � � � � �
tmac/de.tmac:.hcode � � � � � �
tmac/de.tmac:.hcode � �
tmac/dvi.tmac:. hcode \\$1\\$4
tmac/dvi.tmac:.hcode \[,C]c
tmac/dvi.tmac:.hcode \[,c]c
tmac/dvi.tmac:.hcode \[S ,]s
tmac/dvi.tmac:.hcode \[s ,]s
tmac/ec.tmac:.\" hcode values are not handled.
tmac/fallbacks.tmac:.\"fchar \[u2011] -\" non-breaking hyphen (won't break w/o .hcode or \:)
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/fr.tmac:.hcode � � � �
tmac/lbp.tmac:. hcode \\$1\\$4
tmac/lbp.tmac:.hcode \[S ,]s
tmac/lbp.tmac:.hcode \[s ,]s
grep: tmac/mdoc: Is a directory
tmac/ps.tmac:. hcode \\$1\\$4
tmac/ps.tmac:.hcode \[S ,]s
tmac/ps.tmac:.hcode \[s ,]s
tmac/psold.tmac:.ie '\\$3'\(.i' .hcode \\$1i
tmac/psold.tmac:.el .hcode \\$1\\$3
tmac/sv.tmac:.hcode � � � �
tmac/sv.tmac:.hcode � � � �
tmac/sv.tmac:.hcode � � � �
tmac/sv.tmac:.hcode � � � �
grep: tmac/tests: Is a directory

> in latin1 but not in latin2, then the formatter did somehow take the
> input encoding into account.  Conversely, if the formatter knew
> nothing about the encoding, then past groffs must have treated the
> request the same regardless of whether latin1 or latin2 was in effect.

I can't easily simulate a CCSID 1047 environment (EBCDIC), but I can
look at English, the default language, which uses Latin-1 (and has a õ
character), and Czech, which uses Latin-2 (and lacks a õ character).

Here are the results.

$ printf '.phcode ab\\[:e]\\[~o]\n' | ./build/test-groff
a 97
b 98
\[:e] 0
\[~o] 0
$ printf '.phcode ab\\[:e]\\[~o]\n' | ./build/test-groff -men
a 97
b 98
\[:e] 0
\[~o] 0
$ printf '.phcode ab\\[:e]\\[~o]\n' | ./build/test-groff -mcs
a 97
b 98
\[:e] 0
\[~o] 0

> I was trying to reconcile these when my attention got diverted to
> other (non-groff) matters, where it may remain a while yet.  So this
> question remains unresolved unless you care to tackle it before I'm
> able to return to it.
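
As an aside on the two encodings compared above, here is a minimal
Python sketch.  It is not part of groff or of the attached patch, and
the helper names `encodable` and `name_of_byte` are made up for
illustration.  Using only Python's standard codecs, it checks whether õ
exists in Latin-1 and in Latin-2, and what the byte 0xF5 (octal 365,
decimal 245) stands for in each.  It says nothing about what the
formatter itself does with that byte.

```python
# Sketch: how ISO 8859-1 (Latin-1) and ISO 8859-2 (Latin-2) treat the
# character and the byte under discussion.  Standard library only; the
# function names are hypothetical, chosen for this illustration.
import unicodedata

O_TILDE = "\u00f5"  # LATIN SMALL LETTER O WITH TILDE, groff's \[~o]


def encodable(char: str, codec: str) -> bool:
    """Return True if `char` can be represented in the named codec."""
    try:
        char.encode(codec)
    except UnicodeEncodeError:
        return False
    return True


def name_of_byte(value: int, codec: str) -> str:
    """Return the Unicode name of the character a byte maps to in `codec`."""
    return unicodedata.name(bytes([value]).decode(codec))


if __name__ == "__main__":
    for codec in ("iso8859_1", "iso8859_2"):
        # 0o365 is the same byte that printf's \365 escape emits.
        print(codec, encodable(O_TILDE, codec), name_of_byte(0o365, codec))
```

Run as a script, this reports that õ is encodable in Latin-1 only, and
that the same byte value names a different letter in each encoding.
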
>
> But the answer would seem to determine what direction this ticket
> should take.  The byte 0xF5 represents LATIN SMALL LETTER O WITH TILDE
> in Latin-1 and LATIN SMALL LETTER O WITH DOUBLE ACUTE in Latin-2.  So,
> in older groffs with Latin-2 loaded, for the input ".hcode \[~o]"
> followed by the byte 0xF5:
> * If the formatter treated this as reflexive, then this was a bug,
>   which goes a long way toward quashing objections to the behavior
>   change.

It treated that input as assigning a hyphenation code to a character
that didn't have one before.  Observe.

$ printf '.phcode \\[o~]\n.hcode \\[o~] \365\n.phcode \\[o~]\n' | ./build/test-groff
\[~o] 0
\[~o] 245
$ printf '.phcode \\[~o]\n.hcode \\[~o] \365\n.phcode \\[~o]\n' | ./build/test-groff -men
\[~o] 0
\[~o] 245
$ printf '.phcode \\[~o]\n.hcode \\[~o] \365\n.phcode \\[~o]\n' | ./build/test-groff -mcs
\[~o] 0
\[~o] 245

> * If the formatter didn't treat this as reflexive, then it somehow
>   knew the encoding, undermining much of the justification for the
>   behavior change.

Again I have lost track of your definition of reflexivity here, so I
remain uncertain as to what you think the above illustrates.

Once reminded where we _weren't_ setting up hyphenation codes as
recently as the 1.23.0 release, I saw nothing in the foregoing that
surprised me.

(file #57163)

    _______________________________________________________

Additional Item Attachment:

File name: phcode-for-groff-1.23.0.diff    Size: 2KiB
<https://file.savannah.gnu.org/file/phcode-for-groff-1.23.0.diff?file_id=57163>

    AGPL NOTICE

These attachments are served by Savane.  You can download the
corresponding source code of Savane at
https://savannah.gnu.org/source/savane-962f8dd2e65d30409210f66560945e9bbc413549.tar.gz

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/