Follow-up Comment #19, bug #66919 (group groff):

At 2025-03-22T14:01:14-0400, Dave wrote:
> [comment #16 comment #16:]
>> Follow-up Comment #14, bug #66919 (group groff):
>>> But it seems that this mechanism to clear a code needs to carve
>>> out an exception for what I'm terming "reflexive hcode," right?
>>
>> I think it has one.
>
> Call me old-fashioned, but I prefer to go by how the code behaves
> rather than by what it says.
They're the same thing--given sufficient understanding. ;-) (Which
I'm sometimes lacking...)

> And the results of running the hcode_test in the [comment #0 original
> submission] tell us that "reflexive hcode" assigns a new, unique
> hyphenation code (what it has always done) when the first parameter is
> Latin-1, but not when the first parameter is a special character--even
> the special character representing the same character that works in
> its Latin-1 form.
>
>>> Because the purpose of [reflexive hcode] is to generate a new
>>> hyphenation code for a character that may never have had one.
>>
>> ...or reset one to its virgin state after meddling.
>
> Sorry, I'm not understanding how ".hcode x x" can ever reset a
> hyphenation code.

I missed out the correct context, which you helpfully clarified with
brackets.

> It can either do what it has always done, which is to generate a new,
> unique hyphenation code;

I quibble with the word "generate" here.

Presently, there are 256 possible hyphenation codes, 0-255, because
they are `unsigned char`s. Hyphenation code assignment, reflexive or
not, updates the surjective relation of character codes to hyphenation
codes.

Since "surjective" is heavy math jargon, let me explain. The character
codes and hyphenation codes are both drawn from the exact same set of
possible values, 0-255. Hyphenation code assignment creates a
relation, or map, from one to the other. We therefore say that the
"domain" (of character codes) and "codomain" (of hyphenation codes)
have the same elements. However, while all character codes in the
domain are in use (GNU troff deals with them one way or another, and
has semantics for nearly all), in a typical run of the formatter the
same is _not_ true of the codomain. Most possible values of the
hyphenation code are not in use. Most character codes map to
hyphenation codes of zero.

GNU troff's convention is to map the hyphenation codes of uppercase
letters to their lowercase counterparts. Observe:

$ echo '.pchar aA' | groff 2>&1 | grep -E '(char|hyph)'
character 'a'
hyphenation code: 97
character 'A'
hyphenation code: 97

The _character_ code of 'A' is 65. I can prove it!

$ printf '.hcode A A\n.pchar aA\n' | groff 2>&1 | grep -E '(char|hyph)'
character 'a'
hyphenation code: 97
character 'A'
hyphenation code: 65

Thus, given a hyphenation code value, we don't necessarily know which
_character_ code its character bears. Equivalently, the relation
between character and hyphenation codes is _not_ one-to-one. The
relation is _surjective_. Some people use the term "onto" for this
kind of relation. (I think the codomain has to also be a [not
necessarily proper] subset of the domain for this usage to be
correct.)

> or it can "reset" the hyphenation code of character x to... the value
> it already has.

Here's where I disagree with your model. I'd have to dig deeper into
the startup code to be sure, but _conceptually_, when GNU troff
initializes hyphenation codes, they're a perfect image of the
character code values. It then goes through each of them and "fixes
them up". Most hyphenation codes get set to zero, the uppercase
letters get set to that of their lowercase counterparts, and the
lowercase letters are left as-is. Later, when troffrc is read,
hyphenation codes may (typically do) get updated.
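To make that conceptual model concrete, here is a toy C++ sketch. It
is not groff's actual source: the names `hcode_table` and
`init_hyphenation_codes` are my own, and for brevity it covers only
the ASCII letters. It models a 256-entry `unsigned char` table
initialized as a perfect image of the character codes and then fixed
up as described above.

#include <cstdio>

static unsigned char hcode_table[256];

static void init_hyphenation_codes(void)
{
  // Start with a perfect image of the character code values...
  for (int i = 0; i < 256; i++)
    hcode_table[i] = static_cast<unsigned char>(i);
  // ...then fix them up: anything that isn't a letter drops to zero...
  for (int i = 0; i < 256; i++)
    if (!((i >= 'a' && i <= 'z') || (i >= 'A' && i <= 'Z')))
      hcode_table[i] = 0;
  // ...and each uppercase letter inherits its lowercase counterpart's
  // code, so 'A' (65) ends up mapped to 97, just like 'a'.
  for (int i = 'A'; i <= 'Z'; i++)
    hcode_table[i] = hcode_table[i - 'A' + 'a'];
}

int main(void)
{
  init_hyphenation_codes();
  std::printf("hcode of 'a': %d\n", hcode_table['a']); // 97
  std::printf("hcode of 'A': %d\n", hcode_table['A']); // 97
  // A "reflexive" assignment like ".hcode A A" just writes the
  // character code back into the table, knocking 97 down to 65.
  hcode_table['A'] = 'A';
  std::printf("hcode of 'A': %d\n", hcode_table['A']); // 65
  return 0;
}

It prints 97, 97, 65, matching the `pchar` transcripts above: a
"reflexive" assignment is just another write into a many-to-one
table, not the generation of a brand-new code.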
So I would say that this:

$ printf '.hcode A A\n.pchar aA\n' | groff 2>&1 | grep -E '(char|hyph)'
character 'a'
hyphenation code: 97
character 'A'
hyphenation code: 65

...illustrates neither "generation" of a (presumptively new)
hyphenation code _nor_ "reset" of "A"'s hyphenation code to "the value
it already has". It _was_ 97, and we knocked it back to 65.

It would be fair to complain that I might have infected your
imagination with the notion of generated hyphenation codes, because I
did so to myself in bug #66051. "We have a lot more integers available
to us than just 0-255," I thought. "Wouldn't it be neat if we could
generate a new one for any special character that wanted one, starting
at 256, say?"

That idea and plan hit the wall, hard. See comment #7 of bug #66051.
We are confined to the `unsigned char` straitjacket for now; a tiny
sketch further below shows the wall we hit.

> OK, but punting the question implies restoring the .hcode behavior to
> its 1.23 state.

Not if we can argue that the aspect of the behavior you observe is
undefined or otherwise undocumented.

> That would be the more conservative approach,

In a sense. But generally in language development, if you break
someone's reliance on undefined behavior, you don't owe them an
apology and possibly not even notice. (In this case, I don't object
to "notice" unless it makes the "NEWS" file "too long"--an assessment
I personally am unlikely to make.)

> but it's overkill for resolving this ticket, which is concerned only
> with the combination of reflexive hcode, and the first character being
> a special character.

I concur.

> I'd change the Summary to specify this ticket is limited only to
> reflexive hcode calls if that were an accepted term rather than one I
> invented in the course of talking about this issue.

I have trouble interpreting

.hcode \[~o] õ

as "reflexive" because in the part of GNU troff we're talking about,
the source and destination characters are distinguishable. We can
observe that fact by inspecting how `pchar` talks about them, even
ignoring the reported hyphenation code values.

I concede that other aspects of GNU troff's input processing, namely
the canned set of special character definitions that completely covers
the ISO Latin-1 character set, make the issue muddier.

Stepping outside of ISO Latin-1 might clarify.

.hcode \[b=] б

Absent knowledge that English speakers find specialized, you might not
assume that the foregoing is a "reflexive" use of `hcode`. In fact GNU
troff can, for 1.24.0, just about get away with this, thanks to the
new "tmac/koi8-r.tmac" file.

https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/koi8-r.tmac

But someone unconcerned with Cyrillic, and not using it in their
document, might want to define a special character named `b=` and to
feel free either to assume that its hyphenation code will default to
zero, or that they can assign it at liberty. Maybe they want this:

.char \[b=] \o'b='\" I want a special B for names in my fantasy novel...
.hcode \[b=] b\" ...but it's still a kind of "b".

...are they wrong?

That's why I say that, for English, you cannot assume that `\[~o]` is
going to behave just like `õ`. The latter is _not defined in the
English alphabet_, so you can't rely on its hyphenation code having
any particular value.
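About that straitjacket: a few lines of standalone C++ (again a toy of
my own, not groff code, with invented variable names) show what
happens, on the usual platforms where `char` is 8 bits wide, when you
try to store a brand-new code of 256.

#include <cstdio>

int main(void)
{
  // Hypothetical scheme from bug #66051: hand out fresh hyphenation
  // codes starting at 256 for special characters.
  int shiny_new_code = 256;
  // But hyphenation codes live in unsigned chars, so the assignment
  // reduces the value modulo 256 on an 8-bit-char system.
  unsigned char hyphenation_code = shiny_new_code;
  std::printf("stored hyphenation code: %d\n", hyphenation_code); // 0
  return 0;
}

The value wraps to zero on assignment, so "new" codes above 255 simply
cannot be represented in the current storage type.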
Running _groff_ in a Latin-1 terminal, I get this:

$ groff
.pchar õ
character code 245 (U+00F5)
is translated
does not have a macro
special translation: 0
hyphenation code: 0
flags: 0
ASCII code: 245
asciify code: 0
is found
is transparently translatable
is not translatable as input
mode: normal

Running _groff_ in a KOI8-R terminal, I get this...

(...er, uh, one trip through "dpkg-reconfigure locales" later...)

I can't even input the character, because õ doesn't exist in that
character encoding. But I can stab at it anyway.

$ printf '.pchar \365\n' | groff
character code 245 (U+00F5)
is translated
does not have a macro
special translation: 0
hyphenation code: 0
flags: 0
ASCII code: 245
asciify code: 0
is found
is transparently translatable
is not translatable as input
mode: normal

"They're the same picture." -- Pam from _The Office_ (U.S.)

Yes. And no. But one thing's for sure, in the KOI8-R locale:

.hcode \[~o] õ

...is **not** reflexive, because in that locale it really looks like
this:

.hcode \[~o] У \" CYRILLIC CAPITAL LETTER U

>> The question I have for you right now is whether _groff_ master is
>> working as you expect and desire specifically for Latin-1 characters
>> whose hyphenation codes we configure in "en.tmac",
>
> Yes. I have no quarrel with this, and if I did, it would be out of
> scope for this ticket.

Okay.

>> I also want to know whether spelling them as 8-bit ordinary
>> characters or special characters works as you expect.
>
> Ordinary characters, yes. Special characters, in reflexive context,
> no.

I claim that, in _groff_, special characters _cannot_, and have never
been able to, participate in reflexive hyphenation code assignments.

Have I persuaded you?

_______________________________________________________
Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/