Update of bug #66919 (group groff):

                  Status:                    None => Need Info
             Assigned to:                    None => barx

    _______________________________________________________

Follow-up Comment #24:

[comment #23 comment #23:]
> Now we get to where my conceptual groundwork of comment #20 starts
> interacting with concrete examples.

Good, 'cause I have some too!  :D

First let me start with this illustrator.

$ printf '.ll 3n\ndomain\n' | groff -a -Wbreak
<beginning of page>
do<hy>
main

GNU _troff_ will break in the same place any word with a letter
equivalent to "o" in the same place.

$ printf '.ll 3n\nd\[`o]main\n' | groff -a -Wbreak
<beginning of page>
d<`o><hy>
main

Recalling from our discussion in bug #66112, and my selection of your
first suggestion over your second, o-with-tilde-accent is _not_
equivalent to "o" in English, so it shouldn't break...

$ printf '.ll 3n\nd\[~o]main\n' | groff -a -Wbreak
<beginning of page>
d<~o>main

...and indeed it doesn't.

That established...

> That's a fair statement. But even though I'm running groff with its default
> startup (English) files, the behavior I'm talking about in this ticket is in
> the formatter, not in any startup files. What I'm talking about has nothing
> to do with the input _language_ and everything to do with input _encoding_.

I agree!

> (You'll notice that I'm not providing any sample input with any English
> words. The two words I've used, lanteronial, and lanterõnial--and then only
> to work around the lack of .pchar in older groffs--aren't part of any
> language that I'm aware of. So I'm talking about general formatter behavior,
> independent of any language setting.)

But you're not talking about _general_ formatter behavior, you're
talking about formatter behavior **after the "latin1.tmac" file is
loaded**.

Observe.

$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main

Let's try it with the "raw" o with tilde accent character, Latin-1 245
decimal (365 octal).

$ printf '.ll 3n\nd\365main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-HEAD/bin/troff:<standard input>:2: warning: character with input code 245 not defined
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-1.23.0/bin/troff:<standard input>:2: warning: character with input code 245 not defined
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
/home/branden/groff-1.22.4/bin/troff: <standard input>:2: warning: can't find character with input code 245
dmain
$ printf '.ll 3n\nd\365main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
<standard input>:2: warning: can't find character with input code 245
dmain

This makes sense, because in all released versions of _groff_, the
formatter doesn't yet know, before loading startup files, whether it's
going to be operating in a Latin-1 or EBCDIC (code page 1047)
environment.  (Well, technically it *can* know just by checking the
character code of, say, "a", but it stays as agnostic as it can and
lets macro files do most of the lifting.)

Let's macro-load "latin1.tmac" in our examples and see if that changes
anything.

$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\365main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main

The character code is now recognized, and translated on input (`trin`)
to the special character `~o`.  But it still doesn't hyphenate.

For completeness, let's see if explicitly specifying the special
character changes behavior.

$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso latin1.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main

Still no.

Finally let's load "en.tmac", which didn't exist prior to 1.23.0.

$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/troff -Ra -Wbreak
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/troff -Ra -Wbreak
/home/branden/groff-1.22.4/bin/troff: <standard input>:1: warning: can't find macro file 'en.tmac'
<beginning of page>
d<~o>main
$ printf '.mso en.tmac\n.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/troff -Ra -Wbreak
<standard input>:1: warning: can't find macro file `en.tmac'
<beginning of page>
d<~o>main

So here are a bunch more cases where formatter behavior doesn't change,
all using the same special character you've chosen.

$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-HEAD/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.23.0/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.4/bin/groff -a -Wbreak
<beginning of page>
d<~o>main
$ printf '.ll 3n\nd\\[~o]main\n' | ~/groff-1.22.3/bin/groff -a -Wbreak
<beginning of page>
d<~o>main

Why does "lanteronial" (not an English word) hyphenate differently from
"domain" (definitely an English word)?

To answer that requires a source dive, which is coming shortly.  But
first, I must ask:

So the hyphenation of a non-English word using a letter that doesn't
exist in the English alphabet has changed from _groff_ 1.23.0 to (what
will become) 1.24.0.  Is it fair to call that a regression?

I think you've identified a relatively dusty crevice in a corner case,
and that it arises solely due to a presumption that was being made in
`set_hyphenation_codes()` for many years.

So why did commit a52141ac46eef95dd1f85e4c2e0a336affa9bcc9 change
things?  Let's look at the diff again.

diff --git a/src/roff/troff/input.cpp b/src/roff/troff/input.cpp
index cc7d9dd71..946b93570 100644
--- a/src/roff/troff/input.cpp
+++ b/src/roff/troff/input.cpp
@@ -7309,25 +7309,26 @@ static void set_hyphenation_codes()
         error("cannot use the hyphenation code of a numeral");
         break;
       }
-      unsigned char new_code = 0; // TODO: int
+      unsigned char new_code = 0;
       charinfo *cisrc = tok.get_char();
-      if (csrc != 0)
-        new_code = csrc;
-      else {
+      if (cisrc != 0 /* nullptr */)
+        // Common case: assign destination character the hyphenation code
+        // of the source character.
+        new_code = cisrc->get_hyphenation_code();
+      if (0 == csrc) {
         if (0 /* nullptr */ == cisrc) {
           error("expected ordinary or special character, got %1",
                 tok.description());
           break;
         }
-        // source character is special
-        if (0 == cisrc->get_hyphenation_code()) {
-          error("second member of hyphenation code pair must be an"
-                " ordinary character, or a special character already"
-                " assigned a hyphenation code");
-          break;
-        }
         new_code = cisrc->get_hyphenation_code();
       }
+      else {
+        // If assigning a ordinary character's hyphenation code to itself,
+        // use its character code point as the value.
+        if (csrc == cdst)
+          new_code = tok.ch();
+      }
       cidst->set_hyphenation_code(new_code);
       if (cidst->get_translation()
           && cidst->get_translation()->get_translation_input())

...and at your test case (the UTF-8 version for readability in Savannah,
**not** bug-reproducibility).

$ cat EXPERIMENTS/lanteronial-utf8.groff
.ll 1n
lanteronial
lanter\[~o]nial
.hcode \[~o] õ
lanter\[~o]nial

You've only got the one `hcode` invocation, so that's good.  What was
its path through the old code?

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=89623d044a207c1321bdf106d5f8d5d9e59b7ca1#n7278

Well, we have a bunch of validity checking/error handling first.
Eventually, if we've got two (mostly) valid arguments, we end up on line
7312.

If `csrc` is not zero, the source character is "ordinary".  (If it _is_
zero, it could be anything, like a horizontal motion escape sequence.
But in valid cases, if it's zero it's a special or indexed character.)

And so that branch should be taken for the "lanteronial" file.
`new_code` becomes its value (7315) and we skip to 7331, where the
`charinfo` of the destination character is set to that value.
We then worry about whether the destination character is "translated"
(which I **think** refers to `tr` translation but I haven't ruled out
`trin` or `trnt` translations instead, because it seems that no good
item of terminology should be permitted to apply to only one concept in
a program); if it is, that new code is immediately superseded by that of
its translation (7334).  Then the function ends.

Okay, what about _after_ the "bad commit"?

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=a52141ac46eef95dd1f85e4c2e0a336affa9bcc9#n7278

We start off again at line 7312.  We don't make a decision about `csrc`
right away.  Instead we gather the source character's hyphenation code
immediately, if it has one (7314-7317), then if the source character is
special, we proceed as before (7319-7324).

But in this case, the source character is ordinary, so we check to see
if the character is being assigned to itself, and if so apply this
"reflexive case" (7329-7330).  But we won't take that branch either
because the test on line 7329 will fail: `csrc` is 245 decimal, but
`cdst` is 0 because it's a special character.

We then hit line 7332 where we assign `new_code` to `cidst`.  But
remember line 7317.  `cisrc`'s hyphenation code would be zero, because
that's the value it has when the formatter starts up ("troff -R"), and
neither "en.tmac" nor "latin1.tmac" ever assigned it a hyphenation code.

The bottom line is that there _is_ a logic change.  Before the "bad
commit", `new_code` got populated presumptively with the character code
of the source character, **if the character was ordinary**.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp?id=89623d044a207c1321bdf106d5f8d5d9e59b7ca1#n7315

In the new logic, it doesn't.

It didn't occur to me that that assumption was warranted.  The character
code might not be meaningful as a hyphenation code in the language.
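
The two paths can be boiled down to a toy model.  This is only a sketch
under stated assumptions, not the groff source: `CharInfo`,
`HCODE_ERROR`, `old_new_code()`, and `revised_new_code()` are
hypothetical names, the error reports are reduced to a sentinel return
value, and the later translation handling is left out.

```cpp
#include <cassert>

// Hypothetical stand-in for groff's charinfo; only the hyphenation
// code is modeled here.
struct CharInfo {
  unsigned char hyphenation_code;
};

// Sentinel for the spots where the real function reports an error.
const int HCODE_ERROR = -1;

// Sketch of the logic before commit a52141ac: an ordinary source
// character (csrc != 0) presumptively supplies its own character code.
int old_new_code(unsigned char csrc, const CharInfo *cisrc)
{
  if (csrc != 0)
    return csrc;
  if (cisrc == nullptr)
    return HCODE_ERROR;          // not an ordinary or special character
  if (cisrc->hyphenation_code == 0)
    return HCODE_ERROR;          // special character without a code
  return cisrc->hyphenation_code;
}

// Sketch of the logic after commit a52141ac: start from whatever
// hyphenation code the source character already has; use the character
// code only when an ordinary character is assigned to itself.
int revised_new_code(unsigned char csrc, unsigned char cdst,
                     const CharInfo *cisrc)
{
  int new_code = 0;
  if (cisrc != nullptr)
    new_code = cisrc->hyphenation_code;
  if (csrc == 0) {
    if (cisrc == nullptr)
      return HCODE_ERROR;
    new_code = cisrc->hyphenation_code;
  }
  else if (csrc == cdst)
    new_code = csrc;             // the "reflexive case"
  return new_code;
}

// The `.hcode \[~o] õ` case under "troff -R": csrc is 245, cdst is 0
// (special character), and the source has no hyphenation code yet.
void check_lanteronial_case()
{
  CharInfo o_tilde = { 0 };
  assert(old_new_code(245, &o_tilde) == 245);
  assert(revised_new_code(245, 0, &o_tilde) == 0);
}
```

In this model the old path hands back 245 and the revised path hands
back 0 for the same request; the old path simply treats the character
code as meaningful as a hyphenation code.
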
`set_hyphenation_codes()` has, for many years, been aggressively
assuming that it was, if you had the audacity to use an ordinary
character as the source character (second argument) in an `hcode`
request.

I'd say the "bad commit" is a bug fix.

So we might retitle this ticket "[troff] behavior change in some .hcode
calls when an ordinary character is the second argument", and you can
guess what my proposed resolution is.

But I want to hear your take.

    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66919>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/