URL: <https://savannah.gnu.org/bugs/?66112>
Summary: Map Latin-1 Supplement character hyphenation codes to their base-character equivalents
Group: GNU roff
Submitter: barx
Submitted: Mon 19 Aug 2024 03:21:05 PM CDT
Category: Macro package - others/general
Severity: 1 - Wish
Item Group: Feature change
Status: None
Privacy: Public
Assigned to: None
Open/Closed: Open
Discussion Lock: Any

_______________________________________________________

Follow-up Comments:

-------------------------------------------------------
Date: Mon 19 Aug 2024 03:21:05 PM CDT By: Dave <barx>
This is a sequel to the recently fixed bug #59397. Branden presents a simple test file and its result in that bug report.

$ cat EXPERIMENTS/hw-with-special-characters.groff
.ll 1n
r\['e]sum\['e]
.hcode \['e]e
r\['e]sum\['e]
.pl \n[nl]u
$ ~/groff-1.22.3/bin/groff -ww -W break -T utf8 EXPERIMENTS/hw-with-special-characters.groff
résumé
ré-
sumé

(I'm not sure why he chose groff 1.22.3 to illustrate this; 1.22.4 and 1.23 give the same result.)

Running this test file after #59397's pair of fixes ([http://git.savannah.gnu.org/cgit/groff.git/commit/?id=0629380a9 commit 0629380a9] and [http://git.savannah.gnu.org/cgit/groff.git/commit/?id=56e793e73 commit 56e793e73]) has been applied changes the output to:

ré-
sumé
ré-
sumé

The #59397 fixes do what that bug report's summary and original submission asked for: assigned hcodes to the Latin-1 Supplement characters. What those fixes did not do was this extra step Branden wondered about: "it might be an open question as to whether letters from outside the basic Latin alphabet should necessarily be hyphenated like their basic Latin 'base characters'."

I addressed this question, writing:

"When a diacritic changes the syllabication, such as 'expose' vs 'exposé', it will pretty much (I hedge, but can't think of any exceptions) always do so by adding a syllable, and thus a potential break point.
The patterns, presumably, are set up for the unaccented form, meaning groff will never use the additional break point offered by the accented form. But that's fine: it's better to not break a word in an acceptable spot than to break one in an unacceptable spot.

"And anyway, those are the rarer cases. More commonly, the break points won't change, such as whether 'coöperate', 'doppelgänger', or 'débâcle' are written with or without the diacritics."

This point was not subsequently addressed, so this might still be an open question. But let's look at some real words. The below set is limited to the decorated characters è, ë, ç, and ö.

$ cat 59397.newtest
.hy 4
.
.ll 1n
co\[:o]rdinating fa\[,c]ade gar\[,c]onni\[`e]re pre\[:e]mptively re\[:e]nacted unco\[:o]perative
$ nroff -Wbreak 59397.newtest | cat -s
coör-
di-
nat-
ing
façade
garçon-
nière
preëmp-
tively
reë-
n-
acted
un-
coöper-
a-
tive

Post-#59397 groff misses several hyphenation points above (and has at least one bogus one). Now let's see what happens if we make the "open question" change for the four accented characters used in this sample set.

$ ( echo '.hcode \[,c] c \[:e] e \[`e] e \[:o] o'; cat 59397.newtest ) | nroff -Wbreak | cat -s
co-
ör-
di-
nat-
ing
fa-
çade
gar-
çon-
nière
pre-
ëmp-
tively
reën-
acted
un-
co-
öp-
er-
a-
tive

This is a clear improvement. While "reënacted" should still have an additional hyphenation point, this is unrelated to the diaeresis; the unadorned word is hyphenated that way too. And certainly one of its previous break points, "reë- nacted", was subpar.

So this report requests that groff adopt these additional mappings. I'm not certain whether they're more appropriate for tmac/latin1.tmac (which is where #59397 added the setting of hyphenation codes for these characters) or tmac/en.tmac. If the latter, they may not need to include every Latin-1 alphabetic character; I'm not aware of any English words that use a thorn or O with stroke, for example.
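
For illustration only, here is a minimal sketch of what a fuller set of such mappings might look like, extending the four-character .hcode line tested above to the other accented lowercase letters in Latin-1. The selection is an assumption made for this example and is not taken from any shipped groff macro file; the special character names are the ones documented in groff_char(7). Uppercase forms, thorn, eth, and the ligature-like letters are deliberately left out.

```groff
.\" Hypothetical sketch, not the contents of latin1.tmac or en.tmac:
.\" give each accented lowercase letter the hyphenation code of its
.\" unaccented base letter, so existing patterns apply to it.
.hcode \[`a] a \['a] a \[^a] a \[~a] a \[:a] a \[oa] a
.hcode \[,c] c
.hcode \[`e] e \['e] e \[^e] e \[:e] e
.hcode \[`i] i \['i] i \[^i] i \[:i] i
.hcode \[~n] n
.hcode \[`o] o \['o] o \[^o] o \[~o] o \[:o] o
.hcode \[`u] u \['u] u \[^u] u \[:u] u
.hcode \['y] y \[:y] y
```

These lines could be prepended to a test file in the same way as the echo command in the experiment above, to compare break points with and without the mappings.
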
There's also not any reasonable single ASCII letter to map the thorn to; the O with stroke, for completeness, could be mapped to an ordinary O.

_______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?66112>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/