[bug #66112] Map Latin-1 Supplement character hyphenation codes to their base-character equivalents

Dave Mon, 19 Aug 2024 13:21:24 -0700

URL:
  <https://savannah.gnu.org/bugs/?66112>


                 Summary: Map Latin-1 Supplement character hyphenation codes
to their base-character equivalents
                   Group: GNU roff
               Submitter: barx
               Submitted: Mon 19 Aug 2024 03:21:05 PM CDT
                Category: Macro package - others/general
                Severity: 1 - Wish
              Item Group: Feature change
                  Status: None
                 Privacy: Public
             Assigned to: None
             Open/Closed: Open
         Discussion Lock: Any


    _______________________________________________________

Follow-up Comments:


-------------------------------------------------------
Date: Mon 19 Aug 2024 03:21:05 PM CDT By: Dave <barx>
This is a sequel to the recently fixed bug #59397.

Branden presents a simple test file and its result in that bug report.

$ cat EXPERIMENTS/hw-with-special-characters.groff
.ll 1n
r\['e]sum\['e]
.hcode \['e]e
r\['e]sum\['e]
.pl \n[nl]u
$ ~/groff-1.22.3/bin/groff -ww -W break -T utf8 \
    EXPERIMENTS/hw-with-special-characters.groff
résumé
ré-
sumé

(I'm not sure why he chose groff 1.22.3 to illustrate this; 1.22.4 and 1.23
give the same result.)

Running this test file after #59397's pair of fixes
([http://git.savannah.gnu.org/cgit/groff.git/commit/?id=0629380a9 commit
0629380a9] and
[http://git.savannah.gnu.org/cgit/groff.git/commit/?id=56e793e73 commit
56e793e73]) have been applied changes the output to:

ré-
sumé
ré-
sumé


The #59397 fixes do what that bug report's summary and original submission
asked for: assign hcodes to the Latin-1 Supplement characters.

What those fixes did not do was this extra step Branden wondered about: "it
might be an open question as to whether letters from outside the basic Latin
alphabet should necessarily be hyphenated like their basic Latin 'base
characters'."

I addressed this question, writing: "When a diacritic changes the
syllabication, such as 'expose' vs 'exposé', it will pretty much (I hedge,
but can't think of any exceptions) always do so by adding a syllable, and thus
a potential break point.  The patterns, presumably, are set up for the
unaccented form, meaning groff will never use the additional break point
offered by the accented form.  But that's fine: it's better to not break a
word in an acceptable spot than to break one in an unacceptable spot.

"And anyway, those are the rarer cases.  More commonly, the break points won't
change, such as whether 'coöperate', 'doppelgänger', or 'débâcle' are
written with or without the diacritics."

This point was not subsequently addressed, so this might still be an open
question.  But let's look at some real words.  The set below is limited to the
decorated characters è, ë, ç, and ö.

$ cat 59397.newtest
.hy 4
.
.ll 1n
co\[:o]rdinating
fa\[,c]ade
gar\[,c]onni\[`e]re
pre\[:e]mptively
re\[:e]nacted
unco\[:o]perative
$ nroff -Wbreak 59397.newtest | cat -s
coör-
di-
nat-
ing
façade
garçon-
nière
preëmp-
tively
reë-
n-
acted
un-
coöper-
a-
tive

Post-#59397 groff misses several hyphenation points above (and has at least
one bogus one).  Now let's see what happens if we make the "open question"
change for the four accented characters used in this sample set.

$ ( echo '.hcode \[,c] c \[:e] e \[`e] e \[:o] o'; cat 59397.newtest ) \
    | nroff -Wbreak | cat -s
co-
ör-
di-
nat-
ing
fa-
çade
gar-
çon-
nière
pre-
ëmp-
tively
reën-
acted
un-
co-
öp-
er-
a-
tive


This is a clear improvement.  While "reënacted" should still have an
additional hyphenation point, this is unrelated to the diaeresis; the
unadorned word is hyphenated that way too.  And certainly one of its previous
break points, "reë- nacted", was subpar.
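
To make the two runs easier to compare, the one-fragment-per-line output can
be folded back into words with their break points marked.  The following is a
minimal Python sketch, not part of groff; the function name fold_breaks is
hypothetical, and the data is copied from the two transcripts above.

```python
def fold_breaks(fragments):
    """Join one-fragment-per-line nroff output back into whole words.

    A fragment ending in '-' continues the current word; any other
    fragment ends it.  The '-' is kept to mark each break point.
    """
    words, current = [], ""
    for fragment in fragments:
        fragment = fragment.strip()
        if not fragment:
            continue  # skip blank output lines
        current += fragment
        if not fragment.endswith("-"):
            words.append(current)
            current = ""
    if current:  # trailing fragment with no word-final piece
        words.append(current)
    return words


# Output fragments copied from the transcripts in this report.
before = ("coör- di- nat- ing façade garçon- nière preëmp- tively "
          "reë- n- acted un- coöper- a- tive").split()
after = ("co- ör- di- nat- ing fa- çade gar- çon- nière pre- ëmp- tively "
         "reën- acted un- co- öp- er- a- tive").split()

if __name__ == "__main__":
    # Left column: post-#59397 groff; right column: with the extra .hcode line.
    for old, new in zip(fold_breaks(before), fold_breaks(after)):
        print(f"{old:<20} {new}")
```

Each row then shows one test word under both configurations, so a missing or
bogus break point stands out without reading the output line by line.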

So this report requests that groff adopt these additional mappings.

I'm not certain whether they're more appropriate for tmac/latin1.tmac (which
is where #59397 added the setting of hyphenation codes for these characters)
or tmac/en.tmac.

If the latter, they may not need to include every Latin-1 alphabetic
character; I'm not aware of any English words that use a thorn or O with
stroke, for example.  There's also not any reasonable single ASCII letter to
map the thorn to; the O with stroke, for completeness, could be mapped to an
ordinary O.
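
As a rough illustration of which mappings are in question, the following
Python sketch (not part of groff) lists the Latin-1 Supplement letters that
decompose into a basic Latin letter plus one accent and prints an .hcode
request for each.  The glyph-name prefixes follow groff_char(7).  The helper
names, and the choice to map capitals to the lowercase base letter, are
assumptions made for this sketch, not something the #59397 fixes or this
report specify.  The o-with-stroke entries are added by hand, per the
suggestion above; thorn, eth, sharp s, and the ae ligature are left out because
they have no single base letter.

```python
import unicodedata

# groff special-character name prefixes for each combining mark,
# as listed in groff_char(7): e.g. \[:o] is o with dieresis.
ACCENT_PREFIX = {
    "\u0300": "`",  # grave
    "\u0301": "'",  # acute
    "\u0302": "^",  # circumflex
    "\u0303": "~",  # tilde
    "\u0308": ":",  # dieresis
    "\u030A": "o",  # ring
    "\u0327": ",",  # cedilla
}

# Assumption: o with stroke maps to plain o, as the report suggests
# "for completeness".  It has no Unicode decomposition, so list it by hand.
STROKE_O = [("\\[/O]", "o"), ("\\[/o]", "o")]


def hcode_pairs():
    """Return (groff glyph name, base letter) pairs for U+00C0..U+00FF."""
    pairs = []
    for codepoint in range(0xC0, 0x100):
        char = chr(codepoint)
        if not char.isalpha():
            continue  # skips the multiplication and division signs
        decomposed = unicodedata.normalize("NFD", char)
        if len(decomposed) != 2 or decomposed[1] not in ACCENT_PREFIX:
            continue  # no base-plus-accent form (thorn, eth, sharp s, ...)
        base, mark = decomposed
        pairs.append((f"\\[{ACCENT_PREFIX[mark]}{base}]", base.lower()))
    return pairs


if __name__ == "__main__":
    for glyph, code in hcode_pairs() + STROKE_O:
        print(f".hcode {glyph} {code}")
```

The printed requests are only a candidate list to review; whether they belong
in tmac/latin1.tmac or tmac/en.tmac, and whether all of them are wanted,
remains the open question described above.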






