Re: an observation and proposal about hyphenation codes

G. Branden Robinson Tue, 06 Aug 2024 13:53:30 -0700

Hi Dave,

At 2024-08-06T15:28:25-0500, Dave Kemper wrote:
> On Tue, Aug 6, 2024 at 1:34 PM G. Branden Robinson
> <g.branden.robin...@gmail.com> wrote:
> > The hyphenation language (`.hla`) and hyphenation mode (`.hy`) are
> > the same for these two scenarios.
> 
> Yes, sloppy wording on my part.  By "default hyphenation" I meant no
> aspect of it was changed by the input file.  Command-line switches of
> course had an effect.


Understood.

> > Therefore these characters did not acquire nonzero hyphenation
> > codes, and therefore were not valid hyphenation breakpoints.
> >
> > Does this make sense?
> 
> Yes.  It makes me wonder about the wisdom of commit 0629380a9's move
> of the .hcode blocks.  That is, I understand the reasoning for it you
> and Werner put forth, that the underlying groff design didn't
> contemplate a single run needing different languages' hyphenation
> support.

But it also didn't quite rule it out.  We have been generating a
document bearing this requirement since before the 1.23.0 release --
groff-man-pages.{pdf,utf8.txt}.  It switches from English to Swedish and
back to render groff_mmse(7).

You can observe the dance that we perform to achieve this in our
"doc" directory's Automake file.

https://git.savannah.gnu.org/cgit/groff.git/tree/doc/doc.am?h=1.23.0#n251

> But tying an initial hyphenation scheme to a language seems to at
> least tie it to the right thing at the outset, whereas tying it to an
> encoding perhaps doesn't.

There are two aspects to the hyphenation scheme, in this sense.

1.  which characters are letters in the given character encoding
2.  which letters behave exactly like other letters for hyphenation
    purposes in a given language

Point 1 is determined by the character encoding.  Point 2 is too, in
part, for case-folding purposes.

The remainder of point 2 would cover situations like "hyphenate 'n' just
like 'ñ'", as Spanish hypothetically might.  However, to date, this
remainder has never been addressed by groff's hyphenation support.  It
could be--it just demands contributors with the requisite knowledge of
their language's hyphenation rules.
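
To make the two points concrete, here is a minimal sketch using the
`hcode` request.  It is an illustration only, not text copied from any
groff macro file.  It assumes the input is encoded so that each accented
letter reaches the formatter as a single ordinary character (Latin-1
input, for example), and its last request is the hypothetical Spanish
mapping mentioned above, which groff does not ship.

```roff
.\" Point 1: give a character a nonzero hyphenation code (here, its
.\" own) so that the formatter treats it as a letter.
.hcode é é
.\" Point 2, case-folding part: the uppercase form shares the
.\" lowercase form's code, so lowercase patterns apply to it as well.
.hcode É é
.\" Point 2, remainder (hypothetical; groff does nothing like this
.\" today): hyphenate 'n' just like 'ñ'.
.hcode n ñ
```

The first two requests depend only on the character encoding, which is
why they can live in an encoding macro file; the third depends on the
language, so it would belong in a localization file instead.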

You may notice something unusual about "latin5.tmac" in Git HEAD:

.hcode İ i \" exceptional case; move to tr.tmac if we ever get one

...which, I'll grant, makes "point 1" more complicated again.  Most
languages don't change the lettercase mapping rules.  Most languages
aren't Turkish.

I guess I should add

.hcode I ı

too, huh?

> > If so, what I will do is make "en.tmac" `.mso latin1.tmac`.
> 
> That will solve the problem for English.  Are there other language
> files that will need it?

Every other groff localization file for a Western language -- almost --
`mso`s an encoding macro file already.

$ grep mso tmac/{cs,de,den,es,fr,it,ru,sv}.tmac | grep -v trans
tmac/cs.tmac:.mso latin2.tmac
tmac/de.tmac:.mso latin1.tmac
tmac/den.tmac:.do mso de.tmac
tmac/es.tmac:.mso latin9.tmac
tmac/fr.tmac:.mso latin9.tmac
tmac/ru.tmac:.mso koi8-r.tmac
tmac/sv.tmac:.mso latin1.tmac

I will therefore add

.mso latin1.tmac

to both "en.tmac" _and_ "it.tmac".

> Will some language files need other tmac/latin*.tmac sourced?

Yes, but they have them already, and in some cases for a long time.

$ git blame tmac/fr.tmac | grep 'mso.*latin'
fd7264f136 (Werner LEMBERG      2006-02-07 05:46:08 +0000 156) .mso latin9.tmac

Regards,
Branden
