Re: an observation and proposal about hyphenation codes

2024-08-06 Thread Dave Kemper
On Wed, Jul 31, 2024 at 11:51 PM Werner LEMBERG wrote:
> > However, in the meantime, meaning for groff 1.24, I propose to move
> > `hcode` definitions to where they make more sense: the character set
> > macro files "koi8-r.tmac", "latin1.tmac", "latin2.tmac", and
> > "latin9.tmac".  (If/when I do that, I'll need to update the
> > "tmac/LOCALIZATION" file accordingly.)
>
> Probably a good idea.  The few cases where this has to be changed
> (classical example: Turkish needs 'İ' mapped to 'i' and 'I' mapped to
> 'ı') can be overridden in a language-specific hyphenation setup.

This change was made in commit 0629380a9
(http://git.savannah.gnu.org/cgit/groff.git/commit/?id=0629380a9).
But it doesn't seem to be the right solution, because now hyphenation
varies depending on output format.  This surely can't be the intent.

$ cat test.59397
.hy 4
.ll 1u
resume
r\['e]sum\['e]
$ groff --version | head -1
GNU groff version 1.23.0.1624-4d251-dirty
$ groff -Wbreak -Tlatin1 test.59397 | cat -s
re-
sume
ré-
sumé

$ groff -Wbreak -a test.59397

re
sume
r<'e>sum<'e>



Re: an observation and proposal about hyphenation codes

2024-08-06 Thread G. Branden Robinson
At 2024-08-06T08:41:49-0500, Dave Kemper wrote:
> On Wed, Jul 31, 2024 at 11:51 PM Werner LEMBERG  wrote:
> > > However, in the meantime, meaning for groff 1.24, I propose to move
> > > `hcode` definitions to where they make more sense: the character set
> > > macro files "koi8-r.tmac", "latin1.tmac", "latin2.tmac", and
> > > "latin9.tmac".  (If/when I do that, I'll need to update the
> > > "tmac/LOCALIZATION" file accordingly.)
> >
> > Probably a good idea.  The few cases where this has to be changed
> > (classical example: Turkish needs 'İ' mapped to 'i' and 'I' mapped to
> > 'ı') can be overridden in a language-specific hyphenation setup.
> 
> This change was made in commit 0629380a9
> (http://git.savannah.gnu.org/cgit/groff.git/commit/?id=0629380a9).
> But it doesn't seem to be the right solution, because now hyphenation
> varies depending on output format.  This surely can't be the intent.
> 
> $ cat test.59397
> .hy 4
> .ll 1u
> resume
> r\['e]sum\['e]
> $ groff --version | head -1
> GNU groff version 1.23.0.1624-4d251-dirty
> $ groff -Wbreak -Tlatin1 test.59397 | cat -s
> re-
> sume
> ré-
> sumé
> 
> $ groff -Wbreak -a test.59397
> 
> re
> sume
> r<'e>sum<'e>

I'm thinking this has more to do with line length handling than
hyphenation; `-ww` before `-Wbreak` might have helped.

And the line length will of course have a big impact on where
hyphenation breaks occur.

But as it happens I can't reproduce this misbehavior anyway.

$ cat EXPERIMENTS/resume-special.groff
.ec @
.ll 1u
r@['e]sum@['e]
.hcode @['e] e
r@['e]sum@['e]
.hcode @['E] @['e]
R@['E]SUM@['E]
.pl @n[nl]u

(The changed escape character is because the foregoing is the basis of a
regression test; it gets stuffed into a shell variable assignment, and I
am fearful of Unix shell vendors not getting backslash handling correct
or consistent.)

$ ./build/test-groff -Tutf8 -ww -Wbreak EXPERIMENTS/resume-special.groff
troff:EXPERIMENTS/resume-special.groff:2: warning: setting computed line length 0u to device horizontal motion quantum
ré‐
sumé
ré‐
sumé
RÉ‐
SUMÉ
$ ./build/test-groff -Tps -a -ww -Wbreak EXPERIMENTS/resume-special.groff

r<'e>sum<'e>
r<'e>
sum<'e>
R<'E>
SUM<'E>

I'm running my working copy, of course, but that hasn't changed much
since I pushed.

99a5af3c7 (HEAD -> master) ChangeLog: Add bug-closer for old entry.
4d2514bd4 (origin/master, origin/HEAD) [docs]: Fix content, style, and markup nits.

Can you reproduce my results above, using my input and command lines?

Regards,
Branden




Re: an observation and proposal about hyphenation codes

2024-08-06 Thread Dave Kemper
On Tue, Aug 6, 2024 at 9:48 AM G. Branden Robinson wrote:
> I'm thinking this has more to do with line length handling than
> hyphenation; `-ww` before `-Wbreak` might have helped.

I'm half-certain it has to do with when latin1.tmac is loaded and when it isn't.

$ echo ".tm Hi, I'm latin1.tmac!" >> tmac/latin1.tmac
$ groff-latest -a < /dev/null
$ groff-latest -Tutf8 < /dev/null
Hi, I'm latin1.tmac!
$ groff-latest -Tascii < /dev/null
$

OK, now I'm certain.

> But as it happens I can't reproduce this misbehavior anyway.

You DID reproduce it.  Look at the first output line of each of your test cases:

> $ ./build/test-groff -Tutf8 -ww -Wbreak EXPERIMENTS/resume-special.groff
> troff:EXPERIMENTS/resume-special.groff:2: warning: setting computed line length 0u to device horizontal motion quantum
> ré‐
> sumé

vs

> $ ./build/test-groff -Tps -a -ww -Wbreak EXPERIMENTS/resume-special.groff
> 
> r<'e>sum<'e>

This is the only line in your test file output before any .hcode
requests were run, so this shows the default hyphenation for the
system.



Re: an observation and proposal about hyphenation codes

2024-08-06 Thread G. Branden Robinson
Hi Dave,

At 2024-08-06T12:08:29-0500, Dave Kemper wrote:
> On Tue, Aug 6, 2024 at 9:48 AM G. Branden Robinson
> I'm [...]certain it has to do with when latin1.tmac is loaded and when
> it isn't.
> 
> $ echo ".tm Hi, I'm latin1.tmac!" >> tmac/latin1.tmac
> $ groff-latest -a < /dev/null
> $ groff-latest -Tutf8 < /dev/null
> Hi, I'm latin1.tmac!
> $ groff-latest -Tascii < /dev/null
> $
[...]
> You DID reproduce it.  Look at the first output line of each of your
> test cases:

Yes, you've got it.  I:

1.  hyperfocused on the full-caps RÉSUMÉ case because that was the
failing instance in a regression test recently added to the suite (a
case contributed by you, as I recall), and

2.  forgot that "en.tmac" is going to have to select a character
encoding even if none of the hyphenation patterns in "hyphen.en"
actually use characters from the Latin-1 Supplement (and they
don't).

You can even/still override the language's choice of character encoding.
Caveat dictator.

$ ./build/test-groff -Tps -a -m latin1 -ww -Wbreak EXPERIMENTS/resume-special.groff
.hy=4

r<'e>
sum<'e>
r<'e>
sum<'e>
R<'E>
SUM<'E>
$ ./build/test-groff -Tps -a -m latin9 -ww -Wbreak EXPERIMENTS/resume-special.groff
.hy=4

r<'e>
sum<'e>
r<'e>
sum<'e>
R<'E>
SUM<'E>

> OK, now I'm certain.
> 
> > But as it happens I can't reproduce this misbehavior anyway.
> 
> > $ ./build/test-groff -Tutf8 -ww -Wbreak EXPERIMENTS/resume-special.groff
> > troff:EXPERIMENTS/resume-special.groff:2: warning: setting computed line length 0u to device horizontal motion quantum
> > ré‐
> > sumé
> 
> vs
> 
> > $ ./build/test-groff -Tps -a -ww -Wbreak EXPERIMENTS/resume-special.groff
> > 
> > r<'e>sum<'e>
> 
> This is the only line in your test file output before any .hcode
> requests were run, so this shows the default hyphenation for the
> system.

Well, kind of.  The hyphenation language (`.hla`) and hyphenation mode
(`.hy`) are the same for these two scenarios.  What's happened is that
these requests in "latin1.tmac" didn't get read, because the file wasn't
sourced at all.

.hcode é é
.hcode É é

Therefore these characters did not acquire nonzero hyphenation codes,
and therefore were not valid hyphenation breakpoints.
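
Until the language file sources the encoding file, a document can assign
the codes itself.  The following is only a minimal sketch, adapted from
the test input earlier in this thread but using the default escape
character.  Folding both accented forms onto plain "e" is an assumption
made for illustration; it is not what "latin1.tmac" does, where é keeps
its own code.

```roff
.\" Sketch only: assign hyphenation codes in the document itself
.\" instead of relying on "latin1.tmac" having been sourced.
.\" Assumption: both accented forms are folded onto plain "e".
.hcode \['e] e
.hcode \['E] e
.\" Tiny line length to force a break at every opportunity.
.ll 1u
r\['e]sum\['e]
R\['E]SUM\['E]
.\" Trim the page length to the text actually produced.
.pl \n[nl]u
```

This can be run with the same options as the `-Tps -a` runs above.  No
expected output is given here, since the break points depend on the groff
version in use.  Passing `-m latin1` on the command line, as in the runs
above, is the alternative that needs no change to the document.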

Does this make sense?

If so, what I will do is make "en.tmac" `.mso latin1.tmac`.

And add another regression test case.

Thanks for the report!

The subtleties involved in machine-driven hyphenation seem to be
endless.  Someone ought to write a Ph.D. thesis about how hard it is.[1]

Regards,
Branden

[1] Yes, I know they did.  I added a citation of it to the groff Texinfo
manual a while back.




Re: an observation and proposal about hyphenation codes

2024-08-06 Thread Dave Kemper
On Tue, Aug 6, 2024 at 1:34 PM G. Branden Robinson wrote:
> At 2024-08-06T12:08:29-0500, Dave Kemper wrote:
> > This is the only line in your test file output before any .hcode
> > requests were run, so this shows the default hyphenation for the
> > system.
>
> Well, kind of.  The hyphenation language (`.hla`) and hyphenation mode
> (`.hy`) are the same for these two scenarios.

Yes, sloppy wording on my part.  By "default hyphenation" I meant no
aspect of it was changed by the input file.  Command-line switches of
course had an effect.

> Therefore these characters did not acquire nonzero hyphenation codes,
> and therefore were not valid hyphenation breakpoints.
>
> Does this make sense?

Yes.  It makes me wonder about the wisdom of commit 0629380a9's move
of the .hcode blocks.  That is, I understand the reasoning for it you
and Werner put forth, that the underlying groff design didn't
contemplate a single run needing different languages' hyphenation
support.  But tying an initial hyphenation scheme to a language seems
to at least tie it to the right thing at the outset, whereas tying it
to an encoding perhaps doesn't.

> If so, what I will do is make "en.tmac" `.mso latin1.tmac`.

That will solve the problem for English.  Are there other language
files that will need it?  Will some language files need other
tmac/latin*.tmac sourced?  Those are questions beyond my monolingual
knowledge.



Re: an observation and proposal about hyphenation codes

2024-08-06 Thread G. Branden Robinson
Hi Dave,

At 2024-08-06T15:28:25-0500, Dave Kemper wrote:
> On Tue, Aug 6, 2024 at 1:34 PM G. Branden Robinson
>  wrote:
> > The hyphenation language (`.hla`) and hyphenation mode (`.hy`) are
> > the same for these two scenarios.
> 
> Yes, sloppy wording on my part.  By "default hyphenation" I meant no
> aspect of it was changed by the input file.  Command-line switches of
> course had an effect.

Understood.

> > Therefore these characters did not acquire nonzero hyphenation
> > codes, and therefore were not valid hyphenation breakpoints.
> >
> > Does this make sense?
> 
> Yes.  It makes me wonder about the wisdom of commit 0629380a9's move
> of the .hcode blocks.  That is, I understand the reasoning for it you
> and Werner put forth, that the underlying groff design didn't
> contemplate a single run needing different languages' hyphenation
> support.

But it also didn't quite rule it out.  We have been generating a
document bearing this requirement since before the 1.23.0 release --
groff-man-pages.{pdf,utf8.txt}.  It switches from English to Swedish and
back to render groff_mmse(7).

You can observe the dance that we perform to achieve this in our
"doc" directory's Automake file.

https://git.savannah.gnu.org/cgit/groff.git/tree/doc/doc.am?h=1.23.0#n251

> But tying an initial hyphenation scheme to a language seems to at
> least tie it to the right thing at the outset, whereas tying it to an
> encoding perhaps doesn't.

There are two aspects to the hyphenation scheme, in this sense.

1.  which characters are letters in the given character encoding
2.  which letters behave exactly like other letters for hyphenation
purposes in a given language

Point 1 is determined by the character encoding.  Point 2 is too, in
part, for case-folding purposes.

The remainder of point 2 would cover situations like "hyphenate 'n' just
like 'ñ'", as Spanish hypothetically might.  However, to date, this
remainder has never been addressed by groff's hyphenation support.  It
could be--it just demands contributors with the requisite knowledge of
their language's hyphenation rules.
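
To illustrate that remainder, here is a sketch of what such a mapping
might look like.  It is hypothetical: nothing like it is in groff's
"es.tmac", and whether Spanish hyphenation actually wants it is exactly
the kind of knowledge a contributor would have to supply.

```roff
.\" HYPOTHETICAL: not part of groff's "es.tmac".
.\" Give n-tilde, in both lettercases, the hyphenation code of plain
.\" "n", so that the patterns treat the two letters alike.
.hcode \[~n] n
.hcode \[~N] n
```

A language-specific file would be the natural home for requests like
these, since the same encoding serves languages that would not want the
mapping.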

You may notice something unusual about "latin5.tmac" in Git HEAD:

.hcode İ i \" exceptional case; move to tr.tmac if we ever get one

...which, I'll grant, makes "point 1" more complicated again.  Most
languages don't change the lettercase mapping rules.  Most languages
aren't Turkish.

I guess I should add

.hcode I ı

too, huh?

> > If so, what I will do is make "en.tmac" `.mso latin1.tmac`.
> 
> That will solve the problem for English.  Are there other language
> files that will need it?

Every other groff localization file for a Western language -- almost --
`mso`s an encoding macro file already.

$ grep mso tmac/{cs,de,den,es,fr,it,ru,sv}.tmac | grep -v trans
tmac/cs.tmac:.mso latin2.tmac
tmac/de.tmac:.mso latin1.tmac
tmac/den.tmac:.do mso de.tmac
tmac/es.tmac:.mso latin9.tmac
tmac/fr.tmac:.mso latin9.tmac
tmac/ru.tmac:.mso koi8-r.tmac
tmac/sv.tmac:.mso latin1.tmac

I will therefore add

.mso latin1.tmac

to both "en.tmac" _and_ "it.tmac".

> Will some language files need other tmac/latin*.tmac sourced?

Yes, but they have them already, and in some cases for a long time.

$ git blame tmac/fr.tmac | grep 'mso.*latin'
fd7264f136 (Werner LEMBERG  2006-02-07 05:46:08 + 156) .mso latin9.tmac

Regards,
Branden




Re: an observation and proposal about hyphenation codes

2024-08-06 Thread Dave Kemper
On Tue, Aug 6, 2024 at 3:53 PM G. Branden Robinson wrote:
> At 2024-08-06T15:28:25-0500, Dave Kemper wrote:
> > That will solve the problem for English.  Are there other language
> > files that will need it?
>
> Every other groff localization file for a Western language -- almost --
> `mso`s an encoding macro file already.
>
> > Will some language files need other tmac/latin*.tmac sourced?
>
> Yes, but they have them already, and in some cases for a long time.

Ah, two questions I could've answered myself by merely looking at the
relevant files.  Thanks for the spoon-feeding.