Re: hyphenating non-english characters

G. Branden Robinson Sat, 27 Jul 2024 13:54:30 -0700

Hi Gáspár,

At 2024-07-27T08:51:32+0200, Gáspár Gergő wrote:
> I'm trying to make justified text look nicer, so I turned to
> hyphenation.


A reasonable desire.

> Hungarian is not supported out of the box by groff, but I found a tex
> patterns file which seems quite good, that is what I tried to use, to
> not much success.

You didn't indicate where the hyphenation pattern file came from
exactly (though its name is suggestive), or attach it, but odds are it's
UTF-8 encoded, and that is a problem for GNU troff in its present state,
which supports only single-byte encodings.  Further, in groff 1.24,
support for the ASCII-incompatible CCSID ["code page"] 1047 will be
withdrawn.  (This matters because UTF-8 is a compatible extension of ISO
646 a.k.a. "ASCII" but not of CCSID 1047.)

https://savannah.gnu.org/bugs/?65724

The good news is that if you convert the hyphenation pattern file to ISO
Latin-2 (ISO 8859-2), you should be able to use it.

> Hyphenation happens, but not as often as I'd hope.  After some more
> reading, I think the problem might be with the accented Hungarian
> characters not having hyphenation codes assigned to them, since
> hyphenation seemingly only happens near non-accented vowels.

That seems likely to me.  The code that reads hyphenation patterns
simply does not expect a multi-byte character encoding.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/env.cpp?h=1.23.0#n3856

> These are the requests that I used for hyphenation originally:
> .hla hu
> .hpf /home/gergo/projekt/jella/huhyphn.tex
> .hy 1

This is a good but incomplete setup.  As you noted below, `hcode`
requests to assign hyphenation codes to non-ASCII code points are also
necessary.

It will also help to tell GNU troff to use an accurate input character
mapping.

.mso latin2.tmac

I would insert that request before the `hpf` request, and invoke `hcode`
requests _before_ `hpf`, but I will admit I haven't experimented with
arbitrary arrangements of these to see what breaks hyphenation support.

I will say with some confidence that `hla` should come first, because
the hyphenation language is a property of the environment, and is the
way you tell GNU troff which language you'll be defining hyphenation
codes and patterns for.
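
Putting that ordering together, a minimal and untested sketch might look
like the following.  The file name "huhyphn.latin2.tex" is hypothetical;
it stands for your pattern file after recoding to ISO 8859-2.  The
document itself would also need to be saved in that encoding, and only
two of the `hcode` lines are shown.

```roff
.\" Sketch only; assumes this file and the pattern file are ISO 8859-2.
.\" Hyphenation language first: it says which language the codes and
.\" patterns that follow belong to.
.hla hu
.\" Input character mapping for Latin-2.
.mso latin2.tmac
.\" Hyphenation codes for the accented letters (remaining ones omitted).
.hcode á á Á á
.hcode ű ű Ű ű
.\" Hypothetical name for the recoded pattern file.
.hpf /home/gergo/projekt/jella/huhyphn.latin2.tex
.hy 1
```

As said, I haven't tested other arrangements, so treat the order of the
middle three steps as a suggestion rather than a requirement.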

Due to that property and because you can change environments, it's not
difficult to support multilingual documents that apply appropriate
hyphenation rules to the input language.
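
As a rough, untested illustration of that idea: the environment name
"hungarian" below is arbitrary, and the sketch assumes the Hungarian
patterns were already loaded with `hpf` while `hla hu` was in effect.

```roff
.\" Sketch only; the environment name "hungarian" is arbitrary.
.ev hungarian
.\" Start from the default environment's settings (line length, etc.).
.evc 0
.hla hu
.hy 1
.ev
.\" Later, in the body of the document:
English text is set in the default environment.
.br
.ev hungarian
Hungarian text goes here.
.br
.ev
```

The breaks are there because each environment keeps its own partially
collected output line.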

> The manual told me that
>
> > A hyphenation code must be an ordinary character (not a special
> > character escape sequence) other than a digit or a space.

Yes.  In groff 1.24, loading "latin2.tmac" is what will _make_ input
characters like ű "ordinary" (otherwise they're out of range and get
warned about).

$ ~/groff-stable/bin/groff --version | head -n 1
GNU groff version 1.23.0
$ printf '\370\n' | ~/groff-stable/bin/groff -a -winput
<beginning of page>
</o>
$ groff --version | head -n 1
GNU groff version 1.23.0.1589-aa44
$ printf '\370\n' | groff -a -winput
<beginning of page>
troff:<standard input>:1: warning: character with input code 248 not defined

I haven't decided yet, but 1.24's default might be to _not_ load
"latin1.tmac".  English is the default language and most English
doesn't need Latin-1 characters; more importantly, this gets users into
position for a conversion to UTF-8 support in, I hope, groff 1.25.  I
don't think there is time, given current developer resources, to get
groff to UTF-8 from end to end in one fell swoop in time for a release
this year.  Also, assuming that one is going from an 8-bit character
encoding to UTF-8 leads to mojibake.  Because UTF-8 is compatible with
7-bit ASCII, if groff 1.24 gets people to either (1) use ASCII or (2)
make their documents (or their system, via "troffrc" or similar)
_declare the default encoding they expect_, we can avoid frustrating
them later.

See <https://savannah.gnu.org/bugs/?65955>.

> So I tried following the example, with these requests:
> .hcode á á Á á
> .hcode é é É é
> .hcode í í Í í
> .hcode ó ó Ó ó
> .hcode ö ö Ö ö
> .hcode ő ő Ő ő
> .hcode ú ú Ú ú
> .hcode ü ü Ü ü
> .hcode ű ű Ű ű

I expect that to work if you (1) get the hyphenation patterns recoded to
Latin-2 and (2) load "latin2.tmac" before attempting to format text.

> However, groff throws errors saying "error: hyphenation code must be
> ordinary character".

If you're using groff 1.23 or earlier, I'm a little surprised by this.

If you're using groff Git's master branch of recent vintage, this is
what I would expect.

> I tried with and without preconv to no avail.

preconv(1) is important in general for non-English language support in
groff, but does not apply to this situation because hyphenation pattern
files are parsed independently of text formatting and input processing.

> The example supplied in the manual, with German characters, didn't
> work either.

That is also a surprise, if you're running groff 1.23.  I will attempt
to reproduce this.  In Git, it's not a surprise; it is a consequence of
my not yet having updated the documentation after I started down the
road of resolving Savannah ticket #65724.

It _should_ work if you load the German localization file.

.mso de.tmac

Or "groff -mde".  But that will also perform the `hcode` requests for
you.

> What could be the problem here?

I hope the foregoing has shed some light.

If you resolve these issues you'll be well along the path to being able
to contribute Hungarian language support to groff, which I would be
interested to integrate.  The main items remaining would, I think, be:

1.  Select appropriate automatic hyphenation modes (these are dictated
    by the TeX hyphenation patterns).

    https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/LOCALIZATION

2.  Translate some stock words and phrases used by our macro packages.

    Here are the German ones, for example.

    https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/de.tmac?h=1.23.0#n39

Regards,
Branden
