Hi Gáspár,

At 2024-07-27T08:51:32+0200, Gáspár Gergő wrote:
> I'm trying to make justified text look nicer, so I turned to
> hyphenation.

A reasonable desire.

> Hungarian is not supported out of the box by groff, but I found a tex
> patterns file which seems quite good, that is what I tried to use, to
> not much success.

You didn't indicate where the hyphenation pattern file came from exactly
(though its name is suggestive), or attach it, but odds are it's UTF-8
encoded, and that is a problem for GNU troff in its present state, which
supports only single-byte encodings.  Further, in groff 1.24, support
for the ASCII-incompatible CCSID ["code page"] 1047 will be withdrawn.
(This matters because UTF-8 is a compatible extension of ISO 646 a.k.a.
"ASCII" but not of CCSID 1047.)

https://savannah.gnu.org/bugs/?65724

The good news is that if you convert the hyphenation pattern file to ISO
Latin-2 (ISO 8859-2), you should be able to use it.

> Hyphenation happens, but not as often as I'd hope. After some more
> reading, I think the problem might be with the accented Hungarian
> characters not having hyphenation codes assigned to them, since
> hyphenation seemingly only happens near non-accented vowels.

That seems likely to me.  The code that reads hyphenation patterns
simply does not expect a multi-byte character encoding.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/env.cpp?h=1.23.0#n3856

> These are the requests that I used for hyphenation originally:
> .hla hu
> .hpf /home/gergo/projekt/jella/huhyphn.tex
> .hy 1

This is good, but incomplete, setup.  As you noted below, `hcode`
requests to assign hyphenation codes to non-ASCII code points are also
necessary.

It will also help to tell GNU troff to use an accurate input character
mapping.

.mso latin2.tmac

I would insert that request before the `hpf` request, and invoke `hcode`
requests _before_ `hpf`, but I will admit I haven't experimented with
arbitrary arrangements of these to see what breaks hyphenation support.

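A minimal sketch of the recoding step suggested above, assuming iconv(1)
is available and that the pattern file really is UTF-8.  The function
name and the output file name are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: recode a UTF-8 TeX hyphenation pattern file to
# ISO 8859-2 so that each accented letter occupies a single byte.
# Usage: recode_patterns INPUT OUTPUT
recode_patterns() {
    # iconv exits with a nonzero status if the input contains a
    # character that has no ISO 8859-2 equivalent.
    iconv -f UTF-8 -t ISO-8859-2 "$1" > "$2"
}

# Example (both file names are assumptions):
# recode_patterns huhyphn.tex huhyphn.latin2.tex
```

If iconv complains about an unconvertible character, the file contains
something outside Latin-2 (possibly only in a comment) that needs
attention before GNU troff can read it.
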
I will say with some confidence that `hla` should come first, because
the hyphenation language is a property of the environment, and is the
way you tell GNU troff which language you'll be defining hyphenation
codes and patterns for.  Due to that property and because you can change
environments, it's not difficult to support multilingual documents that
apply appropriate hyphenation rules to the input language.

> The manual told me that
>
> > A hyphenation code must be an ordinary character (not a special
> > character escape sequence) other than a digit or a space.

Yes.  In groff 1.24, loading "latin2.tmac" is what will _make_ input
characters like ű "ordinary" (otherwise they're out of range and get
warned about).

$ ~/groff-stable/bin/groff --version | head -n 1
GNU groff version 1.23.0
$ printf '\370\n' | ~/groff-stable/bin/groff -a -winput
<beginning of page>
</o>
$ groff --version | head -n 1
GNU groff version 1.23.0.1589-aa44
$ printf '\370\n' | groff -a -winput
<beginning of page>
troff:<standard input>:1: warning: character with input code 248 not defined

I haven't decided yet, but 1.24's default might be to _not_ load
"latin1.tmac", since English is the default language and most English
doesn't need Latin-1 characters, but more importantly to get users into
position for a conversion to UTF-8 support in, I hope, groff 1.25.

I don't think there is time given current developer resources to get
groff to UTF-8 from end to end in one fell swoop in time for a release
this year, and also, assuming that one is going from an 8-bit character
encoding to UTF-8 leads to mojibake.  Because UTF-8 is compatible with
7-bit ASCII, if groff 1.24 gets people to either (1) use ASCII or (2)
make their documents (or their system, via "troffrc" or similar)
_declare the default encoding they expect_, we can avoid frustrating
them later.

See <https://savannah.gnu.org/bugs/?65955>.

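Putting the ordering advice above together, a preamble might look like
the following sketch.  As noted, other arrangements are untested.  The
function name, the output file name, and the recoded pattern file name
are assumptions, and the `hcode` requests themselves are left as a
placeholder comment.

```shell
#!/bin/sh
# Hypothetical helper: write a roff preamble that applies the suggested
# request order.  Usage: write_hu_preamble OUTPUT
write_hu_preamble() {
    cat > "$1" <<'EOF'
.\" Hyphenation language first: it is a property of the environment.
.hla hu
.\" Declare the input character mapping before loading patterns.
.mso latin2.tmac
.\" hcode requests for the accented letters would go here, before hpf.
.\" Pattern file name is an assumption (a Latin-2 recoded copy).
.hpf huhyphn.latin2.tex
.hy 1
EOF
}

# Example (file name is an assumption):
# write_hu_preamble hu-setup.roff
```

The document itself would then need to be in ISO 8859-2 as well, so that
the bytes it contains match what "latin2.tmac" declares.
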
> So I tried following the example, with these requests:
> .hcode á á Á á
> .hcode é é É é
> .hcode í í Í í
> .hcode ó ó Ó ó
> .hcode ö ö Ö ö
> .hcode ő ő Ő ő
> .hcode ú ú Ú ú
> .hcode ü ü Ü ü
> .hcode ű ű Ű ű

I expect that to work if you (1) get the hyphenation patterns recoded to
Latin-2 and (2) load "latin2.tmac" before attempting to format text.

> However, groff throws errors saying "error: hyphenation code must be
> ordinary character".

If you're using groff 1.23 or earlier, I'm a little surprised by this.
If you're using groff Git's master branch of recent vintage, this is
what I would expect.

> I tried with and without preconv to no avail.

preconv(1) is important in general for non-English language support in
groff, but does not apply to this situation because hyphenation pattern
files are parsed independently of text formatting and input processing.

> The example supplied in the manual, with German characters, didn't
> work either.

That is also a surprise, if you're running groff 1.23.  I will attempt
to reproduce this.  In Git, it's not a surprise and is a consequence of
me not getting around to updating the documentation yet once I started
down the road of resolving Savannah ticket #65724.

It _should_ work if you load the German localization file.

.mso de.tmac

Or "groff -mde".  But that will also perform the `hcode` requests for
you.

> What could be the problem here?

I hope the foregoing has shed some light.

If you resolve these issues you'll be well along the path to being able
to contribute Hungarian language support to groff, which I would be
interested to integrate.  The main items remaining would, I think, be:

1.  Select appropriate automatic hyphenation modes (these are dictated
    by the TeX hyphenation patterns).

    https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/LOCALIZATION

2.  Translate some stock words and phrases used by our macro packages.
    Here are the German ones, for example.

    https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/de.tmac?h=1.23.0#n39

Regards,
Branden