Re: groff supports Italian input documents now

G. Branden Robinson Fri, 02 Jul 2021 19:50:29 -0700

Hi, Oliver!

At 2021-07-02T15:38:38+0200, Oliver Corff wrote:
> Hi Branden,
> 
> may I ask an obviously ignoramus question: What would the settings for
> Italian be different from other languages, except for hyphenation
> rules and perhaps the proper choice of quotation marks, decimal
> separators, names of dates and fixed headlines ("References",
> "Abstract" etc.)?


You've pretty much covered it, if we swap in "inter-sentence spacing
amount" for "decimal separator".  It seems that the EU has standardized
on "no additional inter-sentence space" in its typography, so our Czech,
German, French, Italian, and Swedish localization files all say
        .ss 12 0
.

The vast bulk of groff localization is concerned with two things:
localization of strings provided by macro packages, and setting up
hyphenation patterns.

There is a brief how-to document in the groff source tree[1].  I updated
it earlier this year, and Edmond Orignac's contribution of Italian
prompted me to further improve it in the past few days.

We see little in localization files about the decimal separator or
quotation marks.  roff systems pretty much stick to integer arithmetic
and at the level of request syntax and diagnostic output, groff is not
localized anyway.  The historical full-service macro packages seem not
to have been concerned with abstracting quotation, probably because of
the limited glyph repertoire available when they were first developed.
mdoc does, but since its domain is man pages it tends to encode glyph
identities into the semantics of its macro calls (e.g., "Sq" for "single
quote this").

> How does groff a) detect that the input file is Italian

Important to note here--it doesn't.  groff doesn't detect this--it has
to be told.

> and b) decide which settings to apply?

I revamped groff input localization a few months ago.  It occurred to me
that the mechanism groff had innovated for this purpose (specify options
like -mfr for French) was duplicative of an existing and much more
widely understood infrastructure for tackling such issues: locale(7).

Here's the relevant NEWS item (recently updated in one detail).

---
o The groff locale (the default input language) is now determined using
  the system locale.  The LC_ALL and LANG environment variables are
  checked, in that order.  If set, the value's first two characters
  determine the groff locale.  If these variables are not set, if the
  first one found is set to "C", or if no groff localization file exists
  for the language, groff falls back to English, loading en.tmac.

  Those who want groff's default locale to differ from LC_ALL/LANG
  should edit the troffrc file to source the appropriate groff locale
  macro file (cs.tmac, de.tmac, den.tmac, fr.tmac, it.tmac, ja.tmac,
  sv.tmac, zh.tmac).

  The default hyphenation mode (as used by the .hy request) for users of
  English is thus changed from "1", which was inappropriate for the
  TeX-based hyphenation patterns groff has used since at least 1991, to
  "4".  However, invoking .hy without an argument remains synonymous
  with ".hy 1".
---
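
To make the lookup order concrete, here is a minimal shell sketch of the
rule the NEWS item states.  It is a model for illustration only, not
groff's own code.  The function name groff_locale_for is hypothetical,
treating an empty variable as unset is my assumption, and the language
list is simply the two-letter macro file names mentioned above plus
"en".

```shell
#!/bin/sh
# Sketch only: models the lookup described in the NEWS item.
# groff_locale_for is a hypothetical helper, not part of groff.
# Usage: groff_locale_for "$LC_ALL" "$LANG"
groff_locale_for() {
    lc_all=$1
    lang=$2
    # LC_ALL is checked first, then LANG.
    # Assumption: an empty value counts as "not set".
    value=${lc_all:-$lang}
    # Only the first two characters of the value matter.
    prefix=$(printf '%s' "$value" | cut -c1-2)
    case $prefix in
        # Languages with a two-letter localization file.
        cs|de|en|fr|it|ja|sv|zh) printf '%s\n' "$prefix" ;;
        # Unset, "C", or an unknown language: fall back to English.
        *) printf 'en\n' ;;
    esac
}
```

With this model, 'groff_locale_for "" it_IT' selects "it", whereas
'groff_locale_for C it_IT' selects "en", because LC_ALL is found first
and is set to "C".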

I have anticipated, but not yet heard, a protest along the lines that
just because a (for instance) French document is being typeset, the user
might not want to change their locale to begin with "fr".  Did I not
consider the impact on LC_MESSAGES and the possibility of unwelcome
diagnostic messages in French from the groff pipeline?

The answer I prepared for this is a simple one: LC_MESSAGES doesn't
matter because none of the programs or libraries in the GNU roff project
have localized diagnostic messages.  Moreover, I've never seen a demand
for this expressed.  I reckon I've read every open Savannah ticket
against groff at least once by now[2]; I haven't consumed the entire
archived history of this mailing list, but I nevertheless surmise that
it has seldom if ever been requested.

My expectation is that people who prefer a "C" or English locale for
most of their shell interactions can keep it, and specify an environment
variable on a per-command basis as needed to format sundry non-English
documents.
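
As a small illustration of "per-command basis", the following sketch
shows that a variable assignment placed in front of a command is seen
only by that command; the shell's own setting is untouched.  The child
here is a stand-in "sh -c" rather than groff so that the snippet runs
anywhere, and the file name in the comment is hypothetical.

```shell
#!/bin/sh
# The interactive default: a "C" locale for everyday shell use.
LANG=C
export LANG

# Per-command assignment: only the child process sees it_IT.
# With groff this would look like (hypothetical file name):
#   LANG=it_IT groff -k -Tutf8 lettera.roff
child=$(LANG=it_IT sh -c 'printf "%s" "$LANG"')

printf 'child saw %s, shell still has %s\n' "$child" "$LANG"
```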

> I ask because if I typeset a German document then I usually just
> insert the request for German hyphenation at the beginning, and I have
> one string variable which encloses the argument in proper quotation
> marks.  That's more or less enough to get going. Would be nice,
> though, to make groff autodetect German and set the appropriate
> requests.

There's nothing wrong with the way you're doing things, especially if
you're not using the "me", mm, mom, or ms packages.  It sounds like the
only thing you're missing, and maybe you just didn't mention it, is the
aforementioned ".ss 12 0" request.

> Sorry for asking if this is obvious to everybody (but me).

I suspect that groff's support for localized input documents was not all
that obvious in 1.22.4 or earlier, and likely still isn't.

Let me dissect the example I posted earlier; I packed several salient
points into it.

$ file EXPERIMENTS/italian.roff
EXPERIMENTS/italian.roff: troff or preprocessor input, UTF-8 Unicode text
$ LANG=it_IT ./build/test-groff -b -ww -k EXPERIMENTS/italian.roff
L’Italia  è  diventata  un mercato di sfruttamento coloniale, una
[...]
che non è compromessa nell’avventura della guerra, che non ha ab‐

A. I ran "file" because I wanted to establish that the input was not in
   a legacy input encoding.  Much of the existing groff documentation is
   preoccupied with input encoding issues.
B. I specified "-k" to groff so that it would run preconv, converting
   non-ASCII code points in the input to groff Unicode special character
   escapes.
C. Instead of saying something like "groff -mit", we can use a standard
   environment variable to assert the locale.  For groff's purposes,
   simply "LANG=it" will suffice.
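
The check in point A can also be scripted.  The sketch below swaps in a
different technique from the "file" command I actually ran: it asks
iconv to read the input as UTF-8 and reports failure if that is not
possible, which is what one would expect of a Latin-1 file containing
accented letters.  The function name is_utf8 is hypothetical, and the
sketch assumes an iconv that exits with nonzero status on invalid input.

```shell
#!/bin/sh
# Sketch only: is_utf8 is a hypothetical helper, not part of groff.
# Succeeds if the file decodes cleanly as UTF-8 (plain ASCII also passes).
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Example use before formatting a document:
#   is_utf8 EXPERIMENTS/italian.roff || echo "legacy encoding?" >&2
```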

According to my experiments, I don't need the following in the it.tmac
file.

.hcode á á  Á á
.hcode à à  À à
.hcode è è  È è
.hcode é é  É é
.hcode í í  Í í
.hcode ì ì  Ì ì
.hcode ó ó  Ó ó
.hcode ò ò  Ò ò
.hcode ú ú  Ú ú
.hcode ù ù  Ù ù

The contributor, Edmond Orignac, seemed uncertain as to whether
any .hcode requests were necessary, and I am starting to think they
aren't.  They exist in other localization macro files because the
hyphenation pattern files (from TeX) contain non-ASCII code points, so
the pattern file parser[3] has to be told how to interpret them.  The
ones for Italian don't contain such code points.

That would in turn mean that we don't need this in it.tmac either:

.\" Default encoding
.mso latin1.tmac

Removing it would move us one small step toward the future that Werner
Lemberg envisioned 21 years ago[4].

Unfortunately, just updating all of our other pattern files, while a good
thing to do on other grounds, won't get us much closer because (except
for English) they continue to contain non-ASCII code points.  Worse,
they're UTF-8-encoded now, and our pattern file parser doesn't know how
to handle that.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/LOCALIZATION
[2] though Dave Kemper's recall of them is superior to mine
[3] src/roff/troff/env.cpp:3784-3989
[4] https://savannah.gnu.org/bugs/?60536
