Hi, Oliver!

At 2021-07-02T15:38:38+0200, Oliver Corff wrote:
> Hi Branden,
>
> may I ask an obviously ignoramus question: How would the settings for
> Italian differ from those for other languages, except for hyphenation
> rules and perhaps the proper choice of quotation marks, decimal
> separators, names of dates and fixed headlines ("References",
> "Abstract" etc.)?

You've pretty much covered it, if we swap in "inter-sentence spacing
amount" for "decimal separator". It seems that the EU has standardized
on "no additional inter-sentence space" in its typography, so our Czech,
German, French, Italian, and Swedish localization files all say
".ss 12 0".

The vast bulk of groff localization is concerned with two things:
localization of strings provided by macro packages, and setting up
hyphenation patterns. There is a brief how-to document in the groff
source tree[1]. I updated it earlier this year, and Edmond Orignac's
contribution of Italian prompted me to further improve it in the past
few days.

We see little in localization files about the decimal separator or
quotation marks. roff systems pretty much stick to integer arithmetic,
and at the level of request syntax and diagnostic output, groff is not
localized anyway. The historical full-service macro packages seem not
to have been concerned with abstracting quotation, probably because of
the limited glyph repertoire available when they were first developed.
mdoc does abstract it, but since its domain is man pages it tends to
encode glyph identities into the semantics of its macro calls (e.g.,
"Sq" for "single quote this").

> How does groff a) detect that the input file is Italian

It doesn't; it has to be told.

> and b) decide which settings to apply?

I revamped groff input localization a few months ago. It occurred to me
that the mechanism groff had innovated for this purpose (specify options
like -mfr for French) was duplicative of an existing and much more
widely understood infrastructure for tackling such issues: locale(7).

Here's the relevant NEWS item (recently updated in one detail).

---
o  The groff locale (the default input language) is now determined using
   the system locale. The LC_ALL and LANG environment variables are
   checked, in that order. If set, the value's first two characters
   determine the groff locale.
   If these variables are not set, if the first one found is set to "C",
   or if no groff localization file exists for the language, groff falls
   back to English, loading en.tmac.

   Those who want groff's default locale to differ from LC_ALL/LANG
   should edit the troffrc file to source the appropriate groff locale
   macro file (cs.tmac, de.tmac, den.tmac, fr.tmac, it.tmac, ja.tmac,
   sv.tmac, zh.tmac).

   The default hyphenation mode (as used by the .hy request) for users
   of English is thus changed from "1", which was inappropriate for the
   TeX-based hyphenation patterns groff has used since at least 1991, to
   "4". However, invoking .hy without an argument remains synonymous
   with ".hy 1".
---

I have anticipated, but not yet heard, a protest along the lines that
just because a (for instance) French document is being typeset, the user
might not want to change their locale to begin with "fr". Did I not
consider the impact on LC_MESSAGES and the possibility of unwelcome
diagnostic messages in French from the groff pipeline?

The answer I prepared for this is a simple one: LC_MESSAGES doesn't
matter because none of the programs or libraries in the GNU roff project
have localized diagnostic messages. Moreover, I've never seen a demand
for this expressed. I reckon I've read every open Savannah ticket
against groff at least once by now[2]; I haven't consumed the entire
archived history of this mailing list, but I nevertheless surmise that
localized diagnostics have seldom if ever been requested.

My expectation is that people who prefer a "C" or English locale for
most of their shell interactions can keep it, and specify an environment
variable on a per-command basis as needed to format sundry non-English
documents.

> I ask because if I typeset a German document then I usually just
> insert the request for German hyphenation at the beginning, and I have
> one string variable which encloses the argument in proper quotation
> marks. That's more or less enough to get going.
> Would be nice, though, to make groff autodetect German and set the
> appropriate requests.

There's nothing wrong with the way you're doing things, especially if
you're not using the "me", mm, mom, or ms packages. It sounds like the
only thing you're missing, and maybe you just didn't mention it, is the
aforementioned ".ss 12 0" request.

> Sorry for asking if this is obvious to everybody (but me).

I suspect that groff's support for localized input documents was not all
that obvious in 1.22.4 or earlier, and likely still isn't.

Let me dissect the example I posted earlier; I packed several salient
points into it.

$ file EXPERIMENTS/italian.roff
EXPERIMENTS/italian.roff: troff or preprocessor input, UTF-8 Unicode text
$ LANG=it_IT ./build/test-groff -b -ww -k EXPERIMENTS/italian.roff
L’Italia è diventata un mercato di sfruttamento coloniale, una
[...]
che non è compromessa nell’avventura della guerra, che non ha ab‐

A. I ran "file" because I wanted to establish that the input was not in
   a legacy input encoding. Much of the existing groff documentation is
   preoccupied with input encoding issues.

B. I specified "-k" to groff so that it would run preconv, converting
   non-ASCII code points in the input to groff Unicode special character
   escapes.

C. Instead of saying something like "groff -mit", we can use a standard
   environment variable to assert the locale. For groff's purposes,
   simply "LANG=it" will suffice.

According to my experiments, I don't need the following in the it.tmac
file.

.hcode á á Á á
.hcode à à À à
.hcode è è È è
.hcode é é É é
.hcode í í Í í
.hcode ì ì Ì ì
.hcode ó ó Ó ó
.hcode ò ò Ò ò
.hcode ú ú Ú ú
.hcode ù ù Ù ù

The contributor, Edmond Orignac, seemed uncertain as to whether any
.hcode requests were necessary, and I am starting to think they aren't.
They exist in other localization macro files because the hyphenation
pattern files (from TeX) contain non-ASCII code points, so the pattern
file parser[3] has to be told how to interpret them.
The ones for Italian don't contain such code points.

That would in turn mean that we don't need this in it.tmac either:

.\" Default encoding
.mso latin1.tmac

Removing it would move us one small step toward the future that Werner
Lemberg envisioned 21 years ago[4].

Unfortunately just updating all of our other pattern files, while a good
thing to do on other grounds, won't get us much closer because (except
for English) they continue to contain non-ASCII code points. Worse,
they're UTF-8-encoded now, and our pattern file parser doesn't know how
to handle that.

Regards,
Branden

[1] https://git.savannah.gnu.org/cgit/groff.git/tree/tmac/LOCALIZATION
[2] though Dave Kemper's recall of them is superior to mine
[3] src/roff/troff/env.cpp:3784-3989
[4] https://savannah.gnu.org/bugs/?60536
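
As an aside, the locale fallback rules quoted in the NEWS item above can
be sketched in a few lines of Python. This is only an illustration of
the rules as worded there, not groff's actual implementation. The
function name pick_groff_locale and the AVAILABLE set are hypothetical,
and treating an empty variable as "not set" is an assumption.

```python
# Hypothetical sketch of the fallback rules in the NEWS item; not groff's code.
# Two-letter localization files named in the NEWS item, plus English.
# den.tmac is left out: a two-character prefix can never select it.
AVAILABLE = {"cs", "de", "en", "fr", "it", "ja", "sv", "zh"}


def pick_groff_locale(env):
    """Return the localization file name implied by `env`, a mapping of
    environment variable names to their values."""
    for name in ("LC_ALL", "LANG"):   # checked in this order
        value = env.get(name)
        if value:                     # assumption: empty counts as unset
            if value == "C":          # first variable found is "C"
                return "en.tmac"
            lang = value[:2]          # only the first two characters matter
            if lang in AVAILABLE:
                return lang + ".tmac"
            return "en.tmac"          # no localization file for this language
    return "en.tmac"                  # neither variable is set


print(pick_groff_locale({"LANG": "it_IT"}))
```

Passing os.environ instead of a literal dictionary would report what the
rules imply for the current shell, which is the per-command override
("LANG=it groff ...") described earlier.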