On 17-Oct-00 Werner LEMBERG wrote:
> Well, I insist that GNU troff doesn't support multi-byte encodings at
> all :-) troff itself should work on a glyph basis only. It has to
> work with *glyph names*, be it CJK entities or whatever. Currently,
> the conversion from input encoding to glyph entities and the further
> processing of glyphs is not clearly separated. From a modular point
> of view it makes sense if troff itself is restricted to a single input
> encoding (UTF-8) which is basically only meant as a wrapper to glyph
> names (cf. \U'' to enter Unicode encoded characters). Everything
> else should be moved to a preprocessor.
Hi Folks,
I have now managed to read this correspondence (16-20 Oct) and think
about it for a bit. Sorry to have been obliged to leave it on one
side until this week ended.
On the whole I go with the view that Werner has expressed, here and
in other mails. Also, I think that some people are not clear about
the distinction between "character" and "glyph" (not sure that I am,
in all cases, come to that ... ).
I would like to present an even more conservative view than Werner
has stated. And, by the way, in the following "troff" means the
main formatting program "gtroff". "groff" denotes the whole package.
A.1. At present troff accepts 8-bit input, i.e. recognises 256 distinct
entities in the input stream (with a small number of exceptions which
are "illegal").
It does not really matter that these are interpreted, by default, as
iso-latin-1. They could correspond to anything on your screen when you
are typing, and you can set up translations in troff to make them
correspond to anything else (using the ".char" request, the
traditional ".tr" request, or the new ".trnt" request).
A.2. The direct correspondence between input bytes and characters is
defined in the font files for the device. In addition, groups of bytes
(such as the ASCII sequence "\[Do]") can be made to correspond to
specific characters named in the font files.
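As an illustration, an entry in the charset section of a font
description file might look something like this (the metrics are
purely illustrative; only the layout of the fields matters here):

  charset
  $    500,675,10   3   36
  Do   "

The fields are glyph name, metrics (width,height,depth), glyph type,
and the code the post-processor uses; the second line declares "Do"
as an alias, so that \[Do] in the input selects the same glyph.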
A.3. What gets printed or displayed is a "glyph" which is defined by the
current font definition for the device. (Even in English, a character
such as "A" could be printed as a Times-Roman "A" glyph, a Helvetica
BoldItalic "A" glyph, \ ZapfChancery-MediumItalic glyph, ... ).
Troff uses the glyph-metric information in the font file to compute
its formatting.
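A small sketch of that, using the usual devps font names TR and ZCMI
(Times-Roman and ZapfChancery-MediumItalic):

  .\" measure the same character in two fonts; the widths differ,
  .\" and it is these widths that drive troff's line filling
  .ft TR
  .nr wa \w'A'
  .ft ZCMI
  .nr wb \w'A'
  .tm Times A: \n[wa] units; Chancery A: \n[wb] units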
A.4. Troff is not, and was never intended to be, WYSIWYG. Its concept
is that you prepare an input stream (using whatever interface pleases
you, and if this shows you, say, kanji characters then that's fine,
so long as you don't expect troff to "see" them as kanji) which,
when interpreted by troff, produces printed/displayed output which
bears the marks that you want. I don't see anything wrong (except
possibly in ease of use) in creating an ASCII input stream in
order to generate Japanese output. Preparation of an output
stream to drive a device capable of rendering the output is
the job of the post-processor (and, provided you have installed
appropriate font definition files, I cannot think of anything
that would be beyond the PostScript device "devps").
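So the whole pipeline, sketched with a hypothetical file name, is
simply

  troff -Tps report.tr | grops > report.ps

where troff turns glyph names into its device-independent
intermediate output and grops (the devps post-processor) turns that
into PostScript.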
A: It follows that troff is already language-independent, for all
languages whose typographic conventions can be achieved by the primitive
mechanisms already present in troff. For such languages, there is no
need to change troff at all. For some other languages, there are
minor extra requirements calling for small extensions to troff
that would not interact with existing mechanisms.
Major exceptions to language-independence, at present, include all
the right-to-left languages (Hebrew, Arabic, ... ). I have been
studying Dan Berry's implementation of "ffortid" ["ditroff" backwards]
which is a post-processor that allows right-to-left text to be
correctly printed. I believe that a port to groff is quite feasible.
Dan Berry has also done "triroff" [tri-directional troff] for traditional
UNIX troff which can in addition do the top-to-bottom printing for
Chinese etc. To my untutored eye, the results look OK. This could also be
ported to groff.
Extra complications can arise in some languages, such as special
hyphenation rules (as has been mentioned); presence or absence of
particular ligatures [and I think that troff's hard-wired set
of ligatures should be replaced by a user-definable set] (e.g. in
Turkish you never use the "fi" ligature, since it suppresses the
distinction between dotless-i and i-with-dot); some characters may
not end, or may not begin, a line; some characters have different glyphs
at the beginning, the middle, or the end of words; and so on.
The above are cases where minor extensions of troff are required,
but they do not interact with other features of troff and require
no radical re-implementation.
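As a sketch of what can be done today (the pattern file name below
is hypothetical, and there is as yet no per-ligature control, only
an on/off switch):

  .\" switch the built-in ligatures off wholesale, e.g. for Turkish
  .lg 0
  .\" load language-specific hyphenation patterns from a file
  .hpf hyphen.tr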
Some of the complications with specific languages (such as the extra
space separating punctuation marks in French) can be set up on
a language-specific basis by suitable macros, and require no change at
all in troff.
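For instance, the French convention could be sketched with a string
(the name "fsp" is arbitrary; \| is a sixth-of-an-em space that
neither stretches nor allows a line break):

  .ds fsp \|
  Qu'en pensez-vous\*[fsp]?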