On 17-Oct-00 Werner LEMBERG wrote:
> Well, I insist that GNU troff doesn't support multi-byte encodings at
> all :-) troff itself should work on a glyph basis only. It has to
> work with *glyph names*, be it CJK entities or whatever. Currently,
> the conversion from input encoding to glyph entities and the further
> processing of glyphs is not clearly separated. From a modular point
> of view it makes sense if troff itself is restricted to a single input
> encoding (UTF-8) which is basically only meant as a wrapper to glyph
> names (cf. \U'xxxx' to enter Unicode encoded characters). Everything
> else should be moved to a preprocessor.

Hi Folks,

I have now managed to read this correspondence (16-20 Oct) and think about it for a bit. Sorry to have been obliged to leave it on one side until this week ended.

On the whole I go with the view that Werner has expressed, here and in other mails. Also, I think that some people are not clear about the distinction between "character" and "glyph" (not sure that I am, in all cases, come to that ...).

I would like to present an even more conservative view than Werner has stated. And, by the way, in the following "troff" means the main formatting program "gtroff"; "groff" denotes the whole package.

A.1. At present troff accepts 8-bit input, i.e. recognises 256 distinct entities in the input stream (with a small number of exceptions which are "illegal"). It does not really matter that these are interpreted, by default, as iso-latin-1. They could correspond to anything on your screen when you are typing, and you can set up translation macros in troff to make them correspond to anything else (using either the ".char" request, the traditional ".tr" request or the new ".trnt" request).

A.2. The direct correspondence between input bytes and characters is defined in the font files for the device. In addition, groups of bytes (such as, represented in ASCII, "\[Do]") can be made to correspond to specific characters named in the font files.

A.3. What gets printed or displayed is a "glyph" which is defined by the current font definition for the device. (Even in English, a character such as "A" could be printed as a Times-Roman "A" glyph, a Helvetica BoldItalic "A" glyph, a ZapfChancery-MediumItalic "A" glyph, ...). Troff uses the glyph-metric information in the font file to compute its formatting.

A.4. Troff is not, and was never intended to be, WYSIWYG.
Its concept is that you prepare an input stream (using whatever interface pleases you, and if this shows you, say, kanji characters then that's fine, so long as you don't expect troff to "see" them as kanji) which, when interpreted by troff, produces printed/displayed output which bears the marks that you want. I don't see anything wrong (except possibly in ease of use) in creating an ASCII input stream in order to generate Japanese output. Preparation of an output stream to drive a device capable of rendering the output is the job of the post-processor (and, provided you have installed appropriate font definition files, I cannot think of anything that would be beyond the PostScript device "devps").

A: It follows that troff is already language-independent, for all languages whose typographic conventions can be achieved by the primitive mechanisms already present in troff. For such languages, there is no need to change troff at all. For some other languages, there are minor extra requirements which would require small extensions to troff which would not interact with existing mechanisms.

Major exceptions to language-independence, at present, include all the "right-to-left" languages (Hebrew, Arabic, ...). I have been studying Dan Berry's implementation of "ffortid" ["ditroff" backwards], which is a post-processor that allows right-to-left text to be correctly printed. I believe that a port to groff is quite feasible. Dan Berry has also done "triroff" [tri-directional troff] for traditional UNIX troff, which can in addition do the top-to-bottom printing for Chinese etc. To my untutored eye, the results look OK. This could also be ported to groff.

Extra complications can arise in some languages, such as special hyphenation rules (as has been mentioned); presence or absence of particular ligatures [and I think that troff's hard-wired set of ligatures should be replaced by a user-definable set] (e.g.
in Turkish you never use the "fi" ligature since this suppresses the distinction between dotless-i and i-with-dot); some characters may not end, or may not begin, a line; some characters have different glyphs at the beginning, the middle, or the end of words; and so on.

The above are cases where minor extensions of troff are required, but they do not interact with other features of troff and require no radical re-implementation.

Some of the complications with specific languages (such as the extra space separating punctuation marks in French) can be set up on a language-specific basis by suitable macros, and require no change at all in troff itself.

B: Troff should be able to cope with multi-lingual documents, where several different languages occur in the same document.

I do NOT believe that the right way to do this is to extend troff's capacity to recognise thousands of different input encodings covering all the languages which it might be called upon to typeset (e.g. by Unicode or the like).

Troff's multi-character naming convention means that anything you could possibly need can be defined, and given a name in the troff input "character set" whenever you really need it, so long as you have the device resources to render the appropriate glyph.

If you want to use a multi-byte encoding in your input-preparation software, you can pre-process this with a suitable filter to generate the troff input-sequences you need (I have done this with WordPerfect multinational characters, for instance, which are two-byte entities).

C: Error messages and similar communications with the user (which have nothing directly to do with troff's real job) are irrelevant to the question of revising groff. If people would like these to appear in their own language then I'm sure it can be arranged in a way which would require no change whatever in the fundamental workings of troff.

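To make the pre-processing idea under B concrete, here is a minimal sketch of such a filter in Python. It is only an illustration, not anything that exists in groff or in this thread: it assumes the input text is Unicode (e.g. decoded from UTF-8) and that the troff in use accepts glyph names of the form \[uXXXX], as later groff releases do. The names to_troff and filter_stream are made up for the example.

```python
import sys

def to_troff(text):
    """Return text with every non-ASCII character replaced by \\[uXXXX]."""
    pieces = []
    for ch in text:
        code = ord(ch)
        if code < 0x80:
            pieces.append(ch)  # plain ASCII (including troff escapes) passes through
        else:
            # Assumed glyph-name form: "u" plus at least four uppercase hex digits.
            pieces.append("\\[u%04X]" % code)
    return "".join(pieces)

def filter_stream(src, dst):
    """Copy src to dst line by line, converting each line on the way."""
    for line in src:
        dst.write(to_troff(line))

# To run this as a real filter one could call, for example:
#   filter_stream(open(sys.stdin.fileno(), encoding="utf-8"), sys.stdout)
print(to_troff("na\u00efve caf\u00e9"))
```

The point of the sketch is that troff itself only ever sees ASCII plus named glyphs; whether anything is actually printed for a given name still depends on the font files for the output device.
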
CONCLUSION: Troff certainly needs some extensions to cope with the typesetting demands of some languages (of which the major ones that I can think of have been mentioned above).

I also believe that there are some features of troff which need to be changed in any case, but these have nothing to do with language or "locale".

Apart from this, I believe that troff has all the primitive functionality needed to cope with different languages and that any user can define their own resources for specific languages (including multi-lingual documents).

There is certainly a strong argument for people who are expert both in troff and in specific languages to prepare _definitive_ language-specific resources, rather than have different users all doing different and more-or-less adequate jobs on their own; but that is another issue and still does not involve any radical re-design of groff.

Therefore, I suggest, troff can basically be left alone; it does not need radical re-implementation.

Best wishes to all,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 284 7749
Date: 20-Oct-00                                       Time: 20:32:16
------------------------------ XFMail ------------------------------

