Re: [Groff] Re: groff: radical re-implementation
Hi,

At Fri, 20 Oct 2000 20:32:17 +0100 (BST),
(Ted Harding) <[EMAIL PROTECTED]> wrote:

> It does not really matter that these are interpreted, by default, as
> iso-latin-1. They could correspond to anything on your screen when you
> are typing, and you can set up translation macros in troff to make them
> correspond to anything else (using either the ".char" request or the
> traditional ".tr" or the new ".trnt" requests).

Since I am not familiar with the internals of troff, I don't know
whether this works well for multibyte encodings such as EUC-*, UTF-8,
and so on.  However, there are also encodings which have shift states
introduced by escape sequences, such as ISO-2022-JP, ISO-2022-CN,
ISO-2022-INT-1, and so on.  Such encodings cannot be handled by
".char".  Do you have any positive reason not to support such
encodings, even though they can easily be supported using standard
"locale" technology?

Without locales, we have to prepare many font files, one for every
encoding in the world, which is inefficient.  Moreover, what is the
mechanism to choose the proper font file for the desired encoding?
I think the answer is "locale" technology: Groff can check the
LC_CTYPE category.  Thus, all a user has to do is set the LANG,
LC_CTYPE, or LC_ALL environment variable, and all software will work
in the needed encoding.
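To make "Groff can check the LC_CTYPE category" concrete, here is a
minimal sketch in C++, using only standard C/POSIX library calls.  It
is an illustration of the mechanism, not code from groff or from any
preprocessor discussed in this thread; note that mbrtowc()'s mbstate_t
argument is exactly what absorbs the shift states of encodings such as
ISO-2022-JP.

  /* Sketch: encoding-independent input via the LC_CTYPE locale.
     All names below are standard C/POSIX; this is not groff code. */
  #include <locale.h>
  #include <langinfo.h>   /* nl_langinfo(CODESET), POSIX */
  #include <stdio.h>
  #include <wchar.h>

  int main()
  {
      /* Adopt the user's encoding from LANG/LC_CTYPE/LC_ALL. */
      setlocale(LC_CTYPE, "");
      printf("input encoding: %s\n", nl_langinfo(CODESET));

      /* mbrtowc() decodes any multibyte encoding the C library
         knows, including stateful ones such as ISO-2022-JP: the
         shift state lives in the mbstate_t object.  On systems
         defining __STDC_ISO_10646__, the resulting wchar_t values
         are Unicode. */
      mbstate_t state = mbstate_t();
      int c;
      while ((c = getchar()) != EOF) {
          char byte = (char)c;
          wchar_t wc;
          size_t r = mbrtowc(&wc, &byte, 1, &state);
          if (r == (size_t)-2)        /* byte absorbed; need more */
              continue;
          if (r == (size_t)-1) {      /* invalid input: resync */
              state = mbstate_t();
              continue;
          }
          printf("U+%04lX\n", (unsigned long)wc);
      }
      return 0;
  }

Run in an en_US.UTF-8 locale and in an EUC-JP locale, this yields the
same code points from differently encoded input, which is the point of
the mechanism.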
However, I am interested in how Groff 1.16 works for UTF-8 input.  I
could not find any code for UTF-8 input, though I found code for UTF-8
output in src/devices/grotty/tty.cc.  Am I missing something?
(Of course, font/devutf8/* has no implementation of the UTF-8 encoding
itself, though it does seem to have a table mapping glyph names to
UCS-2.)

> B: Troff should be able to cope with multi-lingual documents, where
> several different languages occur in the same document. I do NOT
> believe that the right way to do this is to extend troff's capacity
> to recognise thousands of different input encodings covering all the
> languages which it might be called upon to typeset (e.g. by Unicode or
> the like).

This is a confusion of language and encoding.  I think UTF-8 support
can and should be achieved via locale technology.  By using locale
technology, software can be written in an encoding-independent way,
and will then support any encoding, including UTF-8.  Why should we
hard-code UTF-8, when we can use locale technology to support any
encoding, UTF-8 among them?

I agree that we should not extend troff to recognize thousands of
encodings if that meant hard-coding every encoding; but it does not.
Again, do you have any positive reason not to support encodings other
than UTF-8?  We have many systems using encodings such as EUC-* and
ISO-2022-*.  Some systems will migrate to UTF-8 soon; some will
migrate after a certain time; users of UTF-8 and of EUC-* may share a
system; and some will never migrate to UTF-8.  Groff should support
all of them.  I think you would also be in trouble if you could not
use 8-bit encodings.

> C: Error messages and similar communications with the user (which
> have nothing directly to do with troff's real job) are irrelevant to
> the question of revising groff. If people would like these to appear in
> their own language then I'm sure it can be arranged in a way which
> would require no change whatever in the fundamental workings of troff.

This is a different topic, and you are right: we can use gettext to
handle translation, without any radical re-implementation of troff or
of the Groff system.  However, IMO, this topic is much less important.
You were reminded of it because I used the word "locale", weren't you?
Sure, locale is also used to change the language of messages (the
LC_MESSAGES category); however, what I am discussing is the LC_CTYPE
category.  Supporting "locale" technology (especially the LC_CTYPE
category) is important so that Groff can work well with other
software.

P.S. to CHOI Junho: I think you had better subscribe to
[EMAIL PROTECTED], or you may miss messages.  I would like you not to
leave this discussion, because you are also a speaker of a
multibyte-encoded language.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/
Re: [Groff] Re: groff: radical re-implementation
Hi,

At Fri, 20 Oct 2000 14:14:44 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:

>> I think *ideograms* have fixed width everywhere.
>
> Well, maybe. But sometimes there is kerning. Please consult Ken
> Lunde's `CJKV Information Processing' for details. Example:
>
> 〇
> 一
> 〇

Wow!  This is the first time I have received Japanese mail from a
non-Japanese speaker.  Everyone, please notice the headers

> Content-Type: Text/Plain; charset=iso-2022-jp
> Content-Transfer-Encoding: 7bit

from Werner's mail.  (This mail will have the same headers, since I
have cited the Japanese characters.)

BTW, you may be right, since I am not an expert on typesetting.
However, I prefer fixed-width ideograms even for such cases; I have
never seen any printed matter with non-fixed-width ideograms, even in
such cases.

I have seen the book at a few bookstores (it has not been translated
into Japanese yet).  Though I am interested in it, the book is too
thick for me...  (Even if the book were in Japanese, it would be heavy
for me.)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/
Re: [Groff] Re: groff: radical re-implementation
On 21-Oct-00 Tomohiro KUBOTA wrote:
> Hi,
>
> At Fri, 20 Oct 2000 20:32:17 +0100 (BST),
> (Ted Harding) <[EMAIL PROTECTED]> wrote:
>
>> B: Troff should be able to cope with multi-lingual documents, where
>> several different languages occur in the same document. I do NOT
>> believe that the right way to do this is to extend troff's capacity
>> to recognise thousands of different input encodings covering all the
>> languages which it might be called upon to typeset (e.g. by Unicode or
>> the like).
>
> This is a confusion of language and encoding.  I think UTF-8 support
> can and should be achieved via locale technology.  By using locale
> technology, software can be written in an encoding-independent way,
> and will then support any encoding, including UTF-8.  Why should we
> hard-code UTF-8, when we can use locale technology to support any
> encoding, UTF-8 among them?

To bring the argument I am trying to present face-to-face with the
point you seem to be trying to make:

Someone writing a document about Middle Eastern and related
literatures may wish to use Arabic, Persian, Hebrew, Turkish (all of
which have different scripts), and also various Central Asian
languages (such as Turkmen) which are often written in Cyrillic or in
a variant of Cyrillic.

Can you explain how this could be handled "via locale technology" in
a single document?

Even in its present state, groff can handle such material in a
straightforward way (though with extra work from the user in order to
make the right-to-left languages work correctly -- the basic method
would be to get it formatted correctly with these languages input and
printed as left-to-right, and then to edit the input file, using the
formatted output for reference, to reverse the input sequences on a
line-by-line basis).

[One of the dangers which I fear, if groff were re-structured on a
"locale" basis or a similar mechanism, is that its flexibility, indeed
in principle its universality, would be compromised and limited by the
constraints of that mechanism.  It is perhaps not widely enough
recognised that groff, in its present state, is capable of being
greatly extended -- by means of user-defined macros, preprocessors,
and post-processors -- without fundamental change to troff.]

With best wishes,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 284 7749
Date: 21-Oct-00  Time: 15:39:24
-- XFMail --
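The line-by-line reversal Ted describes can itself be mechanised.
Below is a hypothetical little filter -- not an existing groff tool --
that reverses each input line character-wise in the current locale.
Real right-to-left setting would also need bidirectional analysis and
glyph mirroring, which this sketch deliberately ignores.

  /* revline: reverse each line character-wise (not byte-wise), using
     the current LC_CTYPE locale so that multibyte input survives.
     Purely illustrative; lines longer than the buffer are reversed
     in chunks. */
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main()
  {
      setlocale(LC_CTYPE, "");
      wchar_t line[4096];
      while (fgetws(line, 4096, stdin)) {
          size_t n = wcslen(line);
          if (n > 0 && line[n - 1] == L'\n')  /* keep newline in place */
              n--;
          for (size_t i = 0, j = n; i + 1 < j; i++, j--) {
              wchar_t t = line[i];
              line[i] = line[j - 1];
              line[j - 1] = t;
          }
          fputws(line, stdout);
      }
      return 0;
  }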
Re: [Groff] Re: groff: radical re-implementation
Hi,

At Sat, 21 Oct 2000 15:39:24 +0100 (BST),
(Ted Harding) <[EMAIL PROTECTED]> wrote:

> Someone writing a document about Middle Eastern and related literatures
> may wish to use Arabic, Persian, Hebrew, Turkish (all of which have
> different scripts), and also various Central Asian languages (such as
> Turkmen) which are often written in Cyrillic or in a variant of Cyrillic.
>
> Can you explain how this could be handled "via locale technology"
> in a single document?

In a UTF-8 or ISO-2022 locale, if the OS supports it -- for example,
"en_US.UTF-8", where the "en_US" part can be anything the OS supports.
Please consult "docs.sun.com: Unicode Support in the Solaris Operating
Environment",

http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT/@Ab2TocView?Ab2Lang=C&Ab2Enc=iso-8859-1

and you will understand how UTF-8 can be used with "locale"
technology.

OK, the preprocessor which I am writing will support both a "locale"
mode and a conventional mode (for compatibility).  The conventional
mode will support Latin-1, EBCDIC, and UTF-8.  Thus, you will be able
to use UTF-8 even on OSes which don't support a UTF-8 locale.

> [One of the dangers which I fear, if groff were re-structured on a
> "locale" basis or a similar mechanism, is that its flexibility, indeed
> in principle its universality, would be compromised and limited by the
> constraints of that mechanism.  It is perhaps not widely enough
> recognised that groff, in its present state, is capable of being
> greatly extended -- by means of user-defined macros, preprocessors,
> and post-processors -- without fundamental change to troff.]

Can you write a macro which enables locale-sensitive file/tty I/O?

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/
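For the "conventional mode" mentioned above -- fixed encodings chosen
independently of the OS locale -- the POSIX iconv() interface is one
plausible basis.  The sketch below is written under that assumption;
it is not Kubota's actual preprocessor, encoding names such as
"ISO-8859-1" vary between platforms, and the exact iconv() prototype
differs slightly between systems.

  /* Sketch of a conventional-mode front end: convert a fixed input
     encoding to UTF-8 with POSIX iconv(), needing no locale support
     from the OS. */
  #include <errno.h>
  #include <iconv.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      const char *from = (argc > 1) ? argv[1] : "ISO-8859-1";
      iconv_t cd = iconv_open("UTF-8", from);     /* to, from */
      if (cd == (iconv_t)-1) {
          perror("iconv_open");
          return 1;
      }
      char inbuf[BUFSIZ], outbuf[4 * BUFSIZ];
      size_t n;
      while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0) {
          char *in = inbuf, *out = outbuf;
          size_t inleft = n, outleft = sizeof outbuf;
          /* A real filter must carry the trailing bytes of an
             incomplete multibyte sequence over to the next read. */
          if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1
              && errno != EINVAL) {
              perror("iconv");
              break;
          }
          fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);
      }
      iconv_close(cd);
      return 0;
  }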
Re: [Groff] Re: groff: radical re-implementation
> Would it be useful to add to the texinfo documentation a note
> explaining that `-a' should only be used for these situations?

I've added some words, thanks.


Werner
Re: [Groff] Re: groff: radical re-implementation
> A.1. At present troff accepts 8-bit input, i.e. recognises 256
> distinct entities in the input stream (with a small number of
> exceptions which are "illegal").

We need at least 20 bits (for the Unicode BMP + surrogates) plus the
special characters.  A 32-bit wide number is thus the right choice,
IMHO.

> It does not really matter that these are interpreted, by default, as
> iso-latin-1.

I plan to remove the hard-coded `charXXX' values, moving them to
macro files.

> A.2. The direct correspondence between input bytes and characters is
> defined in the font files for the device.

But this isn't the right place.  Input character stuff should not be
there at all.

> I don't see anything wrong (except possibly in ease of use) in
> creating an ASCII input stream in order to generate Japanese output.

Not everything can, or should, be handled on the glyph level --
hyphenation, for example.

> Preparation of an output stream to drive a device capable of
> rendering the output is the job of the post-processor (and, provided
> you have installed appropriate font definition files, I cannot think
> of anything that would be beyond the PostScript device "devps").

As mentioned in another mail, we have to extend the metric directives
to cope with the many CJK characters without making troff too slow.

> A: It follows that troff is already language-independent, for all
> languages whose typographic conventions can be achieved by the
> primitive mechanisms already present in troff. For such languages,
> there is no need to change troff at all. For some other languages,
> there are minor extra requirements which would require small
> extensions to troff which would not interact with existing
> mechanisms.

Correct.  The changes we are discussing affect only the input
character level and not the troff engine itself (except for
additional typesetting features for CJK and possibly other
languages).

> I think that troff's hard-wired set of ligatures should be replaced
> by a user-definable set.

Definitely.

> Some characters have different glyphs at the beginning, the middle,
> or the end of words; and so on.

Usually, such changes involve contextual analysis, which I won't
implement.  If a preprocessor does this, it has to send glyph
entities directly to troff, so this isn't a problem.

> B: Troff should be able to cope with multi-lingual documents, where
> several different languages occur in the same document. I do NOT
> believe that the right way to do this is to extend troff's capacity
> to recognise thousands of different input encodings covering all the
> languages which it might be called upon to typeset (e.g. by Unicode
> or the like).

This is done by a preprocessor and is not visible to troff itself.
troff will see Unicode only.

> Troff's multi-character naming convention means that anything you
> could possibly need can be defined, and given a name in the troff
> input "character set" whenever you really need it, so long as you
> have the device resources to render the appropriate glyph.

There are only 256 `multi-characters', named `charXXX'.  Everything
else is a glyph entity (even if it behaves like a character in most
cases).  The reality is that groff doesn't really distinguish between
a character and a glyph, and implementing this distinction has high
priority for me.  I'll probably start by renaming a lot of troff
internals.
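To make the intended character/glyph distinction concrete, here is
one way the two notions could be kept apart in C++.  This is purely
illustrative -- the type names are invented, not groff internals:

  #include <stdio.h>
  #include <string>

  /* A character is an abstract code point, independent of device. */
  struct Character {
      unsigned int code;        /* Unicode scalar value; 32 bits wide */
  };

  /* A glyph is a concrete, named shape in some font, with metrics.
     One character may map to several glyphs (ligatures, positional
     forms), and some glyphs correspond to no single character. */
  struct Glyph {
      std::string name;         /* e.g. "co" for the copyright sign */
      int font;                 /* which mounted font provides it */
      int width;                /* horizontal metric, machine units */
  };

  int main()
  {
      Character c = { 0x00A9 };    /* COPYRIGHT SIGN, the character */
      Glyph g = { "co", 1, 24 };   /* one possible rendering of it */
      printf("character U+%04X -> glyph `%s'\n", c.code, g.name.c_str());
      return 0;
  }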
> If you want to use a multi-byte encoding in your input-preparation
> software, you can pre-process this with a suitable filter to
> generate the troff input-sequences you need (I have done this with
> WordPerfect multinational characters, for instance, which are
> two-byte entities).

This filter will be the yet-to-come preprocessor.

> CONCLUSION: Troff certainly needs some extensions to cope with the
> typesetting demands of some languages (of which the major ones that
> I can think of have been mentioned above). I also believe that there
> are some features of troff which need to be changed in any case, but
> these have nothing to do with language or "locale".

Locale support only affects pre- and postprocessors.


Werner
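A sketch of what the core of such a filter could look like: read
characters in the user's locale and emit pure ASCII troff input,
rewriting everything non-ASCII as a bracketed escape.  The `uXXXX'
glyph-name scheme is an assumption here -- whatever names the filter
emits must match entries in the device's font description files.

  /* Turn locale-encoded input into ASCII-only troff input.
     Assumes wchar_t values are Unicode (__STDC_ISO_10646__). */
  #include <locale.h>
  #include <stdio.h>
  #include <wchar.h>

  int main()
  {
      setlocale(LC_CTYPE, "");
      wint_t wc;
      while ((wc = fgetwc(stdin)) != WEOF) {
          if (wc < 0x80)
              putchar((int)wc);            /* ASCII passes through */
          else
              printf("\\[u%04lX]", (unsigned long)wc);
      }
      return 0;
  }

With this, a copyright sign in the input would arrive at troff as the
seven ASCII bytes \[u00A9].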
Re: [Groff] Re: groff: radical re-implementation
> 1. Your 'charset' and 'encoding' are for troff or for the
> preprocessor?

In general.  I want to define the terms independently of any
particular program.  We have

  character set
  character encoding
  glyph set
  glyph encoding

> I thought both of them are for the preprocessor. The preprocessor
> figures out from this information how to convert the input to
> UTF-8.

A groff preprocessor will work as you have described.  Under the
assumption that you are talking about input characters, the term
`encoding' indeed implies the character set(s).  After some thinking
I have to correct myself: it is better to say that `EUC' is an
`encoding scheme' which describes which character ranges and how many
bytes are used.  Sorry for the confusion.

> 2. Which will the pre/postprocessors handle, characters or glyphs?

The preprocessor converts from characters to characters (i.e. to
Unicode); grotty + postprocessor convert glyph names back to Unicode
characters (using a hard-coded table), then from characters to
characters.  I don't know yet whether it makes sense to unify the
latter two programs.

> 3. Your 'charset' is for glyph and 'encoding' is for character?
> I thought both of them are for character, since I thought both
> of them are for the preprocessor.

My point was to make the distinction between `set' and `encoding'
clear.  Maybe it is only of academic interest, but it (hopefully)
helps to clear up the terms used.

> 4. I thought we were discussing (tags in roff source for) the
> preprocessor. Is that right?

Yes.

> roff source in any encoding, like '\(co'     (character)
>   |
>   | preprocessor
>   V
> UTF-8 stream, like u+00a9                    (character)
>   |
>   | troff
>   V
> glyph expression, like 'co'                  (glyph)
>   |
>   | troff (continuing)
>   V

Here a step is missing:

  typeset output                               (glyph)
    |
    | grotty
    V

> UTF-8 stream, like u+00a9 or '(C)'           (character)
>   |
>   | postprocessor
>   V
> formatted text in any encoding               (character)


Werner
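The character-to-UTF-8 step at the bottom of the diagram is, in the
end, a table lookup followed by a small, fixed computation.  For
reference, here is a sketch of the encoding half -- an illustration,
not grotty's actual code:

  /* Encode one Unicode code point as UTF-8, as the output side of
     the diagram requires (e.g. u+00a9 -> 0xC2 0xA9). */
  #include <stdio.h>

  void put_utf8(unsigned long c)
  {
      if (c < 0x80)
          putchar((int)c);
      else if (c < 0x800) {
          putchar((int)(0xC0 | (c >> 6)));
          putchar((int)(0x80 | (c & 0x3F)));
      }
      else if (c < 0x10000) {
          putchar((int)(0xE0 | (c >> 12)));
          putchar((int)(0x80 | ((c >> 6) & 0x3F)));
          putchar((int)(0x80 | (c & 0x3F)));
      }
      else {                       /* the planes above the BMP */
          putchar((int)(0xF0 | (c >> 18)));
          putchar((int)(0x80 | ((c >> 12) & 0x3F)));
          putchar((int)(0x80 | ((c >> 6) & 0x3F)));
          putchar((int)(0x80 | (c & 0x3F)));
      }
  }

  int main()
  {
      put_utf8(0x00A9);            /* the copyright sign, as above */
      putchar('\n');
      return 0;
  }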
Re: [Groff] Re: groff: radical re-implementation
Hi Werner (and all),

Thanks for this clarifying explanation.  I have a couple of comments:
one explanatory, the other which, I think, may point to the core of
the question.

On 21-Oct-00 Werner LEMBERG wrote:
>> Troff's multi-character naming convention means that anything you
>> could possibly need can be defined, and given a name in the troff
>> input "character set" whenever you really need it, so long as you
>> have the device resources to render the appropriate glyph.
>
> There are only 256 `multi-characters', named `charXXX'.  Everything
> else is a glyph entity (even if it behaves like a character in most
> cases).  The reality is that groff doesn't really distinguish between
> a character and a glyph, and implementing this distinction has high
> priority for me.  I'll probably start by renaming a lot of troff
> internals.

1. Perhaps I should clarify: by "multi-character naming convention" I
mean the fact that you can decide to use a sequence of ASCII
characters, for instance "\[O-ogonek]", as the name of a "character".
In passing: I see no _logical_ distinction between using a string of
ASCII characters to name a "character", and using a string of bytes
which implements a UTF-8 encoding.

2. Perhaps it is a good point of view to see troff (gtroff) as an
engine which handles _glyphs_, not characters, in a given context of
typographic style and layout.  The current glyph is defined by the
current point size, the current font, and the name of the "character"
which is to be rendered, and troff necessarily takes account of the
metric information associated with this glyph.

The fact that ASCII characters and the iso-latin-1 characters
corresponding to byte-values > 128 are (by default) the troff names of
"characters" in a group of European languages -- together with certain
other marks and symbols -- is logically (in my view) an irrelevant
coincidence which happens to be very convenient for people using these
languages; but it is not at all necessary.  Nothing at all stops you
from defining

  .char a \(*a

to make "a" the name of Greek "alpha", and so on, if you want to
simplify the typing of input in a passage of Greek using an ASCII
interface.  Logically, therefore, troff could be "neutral" about what
the byte "a" stands for.

From that point of view, a troff which makes no assumptions of this
kind, and which consults external tables about the meaning of its
input and about the characteristics of what output that input implies,
purely for the purpose of correct formatting, is perhaps the pure
ideal.  And from that point of view, therefore, unifying the input
conventions on the basis of a comprehensive encoding (such as Unicode,
with UTF-8, is intended to become) would be a great step towards
attaining this neutrality.  However, I wish to think more about this
issue.

Meanwhile, interested parties who have not yet studied it may find the
"UTF-8 and Unicode FAQ for Unix/Linux" by Markus Kuhn well worth
reading:

http://www.cl.cam.ac.uk/~mgk25/unicode.html

By the way, your comment that hyphenation, for instance, is not a
"glyph question" is, I think, not wholly correct.  Certainly,
hyphenation _rules_ are not a glyph question: as well as being
language-dependent, there may also be "house rules" about it; these
come under "typographic style" as above.  But the size of a hyphen and
the associated spacing are glyph issues, and these may interact with
where a hyphenation occurs, or whether it occurs at all according to
the rules.

An interesting debate!
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 284 7749
Date: 21-Oct-00  Time: 23:47:03
-- XFMail --