> > Well, I insist that GNU troff doesn't support multibyte encodings
> > at all :-)  troff itself should work on a glyph basis only.  It has
> > to work with *glyph names*, be it CJK entities or whatever.
> > Currently, the conversion from input encoding to glyph entities
> > and the further processing of glyphs is not clearly separated.
> > From a modular point of view it makes sense if troff itself is
> > restricted to a single input encoding (UTF-8) which is basically
> > only meant as a wrapper to glyph names (cf. \U'xxxx' to enter
> > Unicode encoded characters).  Everything else should be moved to a
> > preprocessor.
>
> This paragraph says two things:
>
> - GNU troff will support UTF-8 only.  Thus, multibyte encodings
>   will not be supported.  [Though UTF-8 is multibyte :-p ]

This was a typo, sorry.  I meant that I don't want to support multiple
multibyte encodings.
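As a rough sketch of what `moving everything else to a preprocessor'
could mean in practice, the following small C program (purely
illustrative; the function names and the lack of error checking are my
own simplifications, not actual groff code) reads UTF-8 and hands every
non-ASCII character to troff via the \U'xxxx' entity syntax quoted
above:

  #include <stdio.h>

  /* Decode one UTF-8 sequence starting at s and store the number of
     bytes consumed in *len.  Deliberately simplified: no checks for
     overlong forms, surrogates, or malformed continuation bytes.  */
  static unsigned long decode_utf8(const unsigned char *s, int *len)
  {
    if (s[0] < 0x80) {
      *len = 1;
      return s[0];
    }
    if ((s[0] & 0xE0) == 0xC0) {
      *len = 2;
      return ((unsigned long)(s[0] & 0x1F) << 6) | (s[1] & 0x3F);
    }
    if ((s[0] & 0xF0) == 0xE0) {
      *len = 3;
      return ((unsigned long)(s[0] & 0x0F) << 12)
             | ((unsigned long)(s[1] & 0x3F) << 6) | (s[2] & 0x3F);
    }
    *len = 4;
    return ((unsigned long)(s[0] & 0x07) << 18)
           | ((unsigned long)(s[1] & 0x3F) << 12)
           | ((unsigned long)(s[2] & 0x3F) << 6) | (s[3] & 0x3F);
  }

  /* Copy ASCII through untouched; wrap everything else into the
     \U'xxxx' entity syntax so that troff only ever sees glyph
     entities, never the input encoding.  */
  static void emit(const unsigned char *s)
  {
    int len;
    while (*s) {
      unsigned long c = decode_utf8(s, &len);
      if (c < 0x80)
        putchar((int)c);
      else
        printf("\\U'%04lx'", c);
      s += len;
    }
  }

  int main(void)
  {
    /* 0xC3 0xAF is the UTF-8 form of U+00EF; prints na\U'00ef've.  */
    emit((const unsigned char *)"na\xC3\xAFve");
    putchar('\n');
    return 0;
  }

Compiled into a small filter, this is essentially the `wrapper to
glyph names' role described above: troff itself never sees the input
encoding, only ASCII plus glyph entities.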
> - Groff handles glyphs, not characters.
>   I don't understand the relationship between these two.  UTF-8 is a
>   code for characters, not glyphs.  ISO8859-1 and EUC-JP are also
>   codes for characters.  No difference among UTF-8, ISO8859-1, and
>   EUC-JP.

Well, this is *very* important.  The most famous example is that the
character `f', followed by the character `i', will be translated into
a single glyph `fi' (which, incidentally, has a Unicode number for
historical reasons).  A lot of other ligatures don't have a character
code.  Or consider a font which has 10 or more versions of the `&'
character (such a font really exists).  Do you see the difference?  A
font can have multiple glyphs for a single character.

For other scripts like Arabic it is necessary to do a lot of contextual
analysis to get the right glyphs.  Indic scripts like Tamil have about
50 input character codes which map to up to 3000 glyphs!

Consider the CJK part of Unicode.  A lot of Chinese, Korean, Japanese,
and Vietnamese glyphs have been unified, but you have to select a
proper locale to get the right glyph -- many Japanese people have been
misled because a lot of glyphs in the Unicode book which have a JIS
character code don't look `right' for Japanese.

For me, groff is primarily a text processing tool, and such a program
works with glyphs to be printed on paper.  A `character' is an abstract
concept, basically.  Your point of view, I think, is completely
different: you treat groff as a filter which just inserts or removes
some spaces, newline characters, etc.

> However, I won't stick to wchar_t or UCS-4 for the internal code,
> though I have no idea about your `31-bit glyph code'.  (Maybe I have
> to study Omega...)

A `glyph code' is just an arbitrary registration number for a glyph
specified in the font definition file.  It is independent of the input
encoding.  Adobe has `official' glyph lists like `Adobe standard' or
`Adobe Japan1'.  CID-encoded PostScript fonts use CMaps to map the
input encoding to these glyph IDs.

> The name `--locale' is confusing since it has no relation to locale,
> i.e., a term which refers to a certain standard technology.

I welcome any suggestions for better names...

> - Japanese and Chinese text contains few whitespace characters.
>   (Japanese and Chinese words are not separated by whitespace.)
>   Therefore, a different line-breaking algorithm should be used.
>   (The hyphen character is not used when a word is broken across
>   lines.)  (Modern Korean contains whitespace characters between
>   words -- though not words, strictly speaking.)

Not really a different line-breaking algorithm, but more glyph
properties (to be set with `.cflags'): disallowing breaks after or
before a glyph implements kinsoku shori; for implementing shibuaki
properly we probably need to extend the `.cflags' syntax so that glyph
properties can be set for whole glyph classes.

For the non-CJK experts: `kinsoku shori' means that some CJK glyphs
must not start a line (for example, an ideographic comma or a closing
bracket) or must not end a line (opening brackets).  `shibuaki' means
`quarter space'; this is the space between CJK characters and Latin
characters -- there are Japanese standards which define all these
things in great detail.

> - The hyphenation algorithm differs from language to language.

What exactly do you mean?  The only really difficult language that
could not easily be supported with groff is Thai (and similar
languages): you need at least a dictionary to find word breaks.  All
other languages can easily be managed with the current algorithm, I
believe.

> - Almost all CJK characters (ideographs, hiragana, katakana, hangul,
>   and so on) have double width on a tty.  Since you won't use
>   wchar_t, you cannot use wcwidth() to get the width of characters.

This is not a problem.  Just give the proper glyph width in the tty
font definition files.

> - Latin-1 people may use 0xa9 for `\(co'.  However, this character
>   cannot be read in other encodings.  The current Groff converts
>   `\(co' to 0xa9 on the latin1 device and to `(C)' on the ascii
>   device.  How will it work for the future Groff?  Use U+00A9?  The
>   postprocessor (see below) cannot convert U+00A9 to `(C)' because
>   the width is different and the typesetting is broken.  It is very
>   difficult to design around this problem...

For tty devices, the route is as follows.  Let's assume that the input
encoding is Latin-1.  Then the input character code `0xa9' will be
converted (by the preprocessor) to the Unicode character `U+00A9'.  A
hard-coded table maps this character code to a glyph with the name
`co'.  Now troff looks up the metric info in the font definition file.
If the target device is an ASCII-capable terminal, the width is three
characters (the glyph `co' is defined with the `.char' request to be
equal to `(C)'); if it is a Unicode-capable terminal, the width is one
character.  After formatting, a hard-coded table maps the glyphs back
to Unicode.  Note that this last step may fail for glyphs which have
no corresponding Unicode value.

> > . Finally, we need to divide the -T option into a --device and an
> >   --output-encoding option.
>
> What is the default encoding for tty?  I suggest this should be
> locale-sensible.  (Or, this can be UTF-8 and Groff can invoke a
> postprocessor.)

I favor UTF-8 + postprocessor.  Terminal capabilities should be
selected with macro packages; for example, an ASCII terminal would get
the options

  -m ascii --device=tty --output-encoding=ascii

The tmac.ascii file would be very similar to tmac.tty + tmac.tty-char.
A Latin-2 terminal would be

  -m latin2 --device=tty --output-encoding=latin2

A Unicode terminal emulating an ASCII terminal would be

  -m ascii --device=tty --output-encoding=utf8

etc.  Using a postprocessor, we need only a single font definition file
for tty devices.

> > Yes.  The `iconv' preprocessor would then do some trivial,
> > hard-coded conversion.
>
> You mean, the preprocessor is iconv(1)?

Basically yes, with some adaptations to groff.

> The preprocessor, provisional name `gpreconv', will be designed as:
>
> - includes hard-coded converters for latin1, ebcdic, and utf8.
> - uses iconv(3) if possible (compiled within an internationalized
>   OS).
> - parses the --input-encoding option.
> - default input is latin1 if compiled within a non-internationalized
>   OS.
> - default input is locale-sensible if compiled within an
>   internationalized OS.

Exactly.
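To make the gpreconv idea a bit more concrete, here is a minimal sketch
in C of such a front end, under the assumption that iconv(3) is
available; the option parsing, buffer handling, and fallbacks are
illustrative only, not a proposed implementation.  On a system without
iconv(3), the convert() routine would be replaced by the hard-coded
latin1/ebcdic/utf8 converters from the list above.

  #include <stdio.h>
  #include <string.h>
  #include <iconv.h>
  #include <locale.h>
  #include <langinfo.h>

  /* Convert stdin from `fromcode' to UTF-8 on stdout via iconv(3).
     Error handling is reduced to the bare minimum for this sketch.  */
  static int convert(const char *fromcode)
  {
    iconv_t cd = iconv_open("UTF-8", fromcode);
    char inbuf[BUFSIZ], outbuf[4 * BUFSIZ];
    size_t n;

    if (cd == (iconv_t)-1) {
      perror("iconv_open");
      return 1;
    }
    while ((n = fread(inbuf, 1, sizeof inbuf, stdin)) > 0) {
      char *in = inbuf, *out = outbuf;
      size_t inleft = n, outleft = sizeof outbuf;
      if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1) {
        perror("iconv");
        break;
      }
      fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);
    }
    iconv_close(cd);
    return 0;
  }

  int main(int argc, char **argv)
  {
    const char *enc;

    /* --input-encoding=... overrides; otherwise take the current
       locale's character set, falling back to Latin-1.  */
    if (argc > 1 && strncmp(argv[1], "--input-encoding=", 17) == 0)
      enc = argv[1] + 17;
    else {
      setlocale(LC_CTYPE, "");
      enc = nl_langinfo(CODESET);
      if (enc == NULL || *enc == '\0')
        enc = "ISO-8859-1";
    }
    return convert(enc);
  }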
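And coming back to the `\(co' route above: the two hard-coded tables
(character code to glyph name on input, glyph name back to Unicode in
the tty postprocessor) could start out roughly as follows; the handful
of entries and the helper names are again just illustrations, not the
actual tables.

  #include <string.h>

  /* A few sample entries of the hard-coded mapping between Unicode
     code points and groff glyph names; the real table would of course
     cover far more glyphs.  */
  struct glyph_entry {
    unsigned long code;  /* Unicode code point, e.g. 0x00A9 */
    const char *name;    /* groff glyph name,   e.g. "co"   */
  };

  static const struct glyph_entry glyph_table[] = {
    { 0x00A9, "co" },   /* COPYRIGHT SIGN  */
    { 0x00AE, "rg" },   /* REGISTERED SIGN */
    { 0x2014, "em" },   /* EM DASH         */
    { 0x2022, "bu" },   /* BULLET          */
  };

  #define TABLE_SIZE (sizeof glyph_table / sizeof glyph_table[0])

  /* Input direction: character code -> glyph name, used before
     formatting.  Returns NULL for characters without a named glyph.  */
  const char *glyph_from_code(unsigned long code)
  {
    size_t i;
    for (i = 0; i < TABLE_SIZE; i++)
      if (glyph_table[i].code == code)
        return glyph_table[i].name;
    return NULL;
  }

  /* Output direction: glyph name -> character code, used by the tty
     postprocessor.  Returns 0 if the glyph has no Unicode value --
     exactly the case where the round trip can fail.  */
  unsigned long code_from_glyph(const char *name)
  {
    size_t i;
    for (i = 0; i < TABLE_SIZE; i++)
      if (strcmp(glyph_table[i].name, name) == 0)
        return glyph_table[i].code;
    return 0;
  }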
> Thus I designed the above `gpreconv'.  Oh, I have to design
> `gpostconv' also.  It should be very similar to the preprocessor.


    Werner