Hi,

At Wed, 18 Oct 2000 00:46:46 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:
>> - GNU troff will support UTF-8 only. Thus, multibyte encodings
>> will be not supported. [Though UTF-8 is multibyte :-p ]
>
> This was a typo, sorry. I've meant that I don't want to support
> multiple multibyte encodings.

>> - Groff handles glyph, not character.
>> I don't understand relationship between these two. UTF-8 is a code
>> for character, not glyph. ISO8859-1 and EUC-JP are also codes for
>> character. No difference among UTF-8, ISO8859-1, and EUC-JP.

Ah, by 'these two' I meant the relationship between the statement that
Groff supports UTF-8 only and the statement that Groff processes
glyphs. Sorry for my poor English.

However, thank you for explaining glyphs. I see that you understand
the problems with Japanese character codes well. I also understand
the basic design of groff, though I had to read the source code of
groff myself... I wonder why people who know this better than I do
don't join this list and discuss internationalization.

Note that CJK ideographs also have a distinction between character and
glyph. The most famous example is the two variants of a 'tall or
high' character. Japanese people regard these two as the same in
daily use, but regard them as different when they are used in
personal names and the like. I don't know how Chinese and Korean
people treat them; it may be different. However, IMHO, we should set
this problem aside for now, since there is so far no standard for
treating these variants properly. Though it is important, it is not
in our scope.

> A `glyph code' is just an arbitrary registration number for a glyph
> specified in the font definition file.

Then the 'font definition file' will become unreasonably large. I
think that at least CJK ideographs and Korean precomposed Hangul have
to be treated in a different way. (Ukai has already pointed out this
problem; jgroff uses 'wchar<EUCcode>' for the glyph names of Japanese
characters.)

> For tty devices, the route is as follows. Let's assume that the input
> encoding is Latin-1.
> Then the input character code `0xa9' will be
> converted to Unicode character `U+00a9' (by the preprocessor).
> A hard-coded table maps this character code to a glyph with the name
> `co'. Now troff looks up the metric info in the font definition file.

Yes.

> If the target device is an ASCII-capable terminal, the width is three
> characters (the glyph `co' is defined with the .char request to be
> equal to `(C)'); if it is a Unicode-capable terminal, the width is one
> character. After formatting, a hard-coded table maps the glyphs back
> to Unicode.

How does troff know whether the tty device is ASCII-capable or
Unicode-capable?

--- Ok, I understand it by reading the next line:

> -m ascii --device=tty --output-encoding=ascii

'-m ascii' tells that. '--output-encoding' will be passed through to
the postprocessor.

A problem: when groff is compiled on an internationalized OS, the
names of the encodings (for iconv(3) and so on) are
implementation-dependent. (You know, there are many
implementation-dependent items in standard C/C++.) A solution: we can
have a hard-coded translation table between implementation-dependent
encoding names and the macro names for -m. The table must be adjusted
per OS (by the './configure' script or so). A minimal table would
translate every implementation-dependent encoding name into the
'ascii' macro, since almost all encodings in the world are supersets
of ASCII. A full table for an OS would cover the list generated by
'iconv --list'.

# Though I think some standardization of encoding names is needed,
# it is not our topic now.

Since the '-m' option is generated by groff and passed to troff, groff
has to have '#ifdef I18N' code. (Or the code can be integrated into
the preprocessor, if we design the preprocessor to invoke troff.)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/