Re: [Groff] Re: groff: radical re-implementation
> However, I am interested in how Groff 1.16 works for UTF-8 input.
> I could not find any code for UTF-8 input, though I found a code for
> UTF-8 output in src/devices/grotty/tty.cc .  Am I missing something?
> (Of course /font/devutf8/* has no implementation of UTF-8 encoding,
> though it seems to have a table for glyph names -> UCS-2.)

No UTF-8 input support.  This is an urgent need for one of the next
major releases.  grotty can output UTF-8 if activated with -Tutf8.

    Werner
Re: [Groff] Re: groff: radical re-implementation
> > Well, maybe.  But sometimes there is kerning.  Please consult Ken
> > Lunde's `CJKV Information Processing' for details.  Example:
> >
> > 〇
> > 一
> > 〇
>
> Wow!  This is the first time I have received a Japanese mail from a
> non-Japanese speaker.  I *like* the Mew mailing program for Emacs :-)
> I have seen the book at a few bookstores (it is not translated into
> Japanese yet).  Though I am interested in it, the book is too thick
> for me...  (Even if the book were in Japanese, it would be heavy for
> me.)

I strongly recommend buying it, especially if you are interested in
character sets and encodings.  Of its 1200+ pages (or so), about 900
pages are mapping tables.

    Werner
Re: [Groff] Re: groff: radical re-implementation
> 2. Perhaps it is a good point of view to see troff (gtroff) as an
>    engine which handles _glyphs_, not characters, in a given context
>    of typographic style and layout.  The current glyph is defined by
>    the current point size, the current font, and the name of the
>    "character" which is to be rendered, and troff necessarily takes
>    account of the metric information associated with this glyph.

Exactly.  But the current terminology in gtroff is more than
ambiguous, and I believe that we need a clear separation between
characters and glyphs.

> Logically, therefore, troff could be "neutral" about what the byte
> "a" stands for.  From that point of view, a troff which makes no
> assumptions of this kind, and which consults external tables about
> the meaning of its input and about the characteristics of what
> output that input implies, purely for the purpose of correct
> formatting, is perhaps the pure ideal.  And from that point of view,
> therefore, unifying the input conventions on the basis of a
> comprehensive encoding (such as UTF-8 or Unicode is intended to
> become) would be a great step towards attaining this neutrality.

I fully agree.  A single input character set (as universal as
possible) is the right thing, and everything else shall be managed by
preprocessors (and a postprocessor for tty).

> Meanwhile, interested parties who have not yet studied it may find
> the "UTF-8 and Unicode FAQ for Unix/Linux" by Markus Kuhn well worth
> reading:
>
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Yes, Markus is doing an excellent job.

> By the way, your comment that hyphenation, for instance, is not a
> "glyph question" is, I think, not wholly correct.  Certainly,
> hyphenation _rules_ are not a glyph question: as well as being
> language-dependent, there may also be "house rules" about it; these
> come under "typographic style" as above.  But the size of a hyphen
> and associated spacing are glyph issues, and these may interact with
> where a hyphenation occurs or whether it occurs at all, according to
> the rules.

I mean the algorithm for finding possible breakpoints, which must be
based on input characters.  The final decision where a word will be
broken is of course a glyph issue.

    Werner
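The distinction Werner draws here can be sketched in code: breakpoint
*candidates* are found on the character level, while whether a break is
actually taken depends on glyph metrics.  This is an illustrative toy,
not groff's hyphenation algorithm; the break-after-every-vowel rule and
the unit glyph widths are invented for the example.

```python
HYPHEN_WIDTH = 1  # assumed glyph width of '-' in arbitrary units


def candidate_breaks(word):
    """Character-level rule: toy example allowing a break after
    every vowel (never splitting off the last two letters)."""
    return [i + 1 for i, c in enumerate(word[:-2]) if c in "aeiou"]


def choose_break(word, room, widths):
    """Glyph-level decision: pick the last candidate whose prefix,
    plus a hyphen glyph, still fits into `room`."""
    best = None
    for i in candidate_breaks(word):
        if sum(widths[c] for c in word[:i]) + HYPHEN_WIDTH <= room:
            best = i
    return best
```

With unit widths, `choose_break("typography", 6, ...)` picks the break
after "typo": the character rule offers two candidates, and the glyph
widths rule out the later one.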
Re: [Groff] Re: groff: radical re-implementation
Hi,

At Sat, 21 Oct 2000 10:46:51 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:

> In general.  I want to define terms completely independent of any
> particular program.  We have
>
>   character set
>   character encoding
>   glyph set
>   glyph encoding

I understand.  Since we are discussing the preprocessor, let's
concentrate on characters, not glyphs.  I think you will now agree to
specify the 'character set/encoding' by a single word such as 'EUC-JP'
instead of a pair of 'JIS-X-0208' and 'EUC'.

BTW, I am implementing the preprocessor.  It now has the following
features:

 - input from standard input (stdin)
 - output to standard output (stdout)
 - I18N directive to support a locale-sensible mode
 - hard-coded converters from Latin-1, EBCDIC, and UTF-8 to UTF-8
 - locale-sensible converter from any encoding supported by the OS to
   UTF-8 (note: UTF-8 has to be supported by iconv(3))
 - the input encoding is determined by a command-line option or the
   default
 - the default is 'latin1' when compiled without I18N, or
   locale-sensible when compiled with I18N

However, I still have to implement the following:

 - the encoding also has to be determined by a '-*- ... -*-' directive
   in the roff source
 - (I18N mode) it has to be possible to specify the encoding by
   MIME-style and Emacs-style names
 - efficiency of memory and CPU usage is not considered yet
 - input from files besides stdin

I will send the source soon.

> >   roff source in any encoding like '\(co' (character)
> >        |
> >        | preprocessor
> >        V
> >   UTF-8 stream like u+00a9 (character)
> >        |
> >        | troff
> >        V
> >   glyph expression like 'co' (glyph)
> >        |
> >        | troff (continuing)
> >        V
>
> Here a step is missing:
>
>     typeset output (glyph)
>        |
>        | grotty
>        V
>
> >   UTF-8 stream like u+00a9 or '(C)' (character)
> >        |
> >        | postprocessor
> >        V
> >   formatted text in any encoding (character)

I understand well.  Thank you for your explanation.

BTW, besides TTY output, HTML will need a postprocess from glyphs to
characters like 'grotty' in tty mode, since HTML is a text file.
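The encoding-selection behaviour described above (take the encoding
from an Emacs-style '-*- coding: ... -*-' tag when one is present,
otherwise fall back to a default, then convert everything to UTF-8)
might be sketched as follows.  This is a hypothetical Python toy, not
Kubota's actual implementation; the regular expression and the latin-1
default are assumptions based on the description.

```python
import re

# Emacs-style file-local variable tag, e.g.:  .\" -*- coding: euc-jp -*-
CODING_RE = re.compile(rb"-\*-.*?coding:\s*([-\w]+).*?-\*-")


def detect_encoding(first_line, default="latin-1"):
    """Pick the input encoding from a '-*- coding: ... -*-' tag on the
    first line, falling back to the given default."""
    m = CODING_RE.search(first_line)
    return m.group(1).decode("ascii") if m else default


def to_utf8(data, default="latin-1"):
    """Decode the input in the detected encoding, re-emit it as UTF-8."""
    first_line = data.split(b"\n", 1)[0]
    return data.decode(detect_encoding(first_line, default)).encode("utf-8")
```

A real preprocessor would stream line by line and use iconv(3) rather
than Python codecs, but the shape of the decision is the same.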
I think the encoding for HTML can always be UTF-8.  We can add a line
like

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

between <head> and </head>.  (I found code in grohtml.cc which writes
this line without the charset directive.)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/
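The glyph-to-character postprocessing step described in this message,
mapping glyph names back to the characters they stand for and emitting
UTF-8 text, might look roughly like this.  The names follow troff
special-character names ('\(co' for the copyright sign and so on), but
the three-entry table is illustrative only; groff's real glyph lists
are far larger, and the '?' fallback is an assumption.

```python
# Tiny illustrative table: troff special-character name -> character.
GLYPH_TO_CHAR = {
    "co": "\u00a9",  # copyright sign, \(co
    "rg": "\u00ae",  # registered sign, \(rg
    "hy": "\u2010",  # hyphen, \(hy
}


def glyphs_to_utf8(glyph_names):
    """Map glyph names back to text and encode the result as UTF-8;
    unknown names become '?' (an assumed, simplistic fallback)."""
    text = "".join(GLYPH_TO_CHAR.get(name, "?") for name in glyph_names)
    return text.encode("utf-8")
```

So a 'co' glyph in the typeset output would come back out of the HTML
postprocessor as the UTF-8 bytes for U+00A9.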