Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Werner LEMBERG

> However, I am interested in how Groff 1.16 works for UTF-8 input.
> I could not find any code for UTF-8 input, though I found code for
> UTF-8 output in src/devices/grotty/tty.cc.  Am I missing something?
> (Of course /font/devutf8/* has no implementation of UTF-8 encoding,
> though it seems to have a table for glyph names -> UCS-2.)

No UTF-8 input support.  This is an urgent need for one of the next
major releases.  grotty can output UTF-8 if activated with -Tutf8.


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Werner LEMBERG

> > Well, maybe.  But sometimes there is kerning.  Please consult Ken
> > Lunde's `CJKV Information Processing' for details.  Example:
> > 
> >〇
> >一
> >〇
> 
> Wow!  This is the first time I received a Japanese mail from a
> non-Japanese speaker.

I *like* the Mew mailing program for Emacs :-)

> I have seen the book at a few bookstores (not translated into Japanese
> yet).  Though I am interested in it, the book is too thick for me...
> (Even if the book were in Japanese, it would be heavy for me.)

I strongly recommend buying it, especially if you are interested in
character sets and encodings.  Of its 1200+ pages (or so), about 900
are mapping tables.


Werner

Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Werner LEMBERG

> 2. Perhaps it is a good point of view to see troff (gtroff) as an
> engine which handles _glyphs_, not characters, in a given context of
> typographic style and layout. The current glyph is defined by the
> current point size, the current font, and the name of the
> "character" which is to be rendered, and troff necessarily takes
> account of the metric information associated with this glyph.

Exactly.  But the current terminology in gtroff is more than
ambiguous, and I believe that we need a clear separation between
characters and glyphs.

> Logically, therefore, troff could be "neutral" about what the byte
> "a" stands for. From that point of view, a troff which makes no
> assumptions of this kind, and which consults external tables about
> the meaning of its input and about the characteristics of what
> output that input implies, purely for the purpose of correct
> formatting, is perhaps the pure ideal. And from that point of view,
> therefore, unifying the input conventions on the basis of a
> comprehensive encoding (such as UTF-8 or Unicode is intended to
> become) would be a great step towards attaining this neutrality.

I fully agree.  A single input character set (as universal as
possible) is the right thing, and everything else should be handled by
preprocessors (and a postprocessor for tty).
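As a rough illustration of that separation (a sketch only, not gtroff's
actual data structures; the table contents, widths, and function names
below are made up), the formatter would stay neutral about its input and
simply consult a table that maps input characters, identified by code
point, to named glyph entries carrying the metric information:

  #include <cstdio>
  #include <map>
  #include <string>

  // Hypothetical glyph entry: the name used by the output device plus
  // the metric the formatter needs for spacing and line breaking.
  struct glyph_entry {
    std::string name;   // e.g. "co" for the copyright sign
    int width;          // width in machine units for the current font/size
  };

  // A character -> glyph table for one font; a real formatter would load
  // this from the device and font description files, not hard-code it.
  static const std::map<char32_t, glyph_entry> glyph_table = {
    { U'\u00A9', { "co", 24 } },   // COPYRIGHT SIGN
    { U'\u2014', { "em", 36 } },   // EM DASH
    { U'a',      { "a",  12 } },   // LATIN SMALL LETTER A
  };

  // Look up the glyph for an input character; what the character "means"
  // is decided entirely by the table, not by the formatter itself.
  const glyph_entry *lookup_glyph(char32_t c) {
    auto it = glyph_table.find(c);
    return it == glyph_table.end() ? nullptr : &it->second;
  }

  int main() {
    if (const glyph_entry *g = lookup_glyph(U'\u00A9'))
      std::printf("glyph '%s', width %d\n", g->name.c_str(), g->width);
  }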

> Meanwhile, interested parties who have not yet studied it may find
> the "UTF-8 and Unicode FAQ for Unix/Linux" by Markus Kuhn well worth
> reading:
> 
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Yes, Markus is doing an excellent job.

> By the way, your comment that hyphenation, for instance, is not a
> "glyph question" is, I think, not wholly correct. Certainly,
> hyphenation _rules_ are not a glyph question: as well as being
> language-dependent, there may also be "house rules" about it; these
> come under "typographic style" as above. But the size of a hyphen
> and associated spacing are glyph issues, and these may interact with
> where a hyphenation occurs or whether it occurs at all, according to
> the rules.

I mean the algorithm for finding possible breakpoints, which must be
based on input characters.  The final decision of where a word is
actually broken is of course a glyph issue.
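
To make that split concrete (purely illustrative function names and stub
logic; real hyphenation patterns and glyph metrics are far more
involved): the candidate breakpoints can be computed from the character
string alone, while choosing among them needs the glyph widths of the
current font and size.

  #include <cstdio>
  #include <string>
  #include <vector>

  // Character level: possible break positions inside a word.  A real
  // implementation applies language-dependent hyphenation patterns;
  // this stub merely allows a break after every second letter.
  std::vector<std::size_t> candidate_breakpoints(const std::string &word) {
    std::vector<std::size_t> pos;
    for (std::size_t i = 2; i + 2 < word.size(); i += 2)
      pos.push_back(i);
    return pos;
  }

  // Glyph level: pick the last candidate whose prefix, plus a hyphen
  // glyph, still fits into the space left on the line.  The widths come
  // from the glyph metrics of the current font and point size; here
  // every glyph gets the same width to keep the sketch short.
  std::size_t choose_breakpoint(const std::vector<std::size_t> &candidates,
                                int space_left, int glyph_width,
                                int hyphen_width) {
    std::size_t best = 0;            // 0 means: do not break at all
    for (std::size_t p : candidates)
      if (static_cast<int>(p) * glyph_width + hyphen_width <= space_left)
        best = p;
    return best;
  }

  int main() {
    std::string word = "hyphenation";
    std::size_t p = choose_breakpoint(candidate_breakpoints(word), 60, 10, 6);
    std::printf("break '%s' after %zu characters\n", word.c_str(), p);
  }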


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Tomohiro KUBOTA
Hi,

At Sat, 21 Oct 2000 10:46:51 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:

> In general.  I want to define terms completely independent of any
> particular program.  We have
> 
>   character set
>   character encoding
>   glyph set
>   glyph encoding

I understand.  Since we are discussing the preprocessor, let's
concentrate on characters, not glyphs.  I think you will now agree to
specify the 'character set/encoding' with a single word such as
'EUC-JP' instead of a pair like 'JIS-X-0208' and 'EUC'.

BTW, I am implementing the preprocessor.  It now has the following
features:
 - input from standard input (stdin)
 - output to standard output (stdout)
 - an I18N directive to support a locale-sensible mode
 - a hard-coded converter from Latin1, EBCDIC, and UTF-8 to UTF-8
 - a locale-sensible converter from any encoding supported by the OS to
   UTF-8 (note: UTF-8 has to be supported by iconv(3))
 - the input encoding is determined by a command-line option or a default
 - the default is 'latin1' when compiled without I18N, or locale-sensible
   when compiled with I18N
However, I still have to implement:
 - determining the encoding also from a '-*- ... -*-' directive in the
   roff source
 - (I18N mode) specifying the encoding by MIME-style and Emacs-style
   names
 - better memory and CPU efficiency (not considered yet)
 - input from files besides stdin

I will send the source soon.
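
In the meantime, here is a minimal sketch of the iconv(3)-based
conversion core such a preprocessor could be built around (an
illustration only, not the source announced above; the program name,
the option handling, the I18N directive, and the hard-coded converters
are left out, and error handling is reduced to the bare minimum):

  #include <cerrno>
  #include <cstddef>
  #include <cstdio>
  #include <cstring>
  #include <iconv.h>

  // Convert stdin from the named encoding to UTF-8 on stdout.
  // Usage: preconv-sketch [input-encoding]     (default: latin1)
  int main(int argc, char **argv) {
    const char *from = argc > 1 ? argv[1] : "latin1";
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) {
      std::perror("iconv_open");
      return 1;
    }

    char inbuf[4096];
    char outbuf[4 * sizeof inbuf];   // UTF-8 output can be larger than input
    std::size_t inleft = 0;

    for (;;) {
      std::size_t n = std::fread(inbuf + inleft, 1,
                                 sizeof inbuf - inleft, stdin);
      if (n == 0 && inleft == 0)
        break;
      inleft += n;

      char *inp = inbuf;
      char *outp = outbuf;
      std::size_t outleft = sizeof outbuf;
      if (iconv(cd, &inp, &inleft, &outp, &outleft) == (std::size_t)-1
          && errno != EINVAL) {      // EINVAL: multibyte sequence cut off
        std::perror("iconv");
        return 1;
      }
      std::fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);

      // Keep an incomplete trailing sequence for the next read.
      std::memmove(inbuf, inp, inleft);
      if (n == 0)                    // EOF reached with leftover bytes
        break;
    }
    iconv_close(cd);
    return 0;
  }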


> >roff source in any encoding like '\(co'  (character)
> >   |
> >   |  preprocessor
> >   V
> >UTF-8 stream like u+00a9  (character)
> >   |
> >   |  troff
> >   V
> >glyph expression like 'co'  (glyph)
> >   |
> >   |  troff (continuing)
> >   V
>
> A step is missing here:
>
> typeset output  (glyph)
>    |
>    |  grotty
>    V
>
> >UTF-8 stream like u+00a9 or '(C)'  (character)
> >   |
> >   |  postprocessor
> >   V
> >formatted text in any encoding  (character)


I understand well.  Thank you for your explanation.
BTW, besides TTY output, HTML will also need a postprocessing step from
glyphs back to characters, like 'grotty' in tty mode, since HTML is a
text file.  I think the encoding for HTML can always be UTF-8.  We can
add a line such as

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

between <head> and </head>.
(I found code in grohtml.cc that writes this line, but without a charset
directive.)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/