Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Werner LEMBERG

> However, I am interested in how Groff 1.16 works for UTF-8 input.
> I could not find any code for UTF-8 input, though I found code for
> UTF-8 output in src/devices/grotty/tty.cc.  Am I missing something?
> (Of course /font/devutf8/* has no implementation of UTF-8 encoding,
> though it seems to have a table for glyph names -> UCS-2.)

No UTF-8 input support.  This is an urgent need for one of the next
major releases.  grotty can output UTF-8 if activated with -Tutf8.


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Werner LEMBERG

> > Well, maybe.  But sometimes there is kerning.  Please consult Ken
> > Lunde's `CJKV Information Processing' for details.  Example:
> > 
> >〇
> >一
> >〇
> 
> Wow!  This is the first time I received a Japanese mail from a
> non-Japanese speaker.

I *like* the Mew mailing program for Emacs :-)

> I have seen the book at a few bookstores (not translated into Japanese
> yet).  Though I am interested in it, the book is too thick for me...
> (Even if the book were in Japanese, it would be heavy for me.)

I strongly recommend buying it, especially if you are interested in
character sets and encodings.  Of its 1200+ pages (or so), about 900
are mapping tables.


Werner

Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Werner LEMBERG

> 2. Perhaps it is a good point of view to see troff (gtroff) as an
> engine which handles _glyphs_, not characters, in a given context of
> typographic style and layout. The current glyph is defined by the
> current point size, the current font, and the name of the
> "character" which is to be rendered, and troff necessarily takes
> account of the metric information associated with this glyph.

Exactly.  But the current terminology in gtroff is more than
ambiguous, and I believe that we need a clear separation between
characters and glyphs.

> Logically, therefore, troff could be "neutral" about what the byte
> "a" stands for. From that point of view, a troff which makes no
> assumptions of this kind, and which consults external tables about
> the meaning of its input and about the characteristics of what
> output that input implies, purely for the purpose of correct
> formatting, is perhaps the pure ideal. And from that point of view,
> therefore, unifying the input conventions on the basis of a
> comprehensive encoding (such as UTF-8 or Unicode is intended to
> become) would be a great step towards attaining this neutrality.

I fully agree.  A single input character set (as universal as
possible) is the right thing, and everything else should be handled by
preprocessors (and a postprocessor for tty).
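As a rough illustration of that separation (a sketch only, not gtroff's
actual data structures; the table contents, widths, and function names
below are made up), the formatter would stay neutral about its input and
simply consult a table that maps input characters, identified by code
point, to named glyph entries carrying the metric information:

  #include <cstdio>
  #include <map>
  #include <string>

  // Hypothetical glyph entry: the name used by the output device plus
  // the metric the formatter needs for spacing and line breaking.
  struct glyph_entry {
    std::string name;   // e.g. "co" for the copyright sign
    int width;          // width in machine units for the current font/size
  };

  // A character -> glyph table for one font; a real formatter would load
  // this from the device and font description files, not hard-code it.
  static const std::map<char32_t, glyph_entry> glyph_table = {
    { U'\u00A9', { "co", 24 } },   // COPYRIGHT SIGN
    { U'\u2014', { "em", 36 } },   // EM DASH
    { U'a',      { "a",  12 } },   // LATIN SMALL LETTER A
  };

  // Look up the glyph for an input character; what the character "means"
  // is decided entirely by the table, not by the formatter itself.
  const glyph_entry *lookup_glyph(char32_t c) {
    auto it = glyph_table.find(c);
    return it == glyph_table.end() ? nullptr : &it->second;
  }

  int main() {
    if (const glyph_entry *g = lookup_glyph(U'\u00A9'))
      std::printf("glyph '%s', width %d\n", g->name.c_str(), g->width);
  }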

> Meanwhile, interested parties who have not yet studied it may find
> the "UTF-8 and Unicode FAQ for Unix/Linux" by Markus Kuhn well worth
> reading:
> 
>   http://www.cl.cam.ac.uk/~mgk25/unicode.html

Yes, Markus is doing an excellent job.

> By the way, your comment that hyphenation, for instance, is not a
> "glyph question" is, I think, not wholly correct. Certainly,
> hyphenation _rules_ are not a glyph question: as well as being
> language-dependent, there may also be "house rules" about it; these
> come under "typographic style" as above. But the size of a hyphen
> and associated spacing are glyph issues, and these may interact with
> where a hyphenation occurs or whether it occurs at all, according to
> the rules.

I mean the algorithm for finding possible breakpoints, which must be
based on input characters.  The final decision of where a word is
actually broken is of course a glyph issue.
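
To make that split concrete (purely illustrative function names and stub
logic; real hyphenation patterns and glyph metrics are far more
involved): the candidate breakpoints can be computed from the character
string alone, while choosing among them needs the glyph widths of the
current font and size.

  #include <cstdio>
  #include <string>
  #include <vector>

  // Character level: possible break positions inside a word.  A real
  // implementation applies language-dependent hyphenation patterns;
  // this stub merely allows a break after every second letter.
  std::vector<std::size_t> candidate_breakpoints(const std::string &word) {
    std::vector<std::size_t> pos;
    for (std::size_t i = 2; i + 2 < word.size(); i += 2)
      pos.push_back(i);
    return pos;
  }

  // Glyph level: pick the last candidate whose prefix, plus a hyphen
  // glyph, still fits into the space left on the line.  The widths come
  // from the glyph metrics of the current font and point size; here
  // every glyph gets the same width to keep the sketch short.
  std::size_t choose_breakpoint(const std::vector<std::size_t> &candidates,
                                int space_left, int glyph_width,
                                int hyphen_width) {
    std::size_t best = 0;            // 0 means: do not break at all
    for (std::size_t p : candidates)
      if (static_cast<int>(p) * glyph_width + hyphen_width <= space_left)
        best = p;
    return best;
  }

  int main() {
    std::string word = "hyphenation";
    std::size_t p = choose_breakpoint(candidate_breakpoints(word), 60, 10, 6);
    std::printf("break '%s' after %zu characters\n", word.c_str(), p);
  }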


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-22 Thread Tomohiro KUBOTA
Hi,

At Sat, 21 Oct 2000 10:46:51 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:

> In general.  I want to define terms completely independent of any
> particular program.  We have
> 
>   character set
>   character encoding
>   glyph set
>   glyph encoding

I understand.  Since we are discussing the preprocessor, let's
concentrate on characters, not glyphs.  I think you will now agree to
specify the 'character set/encoding' with a single word such as
'EUC-JP' instead of a pair like 'JIS-X-0208' and 'EUC'.

BTW, I am implementing the preprocessor.  It now has the following
features:
 - input from standard input (stdin)
 - output to standard output (stdout)
 - an I18N directive to support a locale-sensible mode
 - a hard-coded converter from Latin1, EBCDIC, and UTF-8 to UTF-8
 - a locale-sensible converter from any encoding supported by the OS to
   UTF-8 (note: UTF-8 has to be supported by iconv(3))
 - the input encoding is determined by a command-line option or a default
 - the default is 'latin1' when compiled without I18N, or locale-sensible
   when compiled with I18N
However, I still have to implement:
 - determining the encoding also from a '-*- ... -*-' directive in the
   roff source
 - (I18N mode) specifying the encoding by MIME-style and Emacs-style
   names
 - better memory and CPU efficiency (not considered yet)
 - input from files besides stdin

I will send the source soon.
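
In the meantime, here is a minimal sketch of the iconv(3)-based
conversion core such a preprocessor could be built around (an
illustration only, not the source announced above; the program name,
the option handling, the I18N directive, and the hard-coded converters
are left out, and error handling is reduced to the bare minimum):

  #include <cerrno>
  #include <cstddef>
  #include <cstdio>
  #include <cstring>
  #include <iconv.h>

  // Convert stdin from the named encoding to UTF-8 on stdout.
  // Usage: preconv-sketch [input-encoding]     (default: latin1)
  int main(int argc, char **argv) {
    const char *from = argc > 1 ? argv[1] : "latin1";
    iconv_t cd = iconv_open("UTF-8", from);
    if (cd == (iconv_t)-1) {
      std::perror("iconv_open");
      return 1;
    }

    char inbuf[4096];
    char outbuf[4 * sizeof inbuf];   // UTF-8 output can be larger than input
    std::size_t inleft = 0;

    for (;;) {
      std::size_t n = std::fread(inbuf + inleft, 1,
                                 sizeof inbuf - inleft, stdin);
      if (n == 0 && inleft == 0)
        break;
      inleft += n;

      char *inp = inbuf;
      char *outp = outbuf;
      std::size_t outleft = sizeof outbuf;
      if (iconv(cd, &inp, &inleft, &outp, &outleft) == (std::size_t)-1
          && errno != EINVAL) {      // EINVAL: multibyte sequence cut off
        std::perror("iconv");
        return 1;
      }
      std::fwrite(outbuf, 1, sizeof outbuf - outleft, stdout);

      // Keep an incomplete trailing sequence for the next read.
      std::memmove(inbuf, inp, inleft);
      if (n == 0)                    // EOF reached with leftover bytes
        break;
    }
    iconv_close(cd);
    return 0;
  }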


> >roff source in any encoding like '\(co'  (character)
> >   |
> >   |  preprocessor
> >   V
> >UTF-8 stream like u+00a9  (character)
> >   |
> >   |  troff
> >   V
> >glyph expression like 'co'  (glyph)
> >   |
> >   |  troff (continuing)
> >   V
>
> A step is missing here:
>
> typeset output  (glyph)
>    |
>    |  grotty
>    V
>
> >UTF-8 stream like u+00a9 or '(C)'  (character)
> >   |
> >   |  postprocessor
> >   V
> >formatted text in any encoding  (character)


I understand well.  Thank you for your explanation.
BTW, besides TTY output, HTML will also need a postprocessing step from
glyphs back to characters, like 'grotty' in tty mode, since HTML is a
text file.  I think the encoding for HTML can always be UTF-8.  We can
add a line such as

  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

between <head> and </head>.
(I found code in grohtml.cc that writes this line, but without a charset
directive.)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/