Re: [Groff] Re: groff: radical re-implementation

2000-10-20 Thread Werner LEMBERG

> > This is not true.  Encoding does *not* imply the character set.
> > You are talking about charset/encoding tags.
> 
> Hmm, I cannot understand your idea...
> 
> I intend to mean
>  - character set: CCS (Coded Character Set) in RFC 2130
>  - encoding: CES (Character Encoding Scheme) in RFC 2130

First of all:  We both mean the same thing, and we agree on how to
handle the problem in groff.  I'm only arguing about technical terms.

Another try.

Consider a PostScript font with its encoding vector.  You have a
single glyph set which can map to multiple encodings.  My intention is
to use the terms `set' and `encoding' consistently -- I want to
avoid having to use different words when we are talking about glyphs
instead of characters.
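
To make this concrete in groff's own terms, here is a hypothetical
font file fragment (field layout as in groff_font(5): name, metrics,
type, code; all numbers invented):

   charset
   co      600,700  0     0251

A re-encoded variant of the same font would keep this line identical
except for the code column -- the glyph `co' stays the same, only the
encoding changes.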


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-20 Thread Werner LEMBERG

> > The same exists for Japanese and Chinese, especially for vertical
> > writing.
> 
> I think *ideograms* have fixed width everywhere.

Well, maybe.  But sometimes there is kerning.  Please consult Ken
Lunde's `CJKV Information Processing' for details.  Example:

   〇
   一
   〇


Werner

Re: [Groff] Re: groff: radical re-implementation

2000-10-20 Thread Tomohiro KUBOTA
Hi,

At Fri, 20 Oct 2000 14:45:51 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:

> First of all:  We both mean the same thing, and we agree on how to
> handle the problem in groff.  I'm only arguing about technical terms.
> 
> Another try.
> 
> Consider a PostScript font with its encoding vector.  You have a
> single glyph set which can map to multiple encodings.  My intention is
> to use the terms `set' and `encoding' consistently -- I want to
> avoid having to use different words when we are talking about glyphs
> instead of characters.

I see now that I have been confused.  Let me confirm a few points:

1. Are your 'charset' and 'encoding' for troff or for the preprocessor?
   I thought both of them were for the preprocessor, which figures out
   from this information how to convert the input to UTF-8.
2. Will the pre/postprocessors handle characters or glyphs?  Or is it
   meaningless to ask whether what the pre/postprocessors handle is a
   character or a glyph, since they work with concrete encodings such
   as Latin-1 and UTF-8?  (If the implementation is not affected, it
   seems meaningless to ask whether Latin-1, UTF-8, and so on are
   codes for characters or for glyphs.)
3. Is your 'charset' for glyphs and your 'encoding' for characters?
   I thought both of them were for characters, since I thought both
   of them were for the preprocessor.
4. I thought we were discussing (tags in the roff source for) the
   preprocessor.  Is that right?



Is this chart right (for tty)?


   roff source in any encoding, like '\(co'    (character)
        |
        |  preprocessor
        V
   UTF-8 stream, like u+00a9                   (character)
        |
        |  troff
        V
   glyph expression, like 'co'                 (glyph)
        |
        |  troff (continuing)
        V
   UTF-8 stream, like u+00a9 or '(C)'          (character)
        |
        |  postprocessor
        V
   formatted text in any encoding              (character)
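
By the way, if I understand the current devices correctly, the lower
half of the chart can already be observed with today's groff (page
output trimmed to its first line):

   $ echo '\(co 2000' | groff -Tlatin1 | sed -n 1p
   © 2000
   $ echo '\(co 2000' | groff -Tascii | sed -n 1p
   (C) 2000

Here '\(co' is the input character, 'co' the glyph name inside troff,
and the device-specific rendering is the byte 0xa9 or the string '(C)'.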


---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/




Re: [Groff] Re: groff: radical re-implementation

2000-10-20 Thread T. Kurt Bond
Werner LEMBERG writes:
> The `-a' option is almost useless today IMHO.  It will show a tty
> approximation of the typeset output:
> 
>   groff -a -man -Tdvi troff.man | less
> 
> It is *not* the right way to quickly select an ASCII device.  To
> override the used macros for the output character set we need a new
> option.
> 
> Using `-a' is comparable to dvi2tty or similar converters.

Exactly.  Sometimes `-a' is explained as producing "ascii" output, but
I think it's better understood as producing "approximate" output.  It
is essentially useful only for debugging and for temporary runs where
the side effects (.write output, .tm output) are more important than
having actual properly formatted pages for output.  In those cases
it's very useful, but in normal cases it's of no use at all.
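
For example, in a run like the following only the `.tm' diagnostics
on stderr matter, and the approximate page output can simply be
thrown away (a contrived sketch; `check.tr' is a made-up file):

   $ cat check.tr
   .tm this message goes to stderr
   Some body text.
   $ groff -a check.tr > /dev/null
   this message goes to stderr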

Would it be useful to add a note to the texinfo documentation
explaining that `-a' should only be used in these situations?
-- 
T. Kurt Bond, [EMAIL PROTECTED]




Re: [Groff] Re: groff: radical re-implementation

2000-10-20 Thread Ted Harding
On 17-Oct-00 Werner LEMBERG wrote:
> Well, I insist that GNU troff doesn't support multi-byte encodings at
> all :-) troff itself should work on a glyph basis only.  It has to
> work with *glyph names*, be it CJK entities or whatever.  Currently,
> the conversion from input encoding to glyph entities and the further
> processing of glyphs is not clearly separated.  From a modular point
> of view it makes sense if troff itself is restricted to a single input
> encoding (UTF-8) which is basically only meant as a wrapper to glyph
> names (cf. \U'' to enter Unicode encoded characters).  Everything
> else should be moved to a preprocessor.

Hi Folks,

I have now managed to read this correspondence (16-20 Oct) and think
about it for a bit. Sorry to have been obliged to leave it on one
side until this week ended.

On the whole I go with the view that Werner has expressed, here and
in other mails. Also, I think that some people are not clear about
the distinction between "character" and "glyph" (not sure that I am,
in all cases, come to that ... ).

I would like to present an even more conservative view than Werner
has stated. And, by the way, in the following "troff" means the
main formatting program "gtroff". "groff" denotes the whole package.

A.1. At present troff accepts 8-bit input, i.e. recognises 256 distinct
entities in the input stream (with a small number of exceptions which
are "illegal").

It does not really matter that these are interpreted, by default, as
iso-latin-1. They could correspond to anything on your screen when you
are typing, and you can set up translation macros in troff to make them
correspond to anything else (using the ".char" request, the
traditional ".tr" request, or the new ".trnt" request).

A.2. The direct correspondence between input bytes and characters is
defined in the font files for the device. In addition, groups of bytes
(such as "\[Do]", written here in ASCII) can be made to correspond to
specific characters named in the font files.
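
Schematically (the metrics and code are invented), typing "\[Do]" in
the input selects whatever character the current font file names "Do",
e.g. via a line such as

   Do   500,700   0   044

under its "charset" section.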

A.3. What gets printed or displayed is a "glyph" which is defined by the
current font definition for the device. (Even in English, a character
such as "A" could be printed as a Times-Roman "A" glyph, a Helvetica
BoldItalic "A" glyph, \ ZapfChancery-MediumItalic glyph, ... ).
Troff uses the glyph-metric information in the font file to compute
its formatting.

A.4. Troff is not, and was never intended to be, WYSIWYG.  Its concept
is that you prepare an input stream (using whatever interface pleases
you; if this shows you, say, kanji characters then that's fine,
so long as you don't expect troff to "see" them as kanji) which,
when interpreted by troff, produces printed/displayed output that
bears the marks that you want.  I don't see anything wrong (except
possibly in ease of use) in creating an ASCII input stream in
order to generate Japanese output. Preparation of an output
stream to drive a device capable of rendering the output is
the job of the post-processor (and, provided you have installed
appropriate font definition files, I cannot think of anything
that would be beyond the PostScript device "devps").

A: It follows that troff is already language-independent, for all
languages whose typographic conventions can be achieved by the primitive
mechanisms already present in troff. For such languages, there is no
need to change troff at all.  For some other languages, there are
minor extra requirements that could be met by small extensions to
troff which would not interact with existing mechanisms.

Major exceptions to language-independence, at present, include all
the "right-to-left" languages (Hebrew, Arabic, ... ).  I have been
studying Dan Berry's implementation of "ffortid" ["ditroff" backwards]
which is a post-processor that allows right-to-left text to be
correctly printed. I believe that a port to groff is quite feasible.

Dan Berry has also done "triroff" [tri-directional troff] for traditional
UNIX troff which can in addition do the top-to-bottom printing for
Chinese etc. To my untutored eye, the results look OK. This could also be
ported to groff.

Extra complications can arise in some languages, such as special
hyphenation rules (as has been mentioned); presence or absence of
particular ligatures [and I think that troff's hard-wired set
of ligatures should be replaced by a user-definable set] (e.g. in
Turkish you never use "fi" ligature since this suppresses the
distinction between dotless-i and i-with-dot); some characters may
not end, or may not begin, a line; some characters have different glyphs
at the beginning, the middle, or the end of words; and so on.
The above are cases where minor extensions of troff are required,
but they do not interact with other features of troff and require
no radical re-implementation.
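
Troff's existing ligature switch already gives the all-or-nothing
version of this:

   .\" ligatures off, e.g. for Turkish, where "fi" must remain
   .\" two glyphs to keep dotless-i distinct from i-with-dot
   .lg 0
   .\" and back on afterwards
   .lg 1

What is hard-wired is the *set* of ligatures, hence the suggestion
above to make it user-definable.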

Some of the complications with specific languages (such as the extra
space separating punctuation marks in French) can be set up on
a language-specific basis by suitable macros, and require no change at
all in troff itself.