Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Tomohiro KUBOTA
Hi,

At Fri, 20 Oct 2000 20:32:17 +0100 (BST),
(Ted Harding) <[EMAIL PROTECTED]> wrote:

> It does not really matter that these are interpreted, by default, as
> iso-latin-1. They could correspond to anything on your screen when you
> are typing, and you can set up translation macros in troff to make them
> correspond to anything else (using either the ".char" request or the
> traditional ".tr" or the new ".trnt" requests).

Since I am not familiar with the internals of troff, I don't know whether
this works well for multibyte encodings such as EUC-*, UTF-8, and so
on.  However, there are also encodings which use shift states with escape
sequences, like ISO-2022-JP, ISO-2022-CN, ISO-2022-INT-1, and so on.
Such encodings cannot be handled by ".char".  Do you have any positive
reason not to support such encodings, even when they can easily be
supported using standard "locale" technology?
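
To make concrete what a "shift state" means, here is a rough sketch of
my own (not groff code) that converts ISO-2022-JP to UTF-8 through the
standard iconv interface.  The escape sequences change how the following
bytes are interpreted, so no per-byte ".char" mapping could do this:

  /* Minimal sketch: convert ISO-2022-JP (a stateful, escape-driven
     encoding) to UTF-8 with the standard iconv interface.  The same
     bytes mean different things depending on the current shift state,
     so the conversion cannot be expressed as a per-byte table. */
  #include <iconv.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      iconv_t cd = iconv_open("UTF-8", "ISO-2022-JP");
      if (cd == (iconv_t)-1) {
          perror("iconv_open");
          return 1;
      }
      /* ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII. */
      char in[] = "\033$B$3$s$K$A$O\033(B";          /* "konnichiwa" */
      char out[64];
      char *inp = in, *outp = out;
      size_t inleft = strlen(in), outleft = sizeof(out);
      if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1)
          perror("iconv");
      fwrite(out, 1, sizeof(out) - outleft, stdout);
      putchar('\n');
      iconv_close(cd);
      return 0;
  }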

We would have to prepare font files for every encoding in the world,
which is inefficient.  Moreover, what would the mechanism be for
choosing the proper font file for the desired encoding?

I think the answer is "locale" technology.  Groff can check the LC_CTYPE
category.  Thus, all a user has to do is set the LANG, LC_CTYPE, or
LC_ALL environment variable, and the various pieces of software will
work in the needed encoding.
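
As a rough sketch of what I mean (my own illustration, not code from any
existing program): once LC_CTYPE is taken from the environment, the same
few lines can decode EUC-JP, UTF-8, or any other encoding the OS knows
about, without the program hard-coding any of them.

  /* Sketch: decode whatever multibyte encoding LC_CTYPE selects into
     wide characters, without hard-coding any particular encoding. */
  #include <locale.h>
  #include <stdio.h>
  #include <string.h>
  #include <wchar.h>

  int main(void)
  {
      char buf[4096];
      const char *p;
      size_t len, n;
      mbstate_t state;
      wchar_t wc;

      setlocale(LC_CTYPE, "");      /* honour LANG / LC_CTYPE / LC_ALL */
      memset(&state, 0, sizeof state);
      len = fread(buf, 1, sizeof buf, stdin);
      p = buf;
      while (len > 0) {
          n = mbrtowc(&wc, p, len, &state);
          if (n == (size_t)-1 || n == (size_t)-2)
              break;                /* invalid or incomplete sequence */
          if (n == 0)
              n = 1;                /* embedded NUL byte */
          /* Printing wchar_t as a code point assumes __STDC_ISO_10646__. */
          printf("U+%04lX\n", (unsigned long)wc);
          p += n;
          len -= n;
      }
      return 0;
  }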

However, I am interested in how Groff 1.16 handles UTF-8 input.
I could not find any code for UTF-8 input, though I found code for
UTF-8 output in src/devices/grotty/tty.cc.  Am I missing something?
(Of course /font/devutf8/* has no implementation of UTF-8 encoding,
though it does seem to have a table mapping glyph names to UCS-2.)


> B: Troff should be able to cope with multi-lingual documents, where
> several different languages occur in the same document. I do NOT
> believe that the right way to do this is to extend troff's capacity
> to recognise thousands of different input encodings covering all the
> languages which it might be called upon to typeset (e.g. by Unicode or
> the like).

This confuses language with encoding.  I think UTF-8 support
can and should be achieved via locale technology.  By using locale
technology, software can be written in an encoding-independent way,
and it can then support any encoding, including UTF-8.  Why should
we hard-code UTF-8 when we can use locale technology to support any
encoding, UTF-8 included?

I agree that we should not extend troff to recognize thousands of
encodings if we had to hard-code every one of them.  But we don't.
Again, do you have any positive reason not to support encodings other
than UTF-8?

We have many systems using encodings such as EUC-* and ISO-2022-*.
Some systems will migrate to UTF-8 soon.  Some will migrate later.
Users of UTF-8 and of EUC-* may share a system.  Some users will never
migrate to UTF-8.  Groff should support all of them.  I think you would
also have trouble if you could not use 8-bit encodings.


> C: Error messages and similar communications with the user (which
> have nothing directly to do with troff's real job) are irrelevant to
> the question of revising groff. If people would like these to appear in
> their own language then I'm sure it can be arranged in a way which
> would require no change whatever in the fundamental workings of troff.

This is a different topic.  You are right.  We can use gettext to
handle message translation without a radical re-implementation of troff
or of the Groff system.  However, IMO, this topic is much less important.

You were reminded of this topic because I used the word "locale", weren't you?
True, locales are also used to change the language of messages
(the LC_MESSAGES category).  However, what I am discussing is the
LC_CTYPE category.

Supporting "locale" technology (especially LC_CTYPE category) is
important so that Groff can work well with other softwares.


P.S. to CHOI Junho:
I think you'd better subscribe [EMAIL PROTECTED] or you may lose messages.
I'd like you not to leave this discussion because you are also a
multibyte-language speaker.

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/




Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Tomohiro KUBOTA
Hi,

At Fri, 20 Oct 2000 14:14:44 +0200 (CEST),
Werner LEMBERG <[EMAIL PROTECTED]> wrote:

> > I think *ideograms* have fixed width everywhere.
> 
> Well, maybe.  But sometimes there is kerning.  Please consult Ken
> Lunde's `CJKV Information Processing' for details.  Example:
> 
>〇
>一
>〇

Wow!  This is the first time I have received Japanese mail from a
non-Japanese speaker.

Everyone, please notice the headers
> Content-Type: Text/Plain; charset=iso-2022-jp
> Content-Transfer-Encoding: 7bit
in Werner's mail.  (This mail will carry the same headers,
since I have quoted the Japanese characters.)

BTW, you may be right, since I am not an expert on typesetting.
However, I prefer fixed-width ideograms even in such cases.
I have never seen any printed matter with non-fixed-width
ideograms, even in such cases.

I have seen the book at a few bookstores (it has not been translated
into Japanese yet).  Though I am interested in it, the book is too
thick for me...  (Even if it were in Japanese, it would still be heavy
for me.)

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/

Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Ted Harding
On 21-Oct-00 Tomohiro KUBOTA wrote:
> Hi,
> 
> At Fri, 20 Oct 2000 20:32:17 +0100 (BST),
> (Ted Harding) <[EMAIL PROTECTED]> wrote:
> 
>> B: Troff should be able to cope with multi-lingual documents, where
>> several different languages occur in the same document. I do NOT
>> believe that the right way to do this is to extend troff's capacity
>> to recognise thousands of different input encodings covering all the
>> languages which it might be called upon to typeset (e.g. by Unicode or
>> the like).
> 
> This confuses language with encoding.  I think UTF-8 support
> can and should be achieved via locale technology.  By using locale
> technology, software can be written in an encoding-independent way,
> and it can then support any encoding, including UTF-8.  Why should
> we hard-code UTF-8 when we can use locale technology to support any
> encoding, UTF-8 included?

To bring the argument I am trying to present face-to-face with the
point you seem to be trying to make:

Someone writing a document about Middle Eastern and related literatures
may wish to use the Arabic, Persian, Hebrew, Turkish (all of which have
different scripts), and also various Central Asian languages (such as
Turkmen) which are often written in Cyrillic or in a variant of Cyrillic.

Can you explain how this could be handled "via locale technology"
in a single document?

Even in its present state, groff can handle such material in a
straightforward way (though with extra work from the user in order to
make the right-to-left languages work correctly -- the basic method would
be to get it formatted correctly with these languages input and printed as
left-to-right, and then edit the input file, using the formatted output
for reference, to reverse the input sequences on a line-by-line basis).

[One of the dangers which I fear, if groff were re-structured on a
"locale" basis or similar mechanism, is that its flexibility, indeed in
principle its universality, would be compromised and limited by the
constraints of that mechanism. It is perhaps not recognised widely
enough that groff, in its present state, is capable of being greatly
extended -- by means of user-defined macros, preprocessors, and
post-processors -- without fundamental change to troff.]

With best wishes,
Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 284 7749
Date: 21-Oct-00   Time: 15:39:24
-- XFMail --




Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Tomohiro KUBOTA
Hi,

At Sat, 21 Oct 2000 15:39:24 +0100 (BST),
(Ted Harding) <[EMAIL PROTECTED]> wrote:

> Someone writing a document about Middle Eastern and related literatures
> may wish to use the Arabic, Persian, Hebrew, Turkish (all of which have
> different scripts), and also various Central Asian languages (such as
> Turkmen) which are often written in Cyrillic or in a variant of Cyrillic.
> 
> Can you explain how this could be handled "via locale technology"
> in a single document?

In a UTF-8 or ISO-2022 locale, if the OS supports one.
For example, "en_US.UTF-8"; the "en_US" part can be anything the OS
supports.

Please consult:
  "docs.sun.com: Unicode Support in the Solaris Operating Environment"
http://docs.sun.com/ab2/coll.651.1/SOLUNICOSUPPT/@Ab2TocView?Ab2Lang=C&Ab2Enc=iso-8859-1
and you will see how UTF-8 can be used through "locale" technology.

OK, the preprocessor which I am writing will support both a "locale" mode
and a conventional mode (for compatibility).  The conventional mode will
support Latin-1, EBCDIC, and UTF-8.  Thus, you can use UTF-8 even on
OSes which do not support a UTF-8 locale.
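
Just to illustrate how the two modes might be selected (the option
handling below is made up for the example; my real preprocessor may
differ), the locale mode can simply ask the C library which codeset
LC_CTYPE implies:

  /* Sketch: pick the input encoding either from an explicit option
     ("conventional mode") or from the LC_CTYPE locale ("locale mode"). */
  #include <langinfo.h>
  #include <locale.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      const char *encoding;

      if (argc > 1) {
          /* conventional mode: e.g. "latin1", "ebcdic", "utf8" */
          encoding = argv[1];
      } else {
          /* locale mode: ask the OS what LC_CTYPE means */
          setlocale(LC_CTYPE, "");
          encoding = nl_langinfo(CODESET);
      }
      printf("input will be interpreted as %s\n", encoding);
      return 0;
  }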


> [One of the dangers which I fear, if groff were re-structured on a
> "locale" basis or similar mechanism, is that its flexibility, indeed in
> principle its universality, would be compromised and limited by the
> constraints of that mechanism. It is perhaps not recognised widely
> enough that groff, in its present state, is capable of being greatly
> extended -- by means of user-defined macros, preprocessors, and
> post-processors -- without fundamental change to troff.]

Can you write a macro which enables locale-sensitive file/tty I/O?

---
Tomohiro KUBOTA <[EMAIL PROTECTED]>
http://surfchem0.riken.go.jp/~kubota/




Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Werner LEMBERG

> Would it be useful to add to the texinfo documentation a note
> explaining that `-a' should only be used for these situations?

I've added some words, thanks.


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Werner LEMBERG

> A.1. At present troff accepts 8-bit input, i.e. recognises 256
> distinct entities in the input stream (with a small number of
> exceptions which are "illegal").

We need at least 20 bits (for the Unicode BMP plus surrogates) and the
special characters.  A 32-bit wide number is thus the right choice IMHO.
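
As a sketch of what I have in mind (the flag-bit layout below is only an
illustration, not a committed design), a 32-bit value leaves room for
every Unicode code point plus groff's own special entities:

  /* Sketch: a 32-bit input entity.  The low 21 bits hold a Unicode
     scalar value; a flag bit above them marks groff-internal special
     characters that have no Unicode meaning. */
  #include <assert.h>
  #include <stdint.h>

  typedef uint32_t input_char;

  #define SPECIAL_FLAG  (UINT32_C(1) << 31)

  static input_char from_unicode(uint32_t ucs)  { return ucs; }
  static input_char make_special(uint32_t id)   { return id | SPECIAL_FLAG; }
  static int        is_special(input_char c)    { return (c & SPECIAL_FLAG) != 0; }

  int main(void)
  {
      assert(!is_special(from_unicode(0x00A9)));  /* COPYRIGHT SIGN */
      assert(is_special(make_special(42)));       /* some internal entity */
      return 0;
  }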

> It does not really matter that these are interpreted, by default, as
> iso-latin-1.

I plan to remove the hard-coded `charXXX' values, moving them to macro
files.

> A.2. The direct correspondence between input bytes and characters is
> defined in the font files for the device.

But this isn't the right place.  Input character stuff should not be
there at all.

> I don't see anything wrong (except possibly in ease of use) in
> creating an ASCII input stream in order to generate Japanese output.

Not everything can or should be handled on the glyph level;
hyphenation, for example.

> Preparation of an output stream to drive a device capable of
> rendering the output is the job of the post-processor (and, provided
> you have installed appropriate font definition files, I cannot think
> of anything that would be beyond the PostScript device "devps").

As mentioned in another mail, we have to extend the metric directives
to cope with the many CJK characters without making troff too slow.

> A: It follows that troff is already language-independent, for all
> languages whose typographic conventions can be achieved by the
> primitive mechanisms already present in troff. For such languages,
> there is no need to change troff at all. For some other languages,
> there are minor extra requirements which would require small
> extensions to troff which would not interact with existing
> mechanisms.

Correct.  The changes we are discussing only affect the input
character level and not the troff engine itself (except for additional
typesetting features for CJK and possibly other languages).

> I think that troff's hard-wired set of ligatures should be replaced
> by a user-definable set.

Definitely.

> Some characters have different glyphs at the beginning, the middle,
> or the end of words; and so on.

Usually, such changes involve contextual analysis, which I won't
implement.  If a preprocessor does this, it has to send glyph entities
directly to troff, so this isn't a problem.

> B: Troff should be able to cope with multi-lingual documents, where
> several different languages occur in the same document. I do NOT
> believe that the right way to do this is to extend troff's capacity
> to recognise thousands of different input encodings covering all the
> languages which it might be called upon to typeset (e.g. by Unicode
> or the like).

This is done by a preprocessor and is not visible to troff itself.  troff
will see Unicode only.

> Troff's multi-character naming convention means that anything you
> could possibly need can be defined, and given a name in the troff
> input "character set" whenever you really need it, so long as you
> have the device resources to render the appropriate glyph.

There are only 256 `multi-characters' named `charXXX'.  Everything
else is a glyph entity (even if it behaves like a character in most
cases).  The reality is that groff doesn't really distinguish between
a character and a glyph, and implementing this distinction has high
priority for me.  I'll probably start by renaming a lot of troff
internals.

> If you want to use a multi-byte encoding in your input-preparation
> software, you can pre-process this with a suitable filter to
> generate the troff input-sequences you need (I have done this with
> WordPerfect multinational characters, for instance, which are
> two-byte entities).

This filter will be the yet-to-come preprocessor.
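
A toy version of such a filter might look like this (the "\[uXXXX]"
output convention is only an assumption for the example; the real
preprocessor's output form is still to be decided):

  /* Toy filter: copy ASCII through and turn every other UTF-8 sequence
     into a named troff entity of the (assumed) form \[uXXXX]. */
  #include <stdio.h>

  int main(void)
  {
      int c;
      while ((c = getchar()) != EOF) {
          int extra, i;
          unsigned long ucs;
          if (c < 0x80) {               /* plain ASCII: copy through */
              putchar(c);
              continue;
          }
          /* Number of continuation bytes, from the lead byte. */
          extra = (c & 0xE0) == 0xC0 ? 1
                : (c & 0xF0) == 0xE0 ? 2
                : (c & 0xF8) == 0xF0 ? 3 : -1;
          if (extra < 0) {
              fprintf(stderr, "invalid UTF-8 lead byte\n");
              return 1;
          }
          ucs = c & (0x3F >> extra);    /* payload bits of the lead byte */
          for (i = 0; i < extra; i++) {
              int cont = getchar();
              if (cont == EOF || (cont & 0xC0) != 0x80) {
                  fprintf(stderr, "truncated UTF-8 sequence\n");
                  return 1;
              }
              ucs = (ucs << 6) | (cont & 0x3F);
          }
          printf("\\[u%04lX]", ucs);
      }
      return 0;
  }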

> CONCLUSION: Troff certainly needs some extensions to cope with the
> typesetting demands of some languages (of which the major ones that
> I can think of have been mentioned above). I also believe that there
> are some features of troff which need to be changed in any case, but
> these have nothing to do with language or "locale".

Locale support only affects the pre- and postprocessors.


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Werner LEMBERG

> 1. Your 'charset' and 'encoding' are for troff or for the preprocessor?

In general.  I want to define the terms completely independently of any
particular program.  We have

  character set
  character encoding
  glyph set
  glyph encoding

>I thought both of them were for the preprocessor.  The preprocessor
>figures out from that information how to convert the input to
>UTF-8.

A groff preprocessor will work as you have described.  Under the
assumption that you are talking about input characters, the term
`encoding' indeed implies the character set(s).  After some thinking I
have to correct myself: It is better to say that `EUC' is an `encoding
scheme' which describes which character ranges and how many bytes are
used.  Sorry for the confusion.
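
For example (byte ranges as in the usual EUC-JP definition; this is only
an illustration of the term, not a validator), the EUC-JP encoding
scheme can be summarised by how the first byte of a sequence determines
the character set and the sequence length:

  /* Illustration: in EUC-JP, the first byte alone tells you which
     character set a sequence belongs to and how many bytes it takes. */
  #include <stdio.h>

  static int euc_jp_length(unsigned char first)
  {
      if (first < 0x80)  return 1;      /* ASCII / JIS X 0201 Roman */
      if (first == 0x8E) return 2;      /* SS2: halfwidth katakana  */
      if (first == 0x8F) return 3;      /* SS3: JIS X 0212          */
      if (first >= 0xA1 && first <= 0xFE)
          return 2;                     /* JIS X 0208               */
      return -1;                        /* not a valid lead byte    */
  }

  int main(void)
  {
      printf("%d %d %d\n",
             euc_jp_length('A'),        /* 1 */
             euc_jp_length(0xA4),       /* 2 (hiragana row)         */
             euc_jp_length(0x8F));      /* 3 */
      return 0;
  }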

> 2. Which will the pre/postprocessors handle, characters or glyphs?

The preprocessor converts from characters to characters (i.e. to
Unicode), grotty + postprocessor convert glyph names back to Unicode
characters (using a hard-coded table), then from characters to
characters.  I don't know yet whether it makes sense to unify the
latter two programs.
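
The hard-coded table mentioned above could be as simple as the following
sketch (the three entries are just examples of existing groff glyph
names; the real table would of course be much larger):

  /* Sketch of a glyph-name -> Unicode table for the output side. */
  #include <stdio.h>
  #include <string.h>

  struct glyph_to_unicode {
      const char *glyph_name;
      unsigned long ucs;
  };

  static const struct glyph_to_unicode table[] = {
      { "co", 0x00A9 },   /* copyright sign */
      { "bu", 0x2022 },   /* bullet         */
      { "em", 0x2014 },   /* em dash        */
  };

  static long glyph_lookup(const char *name)
  {
      size_t i;
      for (i = 0; i < sizeof table / sizeof table[0]; i++)
          if (strcmp(table[i].glyph_name, name) == 0)
              return (long)table[i].ucs;
      return -1;          /* unknown glyph name */
  }

  int main(void)
  {
      long u = glyph_lookup("co");
      if (u >= 0)
          printf("co -> U+%04lX\n", (unsigned long)u);
      return 0;
  }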

> 3. Your 'charset' is for glyphs and 'encoding' is for characters?
>I thought both of them were for characters, since I thought both
>of them were for the preprocessor.

My point was to make the distinction between `set' and `encoding'
clear.  Maybe it is only of academic interest, but it (hopefully)
helps to clarify the terms we are using.

> 4. I thought we were discussing (tags in roff source for) the
>preprocessor.  Is that right?

Yes.

>roff source in any encoding like '\(co' (character)
>   |
>   |  preprocessor
>   V
>UTF-8 stream like u+00a9 (character)
>   |
>   |  troff
>   V
>glyph expression like 'co'  (glyph)
>   |
>   |  troff (continuing)
>   V

A step is missing here:

 typeset output  (glyph)
    |
    |  grotty
    V

>UTF-8 stream like u+00a9 or '(C)'   (character)
>   |
>   |  postprocessor
>   V
>formatted text in any encoding  (character)


Werner




Re: [Groff] Re: groff: radical re-implementation

2000-10-21 Thread Ted Harding
Hi Werner (and all)

Thanks for this clarifying explanation. I have a couple of comments:
one is explanatory; the other, I think, may point to the core
of the question.

On 21-Oct-00 Werner LEMBERG wrote:
>> Troff's multi-character naming convention means that anything you
>> could possibly need can be defined, and given a name in the troff
>> input "character set" whenever you really need it, so long as you
>> have the device resources to render the appropriate glyph.
> 
> There are only 256 `multi-characters' named `charXXX'.  Everything
> else is a glyph entity (even if it behaves like a character in most
> cases).  The reality is that groff doesn't really distinguish between
> a character and a glyph, and implementing this distinction has high
> priority for me.  I'll probably start by renaming a lot of troff
> internals.

1. Perhaps I should clarify: by "multi-character naming convention"
I mean the fact that you can decide to use a sequence of ASCII
characters, for instance "\[O-ogonek]", as the name of a "character".

In passing: I see no _logical_ distinction between using a string
of ASCII characters to name a "character", and using a string of
bytes which implements a UTF-8 encoding.

2. Perhaps it is a good point of view to see troff (gtroff) as an
engine which handles _glyphs_, not characters, in a given context of
typographic style and layout. The current glyph is defined by the current
point size, the current font, and the name of the "character" which is to
be rendered, and troff necessarily takes account of the metric information
associated with this glyph.

The fact that ASCII characters and the iso-latin-1 characters
corresponding to byte-values > 128 are (by default) the troff names of
"characters" in a group of European languages -- together with certain
other marks and symbols -- is logically (in my view) an irrelevant
coincidence which happens to be very convenient for people using these
languages; but it is not at all necessary. Nothing at all stops you from
defining

  .char a \(*a

as the name of Greek "alpha", and so on, if you want to simplify the typing
of input in a passage of Greek using an ASCII interface.

Logically, therefore, troff could be "neutral" about what the byte "a"
stands for. From that point of view, a troff which makes no assumptions
of this kind, and which consults external tables about the meaning of
its input and about the characteristics of what output that input
implies, purely for the purpose of correct formatting, is perhaps the
pure ideal. And from that point of view, therefore, unifying the input
conventions on the basis of a comprehensive encoding (such as UTF-8
or Unicode is intended to become) would be a great step towards
attaining this neutrality.

However, I wish to think more about this issue.

Meanwhile, interested parties who have not yet studied it may find
the "UTF-8 and Unicode FAQ for Unix/Linux" by Markus Kuhn well worth
reading:

  http://www.cl.cam.ac.uk/~mgk25/unicode.html

By the way, your comment that hyphenation, for instance, is not a "glyph
question" is, I think, not wholly correct. Certainly, hyphenation _rules_
are not a glyph question: as well as being language-dependent, there may
also be "house rules" about it; these come under "typographic style" as
above. But the size of a hyphen and associated spacing are glyph issues,
and these may interact with where a hyphenation occurs or whether it
occurs at all, according to the rules.

An interesting debate!

Ted.


E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 284 7749
Date: 21-Oct-00   Time: 23:47:03
-- XFMail --