KO> The whole difference is that, just as any Arab understands that various
KO> combinations of glyphs are still separate characters,
KO> it's kinda hard to persuade me at least that é is more than one character. The
KO> whole point about combining characters (at least accents for Roman alphabets,
KO> Cyrillic and whatever else) is that they combine with the main
KO> character.

The main point of combining characters in Unicode is the notion that
there is no sense in encoding every possible combination of accents on
Latin characters that has ever appeared in print somewhere. Instead,
treating combining accents as separate characters saves a couple of
thousand codepoints.
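
To illustrate (a quick Python sketch using the standard unicodedata
module; the choice of "x" with a tilde is arbitrary): there is no
precomposed codepoint for that combination, so even canonical
composition has to leave it as a base character plus a combining mark:

    >>> import unicodedata
    >>> s = "x\u0303"                         # "x" + COMBINING TILDE
    >>> unicodedata.normalize("NFC", s)       # no precomposed form to compose into
    'x̃'
    >>> len(unicodedata.normalize("NFC", s))  # still two codepoints
    2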

KO> And it's not true that "such things" should be handled by the
KO> underlying toolkit. They are handled well when it comes to
KO> display, but they typically are handled badly when you implement
KO> your own editor widget. What do you do when you press back-arrow
KO> over something that visually consists of one Unicode character?
KO> Two Unicode characters? Five Unicode characters? Say you're
KO> backspacing over an "a" with a vector accent. What do you expect
KO> to happen?

Yes. But that's life. You have to face that when you write an editor
widget for a complex script like Arabic, with cursor placement,
deletion of vowel marks and the like. The only difference is that
people sooner or later come to realize that with the advent of freely
combining accents, Latin script processing no longer adheres to
the simple one-character-per-display-cell model that we've inherited
from text terminals - an assumption that has never worked for complex
scripts anyway.

The added complexity in an editor widget is marginal for Latin
scripts:

- on deletion, delete everything backwards up to and including the
first non-combining character,

- on accent insertion, find the first preceding non-combining
character, get a list of all the accents following it, insert your
accent, and resort the list according to some rule.

- on cursor movement, always move the editing point to the next
non-combining character.
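
A rough sketch of these three rules in Python (the function names and
the "string plus cursor index" convention are just illustrative
assumptions; a real widget works on its own buffer structures, and
full-blown grapheme segmentation as per the Unicode guidelines handles
more cases):

    import unicodedata

    def is_combining(ch):
        # non-zero canonical combining class means "combining mark"
        return unicodedata.combining(ch) != 0

    def delete_backwards(text, pos):
        """Backspace: delete backwards from pos up to and including
        the first non-combining character."""
        start = pos
        while start > 0 and is_combining(text[start - 1]):
            start -= 1
        if start > 0:
            start -= 1                        # include the base character
        return text[:start] + text[pos:], start

    def insert_accent(text, pos, accent):
        """Accent insertion: find the preceding non-combining character,
        add the accent to its mark sequence, and let normalization
        re-sort the marks into canonical order."""
        base = pos
        while base > 0 and is_combining(text[base - 1]):
            base -= 1
        start = base - 1 if base > 0 else base    # start of the cluster
        cluster = unicodedata.normalize("NFD", text[start:pos] + accent)
        return text[:start] + cluster + text[pos:]

    def move_left(text, pos):
        """Cursor movement: jump to the previous non-combining character."""
        pos -= 1
        while pos > 0 and is_combining(text[pos]):
            pos -= 1
        return max(pos, 0)

Forward deletion and right-arrow movement are symmetric.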

Remember that this work has been done already; for example, there are
GPL'ed toolkits such as Pango (http://www.pango.org) that deal with
just this: multi-lingual, Unicode-conformant text editing and output.
Note that there are already some suggestions for this in the Unicode
standard itself, in the Implementation Guidelines
(http://www.unicode.org/uni2book/ch05.pdf). The online version of the
standard is http://www.unicode.org/uni2book/ (3.0) plus the diffs in
3.1 (http://www.unicode.org/reports/tr27/) and 3.2
(http://www.unicode.org/reports/tr28/). Because all this is a bit
painful to read with all the diffs, they're currently putting it all
together in a new version (4.0), but this should not be a problem as
upward compatibility is a design principle. And in general, in case of
doubt, the Unicode mailing list people are rather helpful.

KO> What will happen in a "dumb" unicode-based editor?

The vector accent will be removed. This is not really intuitive, but
that's the present state of things with many Unicode editors. Say
you've got this combination:

U+0065 U+0061 U+20D1  (e, a, combining vector accent).

With the cursor between the first two, pressing Delete removes the "a"
and the vector arrow hops over to the "e"; with the cursor between the
last two, pressing Delete makes the vector accent vanish from the "a".
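
In Python terms, the "dumb" behaviour is nothing more than slicing one
codepoint out of the string (a sketch; the hex dump is only there to
make the result visible):

    >>> s = "\u0065\u0061\u20D1"               # e, a, combining vector accent
    >>> [hex(ord(c)) for c in s[:1] + s[2:]]   # forward-delete the "a" at index 1
    ['0x65', '0x20d1']                         # the accent now combines with the "e"
    >>> [hex(ord(c)) for c in s[:2]]           # forward-delete the accent at index 2
    ['0x65', '0x61']                           # plain "ea", the accent is gone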

Note that this is in no way different from how a dumb present-day
editor handles ISO-encoded Arabic vowel marks. The only new thing is
that any good editing widget with Unicode functionality has to provide
some input provisions for Latin, too, not only for other complex
scripts.

KO> So for some languages/scripts a "glyph" and "character" are
KO> assumed to be the same, and for some languages they are not
KO> assumed to be the same.

Nope. A "glyph" is a visual entity. "é" is a glyph, and "é" with a
combining macron and tilde above and dot below is still, basically, a
glyph. Glyphs are entities of visualization, characters of
representation.
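
The difference is easy to see in code (Python; the particular accent
pile-up is just an example):

    >>> import unicodedata
    >>> glyph = "e\u0301\u0304\u0303\u0323"   # "é" with macron, tilde and dot below
    >>> len(glyph)                            # five characters (codepoints)...
    5
    >>> sum(1 for c in glyph if not unicodedata.combining(c))
    1                                         # ...but one base character, one glyph on screen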

KO> I presume that an encoding internal to LyX that would have a one-to-one mapping
KO> between "character spaces" and 32-bit entities would be a big advantage. I
KO> don't mean that a "character space" is a fixed amount of space in pixels. I
KO> just mean that when you backspace or left-arrow, one backspace or one arrow
KO> press should jump over one 32-bit entity in the buffer.

I don't like the idea of a document-specific encoding at all. What
about copying & pasting between documents? Between applications? Lots
of conversion, both between different document-specific encodings and
to and from Unicode just for the clipboard.

Reminds me of Emacs with its Mule character set, where you can
configure the clipboard either to be compatible with other X
applications _or_ to preserve multilingual characters, but not both at
the same time.

The question is, does there have to be a 1:1 mapping from visual
screen positions (or, in terminal-speak, "character cells") to the
backing store at all? Is it such a PITA if this is not the case, or
have we just, over the years, gotten used to the assumption that every
screen cell should correspond to exactly one character? How about,
instead, defining some intelligent behaviour on the editing side, as
outlined above? (This needs to be done anyway for Arabic, Hebrew etc.,
where the 1:1 mapping assumption has never worked, not even for Hebrew
with its combining vowel marks.) Even in such a layman's character
model, it would certainly be necessary to have more complex
functionality, if only to be able to maintain the existing Hebrew and
Arabic support.

And you still need the extra functionality for Latin too, in case you
want to use a combining accent with any of these layman's characters.
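
For what it's worth, grouping a Unicode string into such layman's
characters takes only a few lines and needs no document-specific
encoding (a rough Python sketch; proper grapheme cluster segmentation
as described in the Unicode guidelines covers more cases):

    import unicodedata

    def laymans_characters(text):
        """Group a string into base character + trailing combining marks."""
        groups = []
        for ch in text:
            if groups and unicodedata.combining(ch):
                groups[-1] += ch        # attach the mark to the previous base
            else:
                groups.append(ch)       # start a new group
        return groups

    print(laymans_characters("ea\u20D1"))   # two groups: "e", and "a" + accent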

KO> Consider string searching: if you have a document which has "é"
KO> encoded as a single Unicode character, but you have entered it
KO> into the search box as two characters that still end up being
KO> (and meaning) just an "é", what kind of behaviour do you expect
KO> with raw Unicode data? I can tell you that what you expect is
KO> nothing of what will happen: what happens is that QString (or any
KO> other Unicode-based implementation) will simply not match (or
KO> that's what I'm worried about)! Only an implementation which is
KO> based on a customary concept of a "letter" or "layman's character"
KO> will work.

That's why Unicode defines normalization forms and recommends that
data is stored and/or compared in a normalized form, however it is
entered; that is:

- either decompose all composite characters, such as "é" or "ä", into
their components on input [probably easiest]

- or compose all accent combinations as far as possible, such as "e" +
ACUTE ACCENT into "é", and define an order on all combining marks to
have them comparable.

On input, both are trivial to do. On pasting & file import, both are a
bit of work, but generally linear in the length of the text.
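
For example (Python; unicodedata.normalize implements the standard
normalization forms):

    >>> import unicodedata
    >>> single   = "\u00E9"      # "é" as one precomposed character
    >>> combined = "e\u0301"     # "e" + COMBINING ACUTE ACCENT
    >>> single == combined       # raw comparison: no match
    False
    >>> unicodedata.normalize("NFD", single) == unicodedata.normalize("NFD", combined)
    True                         # equal once both are decomposed
    >>> unicodedata.normalize("NFC", combined) == single
    True                         # equal once both are composed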

Note that this problem is not specific to text processing at all; a
database would have the same problem on field comparison.

(Some of this is detailed in http://www.unicode.org/uni2book/ch03.pdf,
but it's rather arcane in its language. There is a special document on
normalization forms available from http://www.unicode.org/reports/tr15/.)

Cheers -
  Philipp Reichmuth                            mailto:[EMAIL PROTECTED]

--
First snow, then silence / This thousand dollar screen dies / so beautifully
