KO> The whole difference is that just as any Arab understands that various
KO> combinations of glyphs are still separate characters, it's kinda hard
KO> to persuade me at least that é is more than one character. The whole
KO> point about combining characters (at least accents for roman alphabets,
KO> Cyrillic and what else) is that they combine with the main character.
The main point of combining characters in Unicode is that there is no
sense in encoding every possible combination of accents on Latin
characters that has ever appeared in print somewhere. Instead, treating
combining accents as separate characters saves a couple of thousand
codepoints.

KO> And it's not true that "such things" should be handled by the
KO> underlying toolkit. They are handled well when it comes to
KO> display, but they typically are handled badly when you implement
KO> your own editor widget. What do you do when you press back-arrow
KO> over something that visually consists of one unicode character?
KO> Two unicode characters? Five unicode characters? Say you're
KO> backspacing over an "a" with a vector accent. What do you expect
KO> to happen?

Yes. But that's life. You have to face that when you write an editor
widget for a complex script like Arabic, with cursor placement, deletion
of vowel marks and the like. The only difference is that people sooner
or later come to realize that, with the advent of freely combining
accents, Latin script processing no longer adheres to the simple
one-character-per-display-cell model that we've inherited from text
terminals - an assumption that has never worked for complex scripts
anyway.

The added complexity in an editor widget is marginal for Latin scripts:

- on deletion, delete everything backwards up to and including the first
  non-combining character;

- on accent insertion, find the first preceding non-combining character,
  get a list of all the accents following it, insert your accent, and
  re-sort the list according to some rule;

- on cursor movement, always move the editing point to the next
  non-combining character.

(There is a small sketch of what these rules look like in code a bit
further down.)

Remember that this work has been done already; for example, there are
GPL'ed toolkits such as Pango (http://www.pango.org) that deal with just
this: multilingual, Unicode-conformant text editing and output.

Note that there are already some suggestions for this in the Unicode
standard itself, in the Implementation Guidelines
(http://www.unicode.org/uni2book/ch05.pdf). The online version of the
standard is http://www.unicode.org/uni2book/ (3.0) plus the diffs in 3.1
(http://www.unicode.org/reports/tr27/) and 3.2
(http://www.unicode.org/reports/tr28/). Because all this is a bit painful
to read with all the diffs, they're currently putting it all together in
a new version (4.0), but this should not be a problem as upward
compatibility is a design principle. And in general, in case of doubt,
the Unicode mailing list people are rather helpful.

KO> What will happen in a "dumb" unicode-based editor?

The vector accent will be removed. This is not really intuitive, but
that's the present state of things with many Unicode editors. Say you've
got this combination: U+0065 U+0061 U+20D1 (e, a, combining vector
accent). With the cursor between the first two, when you press Delete,
the "a" gets deleted and the vector arrow hops over to the "e"; and with
the cursor between the last two, when you press Delete, the vector
accent vanishes from the "a". Note that this is in no way different from
how a dumb present-day editor handles ISO-encoded Arabic vowel marks.
The only new thing is that any good editing widget with Unicode
functionality has to provide some input provisions for Latin, too, not
only for other complex scripts.
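To make the three editing rules above a bit more concrete, here is a
minimal sketch of the deletion and cursor-movement part. I'm assuming
Qt/C++ here simply because QString comes up below anyway (QChar::isMark()
needs a reasonably recent Qt); this is not meant as *the* implementation
for LyX or anything else - a real widget would also have to worry about
surrogate pairs, grapheme clusters and so on:

    #include <QChar>
    #include <QString>
    #include <QDebug>

    // Step back from 'pos' to the position of the previous
    // non-combining (base) character.
    static int prevBase(const QString &s, int pos)
    {
        int i = pos - 1;
        // QChar::isMark() covers the Unicode Mark_* categories.
        while (i > 0 && s.at(i).isMark())
            --i;
        return i;
    }

    // Left-arrow: jump over any combining marks to the previous
    // base character.
    static void cursorLeft(const QString &s, int &pos)
    {
        if (pos > 0)
            pos = prevBase(s, pos);
    }

    // Backspace: delete the base character before 'pos' together
    // with all combining marks attached to it.
    static void backspace(QString &s, int &pos)
    {
        if (pos == 0)
            return;
        int start = prevBase(s, pos);
        s.remove(start, pos - start);
        pos = start;
    }

    int main()
    {
        // e, a, combining vector accent - the example from above
        QString s = QString(QChar(0x0065)) + QChar(0x0061) + QChar(0x20D1);
        int cursor = s.size();   // cursor at the very end, after the accent

        cursorLeft(s, cursor);   // one key press: cursor lands before the "a",
        qDebug() << cursor;      // i.e. at position 1, skipping the accent

        cursor = s.size();
        backspace(s, cursor);    // removes the "a" *and* its accent
        qDebug() << s;           // "e"
        return 0;
    }

With rules like these the accent can never "hop over" onto the preceding
character the way it does in the dumb editor described above.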
"é" is a glyph, and "é" with a combining macron and tilde above and dot below is still, basically, a glyph. Glyphs are entities of visualization, characters of representation. KO> I presume that an encoding internal to LyX that would have one-to-one mapping KO> between "character spaces" and 32-bit-entities would be of big advantage. I KO> don't mean that a "character space" is a fixed amount of space in pixels. I KO> just mean that when you backspace or left-arrow, one backspace or one-arrow KO> should jump around one 32-bit entity in the buffer. I don't like the idea of a document-specific encoding at all, what about copying & pasting between documents? Between applications? Lots of conversion, both between different document-specific encodings as well as to and from Unicode just for the clipboard. Reminds me of Emacs with its Mule character set where you can either configure the clipboard to be compatible with other X applications _or_ to preserve multilingual characters, but not both at a time. The question is, has there got to be a 1:1 mapping from visual screen positions (or, in terminal-speak, "character cells") to the backing store at all? Is it such a PITA if this is not the case, or have we just over the years gotten used to the assumption that every screen cell should correspond to exactly one character? How about, instead, defining some intelligent behaviour on the editing side, as outlined above? (Since this needs to be done anyway for Arabic, Hebrew etc., where the 1:1 mapping assumption doesn't work anyway, even for Hebrew with its combining vowel marks) Even in such a laymans' character model, it would certainly be necessary to have more complex functionality anyway, if just to be able to maintain the existing Hebrew and Arabic support. And you still need the extra functionality for Latin too, in case you want to use a combining accent with any of these layman's characters. KO> Consider string searching: if you have a document which has "é" KO> encoded as a single unicode character, but you have entered it KO> into the search box as two characters that still end up being KO> (and meaning) just an "é", what kind of behaviour do you expect KO> with raw unicode data? I can tell you that what you expect is KO> nothing of what will happen: what happens is that QString (or any KO> other Unicode-based) implementation will simply not match (or KO> that's what I'm worried about)! Only an implementation which KO> bases on a customary concept of a "letter" or "layman's character" KO> will work. That's why Unicode defines normalization forms and recommends that data is stored and/or compared in a normalized form, however it is entered; that is: - either decompose all composite characters, such as "é" or "ä", into their components on input [probably easiest] - or compose all accent combinations as far as possible, such as "e" + ACUTE ACCENT into "é", and define an order on all combining marks to have them comparable. On input, both are trivial to do. On pasting & file import, both are a bit of work, but generally in the linear range. Note that this problem is not specific to text processing at all; a database would have the same problem on field comparison. (Some of this is detailed in http://www.unicode.org/uni2book/ch03.pdf, but it's rather arcane in its language. 
Cheers - Philipp Reichmuth
mailto:[EMAIL PROTECTED]

--
First snow, then silence / This thousand dollar screen dies / so beautifully