Philipp Reichmuth <[EMAIL PROTECTED]> writes:

| KO> But it's probably very true that just using a 32-bit encoding with
| KO> *mostly* one-to-one mapping between characters and dwords is easy
| KO> to use. But not always easy to use.
| 
| Do you have a counterexample of a Unicode character that can't be
| mapped to a single UCS-4-encoded dword?

Too easy, I guess, once you begin looking into combining characters.
Not all combinations have their own code point.
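
A small sketch of what I mean, assuming Python 3 and its standard
unicodedata module (the choice of "q" plus a combining diaeresis is just
my own illustration of a pair that, as far as I know, has no precomposed
code point):

```python
import unicodedata

# "q" followed by U+0308 COMBINING DIAERESIS. This pair is used here as an
# assumed example of a combination without a precomposed code point.
q_diaeresis = "q\u0308"

# NFC composes wherever a precomposed form exists; here it has nothing to use.
composed = unicodedata.normalize("NFC", q_diaeresis)

print(len(composed))                     # number of code points after NFC
print([hex(ord(c)) for c in composed])   # the individual code points
```

So what the user sees as one character still needs two dwords in UCS-4.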

| Characters, spaces and unique cursor positions have very little to do
| with each other. You have to distinguish between characters and
| glyphs. Unicode operates on the character level, as does most of all
| text processing. Characters are what is represented internally in the
| backing store; there is no difference in representation for the
| various Arabic contextual forms of a character. Unicode U+0645 ARABIC
| LETTER MEEM will always be represented as U+0645, regardless of where
| it appears in the word.

From 'man unicode':

        For example, the German
       character Umlaut-A ("Latin capital letter A with diaeresis") can either
       be  represented by the precomposed UCS code 0x00c4, or alternatively as
       the combination of a normal "Latin capital  letter  A"  followed  by  a
       "combining diaeresis": 0x0041 0x0308.
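
The two representations from the man page can be compared directly. A
minimal sketch, assuming Python 3 and the standard unicodedata module
(the variable names are mine):

```python
import unicodedata

precomposed = "\u00c4"        # LATIN CAPITAL LETTER A WITH DIAERESIS
combining = "\u0041\u0308"    # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Compared code point by code point, the two strings differ...
print(precomposed == combining)

# ...but normalization maps one form onto the other.
print(unicodedata.normalize("NFC", combining) == precomposed)
print(unicodedata.normalize("NFD", precomposed) == combining)
```

Any code that compares or searches text has to pick one normalization
form first, whatever the internal encoding is.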

| KO> - since that essentially breaks the simple "full information about one
| KO> character/composite glyph per 32 bits" assumption, one could as well go with 
| KO> utf8, right?
| 
| No.

I agree.
 
| With UCS4, one character will essentially always have the same width
| (bit width, that is). Every Unicode character is 32 bits wide, with no
| exception. Actually, it is the only Unicode format that offers this,
| and this fixed-width property is the main reason why people use UCS-4
| at all.
| 
| Composite glyphs are a completely different matter, they are just
| glyphs that consist of two distinct Unicode characters, such as "e" +
| "´" to form "é".

One glyph that takes 64 bits to encode...
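
To put numbers on it, a quick sketch assuming Python 3, with "utf-32-be"
standing in for UCS-4 (no byte order mark) and U+0301 COMBINING ACUTE
ACCENT as the accent:

```python
# "e" + U+0301 COMBINING ACUTE ACCENT: one glyph on screen, two characters.
e_acute = "e\u0301"

ucs4 = e_acute.encode("utf-32-be")   # fixed 4 bytes per code point, no BOM
utf8 = e_acute.encode("utf-8")       # variable width

print(len(e_acute))      # code points in the string
print(len(ucs4) * 8)     # bits used in UCS-4
print(len(utf8) * 8)     # bits used in UTF-8
```

The fixed width holds per character, not per glyph.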

| The only assumption that is gone is that two characters should be
| displayed in separate places on screen. But, as I said, this is a
| toolkit problem that ideally we don't have to care about at all, and
| it is not substantially different from using proportional fonts with
| the present system, where you don't have a fixed one-to-one mapping
| from character to screen position either.
| 
| KO> It seems that utf8 is still more compact than 32 bits per unicode table entry, 
| KO> in the worst case.
| 
| UTF-8 has a maximum character width of 4 bytes.

6.

But only 4 are allowed at this stage, since no Unicode code point lies
above 0x10ffff.

The UTF-8 representation itself is ready for 6-byte encoded chars.
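
A rough sketch of the length rule in the original UTF-8 scheme, assuming
Python 3 (utf8_length is a hypothetical helper of mine, not a library
function; Python's own encoder stops at 0x10ffff, so the 5- and 6-byte
cases can only be computed, not encoded):

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed for a value under the original UTF-8 scheme (31 bits max)."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    # Largest value that fits in a sequence of 1, 2, ... 6 bytes.
    limits = [0x7F, 0x7FF, 0xFFFF, 0x1FFFFF, 0x3FFFFFF, 0x7FFFFFFF]
    for n_bytes, limit in enumerate(limits, start=1):
        if code_point <= limit:
            return n_bytes
    raise ValueError("value does not fit in 31 bits")


# Within today's Unicode range the rule agrees with the real encoder.
for cp in (0x41, 0xC4, 0x20AC, 0x10FFFF):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))

print(utf8_length(0x10FFFF))     # highest Unicode code point
print(utf8_length(0x7FFFFFFF))   # highest value the scheme can carry
```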

-- 
        Lgb
