Philipp Reichmuth <[EMAIL PROTECTED]> writes:

| KO> But it's probably very true that just using a 32-bit encoding with
| KO> *mostly* one-to-one mapping between characters and dwords is easy
| KO> to use. But not always easy to use.
|
| Do you have a counterexample of a Unicode character that can't be
| mapped to a single UCS-4-encoded dword?

Too easy, I guess, once you begin looking into combining characters.
Not all combinations have their own code point.

| Characters, spaces and unique cursor positions have very little to do
| with each other. You have to distinguish between characters and
| glyphs. Unicode operates on the character level, as does most of all
| text processing. Characters are what is represented internally in the
| backing store; there is no difference in representation for the
| various Arabic contextual forms of a character. Unicode U+0645 ARABIC
| LETTER MEEM will always be represented as U+0645, regardless of where
| it appears in the word.

From 'man unicode':

    For example, the German character Umlaut-A ("Latin capital letter
    A with diaeresis") can either be represented by the precomposed
    UCS code 0x00c4, or alternatively as the combination of a normal
    "Latin capital letter A" followed by a "combining diaeresis":
    0x0041 0x0308.

| KO> - since that essentially breaks the simple "full information about one
| KO> character/composite glyph per 32 bits" assumption, one could as well go with
| KO> utf8, right?
|
| No.

I agree.

| With UCS4, one character will essentially always have the same width
| (bit width, that is). Every Unicode character is 32 bits wide, with no
| exception. Actually, it is the only Unicode format that offers this,
| and this fixed-width property is the main reason why people use UCS-4
| at all.
|
| Composite glyphs are a completely different matter, they are just
| glyphs that consist of two distinct Unicode characters, such as "e" +
| "´" to form "é".

One glyph that takes 64 bits to encode...

| The only assumption that is gone is that two characters should be
| displayed in separate places on screen.
| But, as I said, this is a
| toolkit problem that ideally we don't have to care about at all, and
| it is not substantially different from using proportional fonts with
| the present system, where you don't have a fixed one-to-one mapping
| from character to screen position either.
|
| KO> It seems that utf8 is still more compact than 32 bits per unicode table entry,
| KO> in the worst case.
|
| UTF-8 has a maximum character width of 4 bytes.

6, but only 4 are allowed at this stage, since no Unicode code point
lies above 0x10FFFF. The UTF-8 representation is ready for 6-byte
encoded chars.

-- 
Lgb
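
A minimal sketch of the two points above, in Python rather than anything
taken from LyX itself. The helper names ucs4_bytes and utf8_bytes are made
up for illustration, and "q" + combining diaeresis is assumed to be a fair
example of a combination without a precomposed code point. It compares
the precomposed and combining forms of the 'man unicode' example and shows
how many bytes UTF-8 needs at the range boundaries up to 0x10FFFF.

```python
import unicodedata

# The two spellings from 'man unicode'.
precomposed = "\u00c4"       # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed = "\u0041\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS


def ucs4_bytes(text):
    """Bytes needed when every code point is stored as one 32-bit dword."""
    return len(text.encode("utf-32-be"))  # the -be codec writes no BOM


def utf8_bytes(text):
    """Bytes needed for the same text in UTF-8."""
    return len(text.encode("utf-8"))


# Normalization maps one spelling onto the other.
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed

# A combination with no code point of its own stays two code points
# even after composition (assumed example: q + combining diaeresis).
q_diaeresis = unicodedata.normalize("NFC", "q\u0308")

# UTF-8 width of single code points at the range boundaries.
boundaries = (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF)
utf8_widths = {hex(cp): utf8_bytes(chr(cp)) for cp in boundaries}

if __name__ == "__main__":
    print("UCS-4 bytes:", ucs4_bytes(precomposed), "vs", ucs4_bytes(decomposed))
    print("UTF-8 bytes:", utf8_bytes(precomposed), "vs", utf8_bytes(decomposed))
    print("NFC/NFD round trip:", nfc_matches, nfd_matches)
    print("code points in composed q + diaeresis:", len(q_diaeresis))
    print("UTF-8 widths:", utf8_widths)
```

The decomposed spelling is the "one glyph that takes 64 bits" case: two
dwords in UCS-4, while the precomposed spelling fits in one. Nothing up to
0x10FFFF needs more than 4 bytes in UTF-8.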