Philipp Reichmuth <[EMAIL PROTECTED]> writes:

| | LGB>> One glyph that takes 64 bits to encode...
>
| LGB> | But not for any *technical* purpose. For all purposes of string
| LGB> | processing, such as indexing, concatenation etc., this is *two*
| LGB> | characters, not one.
>
| LGB> Finding the length of the string...
>
| Sorry, I don't understand. The length of the string U+0065 U+0301
| certainly is 2, regardless of how the rendering engine displays this.
| Of course, the rendering engine should render it as "é" because U+0301
| is a combining character, but the string length is still 2.

Not if I want to count the number of characters in the document.
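
To make the two notions of "length" concrete, here is a minimal sketch in
Python. The function names are made up for illustration, and skipping
combining marks is only an approximation of what a user would call a
character; proper grapheme segmentation has more rules than this.

```python
import unicodedata


def code_point_length(s: str) -> int:
    """Length as seen by string processing: one unit per code point."""
    return len(s)


def count_user_characters(s: str) -> int:
    """Approximate count of characters as a reader sees them.

    Assumption: a code point with a non-zero canonical combining class
    attaches to the preceding base character and is not counted.
    """
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


s = "e\u0301"  # U+0065 followed by U+0301 (combining acute accent)

print(code_point_length(s))                   # 2
print(count_user_characters(s))               # 1
print(len(unicodedata.normalize("NFC", s)))   # 1, composed to U+00E9
```

Both answers are right; they just answer different questions, indexing
versus counting what is on the page.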

| Oh, but they *are* doing them for real... there's some 75,000
| characters encoded as of now, and some 55,000 of these are Chinese
| symbols.... On the Unicode list, they're pretty confident about 20
| bits being sufficient. (And since they have people who actually *know*
| something about the languages they're encoding, I don't really feel
| bad about sharing their confidence, even though I don't have a clue
| about Chinese. :-)) I mean, this does somehow remind me of this 640 kB
| thingy, so with 32 bits one is on the safe side ;-)

especially since we do not have any 20-bit integer types.
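
As a rough sketch of that arithmetic (in Python, with a hypothetical
helper name, and assuming the usual fixed integer widths of 8, 16, 32
and 64 bits), a 20-bit value ends up in a 32-bit slot anyway:

```python
# Assumption: these are the only integer widths available to us.
STANDARD_WIDTHS = (8, 16, 32, 64)


def smallest_standard_width(bits_needed: int) -> int:
    """Return the narrowest standard integer width holding bits_needed bits."""
    for width in STANDARD_WIDTHS:
        if bits_needed <= width:
            return width
    raise ValueError("no standard integer type is wide enough")


print(2 ** 20)                      # 1048576 distinct values in 20 bits
print(smallest_standard_width(20))  # 32
print(smallest_standard_width(16))  # 16
```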

-- 
        Lgb
