Philipp Reichmuth <[EMAIL PROTECTED]> writes:

| LGB>> One glyph that takes 64 bits to encode...
|
| LGB> | But not for any *technical* purpose. For all purposes of string
| LGB> | processing, such as indexing, concatenation etc., this is *two*
| LGB> | characters, not one.
|
| LGB> Finding the length of the string...
|
| Sorry, I don't understand. The length of the string U+0065 U+0301
| certainly is 2, regardless of how the rendering engine displays this.
| Of course, the rendering engine should render it as "é" because U+0301
| is a combining character, but the string length is still 2.

Not if I want to count the number of characters in the document.

| Oh, but they *are* doing them for real... there's some 75,000
| characters encoded as of now, and some 55,000 of these are Chinese
| symbols.... On the Unicode list, they're pretty confident about 20
| bits being sufficient. (And since they have people who actually *know*
| something about the languages they're encoding, so I don't really feel
| bad about sharing their confidence, even though I don't have a clue
| about Chinese. :-))

I mean, this does somehow remind me of this 640 kB thingy, so with 32
bits one is on the safe side ;-) especially since we do not have any
20-bit integer types.

-- Lgb
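
To make the two readings of "length" in this exchange concrete, here is a small Python sketch. It is only an illustration, not code from anyone in the thread: the function names are made up, and skipping combining marks is just a rough approximation of "characters as a reader would count them". The last two lines check the bit arithmetic behind the 75,000-character figure quoted above.

```python
import unicodedata

# The string from the discussion: U+0065 (e) followed by U+0301 (combining acute).
s = "\u0065\u0301"

def count_code_points(text):
    """Length as string processing sees it: one unit per code point."""
    return len(text)

def count_visible_characters(text):
    """Rough reader-oriented count (an assumption of this sketch):
    skip code points whose canonical combining class is non-zero."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

# NFC normalization composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", s)

# Bits needed to number 75,000 characters (indices 0..74999).
bits_for_75000 = (75_000 - 1).bit_length()
fits_in_20_bits = bits_for_75000 <= 20

print(count_code_points(s), count_visible_characters(s), len(composed))
print(bits_for_75000, fits_in_20_bits)
```

So both sides are right about what they are counting: the string holds two code points, but a count for the reader of the document would be one.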