Georg Baum wrote:
OK, so it is like that: Up to 4 bytes per code point are used for the currently defined 21 bits of UCS-4, but UTF-8 is designed in such a way that it is possible to encode all 31 bits of UCS-4 with at most 6 bytes per code point.

Not really. Some years ago the Unicode specification did not yet set a real limit on the number of code points (the theoretical limit was 2^31, if I remember correctly).

However, the limit has now been set to 2^20+2^16 = 1,114,112 code points (U+0000 through U+10FFFF). There is still a lot of space available, but there will _never_ be more code points than that (also not in UCS-4!).

So by definition UTF-8 allows a maximum of 4 bytes per character. Any 5- or 6-byte sequence is invalid.
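
A minimal sketch of what that limit means in practice (encode_utf8 is my own name for illustration, not a function from any particular library):

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Encode one code point as UTF-8. Since the code point space ends
    // at U+10FFFF (= 2^20+2^16 - 1), the 4-byte branch is the last one
    // a correct encoder ever needs; the 5- and 6-byte patterns of the
    // old definition can never be produced. A complete implementation
    // would also reject the surrogate range U+D800..U+DFFF.
    std::string encode_utf8(std::uint32_t cp)
    {
        std::string out;
        if (cp < 0x80) {              // 1 byte: US-ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {      // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {    // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x110000) {   // 4 bytes: the maximum
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("beyond U+10FFFF: not Unicode");
        }
        return out;
    }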

To summarize:

* UTF-8 uses 1-4 bytes (1 byte for US-ASCII, 2 bytes for other Latin characters, 3 bytes for Chinese etc. and 4 bytes for rare characters).

* UTF-16 uses 2 bytes for Latin, Chinese etc. and 4 bytes (a surrogate pair) for rare characters; see the sketch after this list.

* UTF-32 has a fixed length of 4 bytes per character and is functionally equivalent to UCS-4.
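
To make the UTF-16 case concrete, here is a matching sketch (again, encode_utf16 is my own illustrative name). The surrogate-pair mechanism is also what fixes the limit above: the pairs can address exactly 2^20 code points, which together with the 2^16 code points of the first plane gives 2^20+2^16.

    #include <cstdint>
    #include <vector>

    // Encode one code point as UTF-16 code units (2 bytes each).
    // Code points up to U+FFFF fit in a single unit; everything above
    // is split into a surrogate pair carrying 20 bits, so nothing
    // beyond U+10FFFF can be addressed at all.
    std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
    {
        std::vector<std::uint16_t> units;
        if (cp < 0x10000) {          // Latin, Chinese etc.: one unit
            units.push_back(static_cast<std::uint16_t>(cp));
        } else {                     // rare characters: surrogate pair
            cp -= 0x10000;           // leaves a 20-bit value
            units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        }
        return units;
    }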

Please keep things simple and call the encodings UTF-8, UTF-16 and UTF-32.

Joost
