Georg Baum wrote:
OK, so it is like that: Up to 4 bytes per code point are used for the currently defined 21 bits of UCS-4, but UTF-8 is designed in such a way that it is possible to encode all 31 bits of UCS-4 with at most 6 bytes per code point.

Not really. Some years ago the Unicode specification did not yet set a real limit on the number of code points (the theoretical limit was 2^31, if I remember correctly).

However, the limit has now been set to 2^20+2^16 = 1,114,112 code points (U+0000 through U+10FFFF). There is still a lot of space available, but there will _never_ be more code points than that (also not in UCS-4!).

So by definition UTF-8 allows a maximum of 4 bytes per character. Any 5- or 6-byte sequence is invalid.
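
A minimal sketch of what that limit means in practice (encode_utf8 is my own name for illustration, not a function from any particular library):

    #include <cstdint>
    #include <stdexcept>
    #include <string>

    // Encode one code point as UTF-8. Since the code point space ends
    // at U+10FFFF (= 2^20+2^16 - 1), the 4-byte branch is the last one
    // a correct encoder ever needs; the 5- and 6-byte patterns of the
    // old definition can never be produced. A complete implementation
    // would also reject the surrogate range U+D800..U+DFFF.
    std::string encode_utf8(std::uint32_t cp)
    {
        std::string out;
        if (cp < 0x80) {              // 1 byte: US-ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {      // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {    // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x110000) {   // 4 bytes: the maximum
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            throw std::invalid_argument("beyond U+10FFFF: not Unicode");
        }
        return out;
    }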

To summarize:

* UTF-8 uses 1-4 bytes (1 byte for US-ASCII, 2 bytes for other Latin characters, 3 bytes for Chinese etc. and 4 bytes for rare characters).

* UTF-16 uses 2 bytes for Latin, Chinese etc. and 4 bytes (a surrogate pair) for rare characters; see the sketch after this list.

* UTF-32 has a fixed length of 4 bytes per character and is functionally equivalent to UCS-4.
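
To make the UTF-16 case concrete, here is a matching sketch (again, encode_utf16 is my own illustrative name). The surrogate-pair mechanism is also what fixes the limit above: the pairs can address exactly 2^20 code points, which together with the 2^16 code points of the first plane gives 2^20+2^16.

    #include <cstdint>
    #include <vector>

    // Encode one code point as UTF-16 code units (2 bytes each).
    // Code points up to U+FFFF fit in a single unit; everything above
    // is split into a surrogate pair carrying 20 bits, so nothing
    // beyond U+10FFFF can be addressed at all.
    std::vector<std::uint16_t> encode_utf16(std::uint32_t cp)
    {
        std::vector<std::uint16_t> units;
        if (cp < 0x10000) {          // Latin, Chinese etc.: one unit
            units.push_back(static_cast<std::uint16_t>(cp));
        } else {                     // rare characters: surrogate pair
            cp -= 0x10000;           // leaves a 20-bit value
            units.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            units.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        }
        return units;
    }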

Please keep things simple and call the encodings UTF-8, UTF-16 and UTF-32.

Joost
