Georg Baum wrote:
OK, so it is like that: Up to 4 bytes per code point are used for the
currently defined 21 bits of UCS-4, but UTF-8 is designed in such a way
that it is possible to encode all 31 bits of UCS-4 with at most 6 bytes
per code point.
Not really. Some years ago there was no real limit yet in the Unicode
specification on the number of code points (the theoretical limit was
2^31, if I remember correctly).
However, the limit has now been fixed at 2^20+2^16 code points (that is,
1,114,112 code points, U+0000 through U+10FFFF). There is still a lot of
space available, but there will _never_ be more code points than
2^20+2^16 (not even in UCS-4!).
So by definition UTF-8 allows a maximum of 4 bytes per character. Any 5-
or 6-byte sequence is invalid.
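For illustration, here is a minimal C++ sketch of that rule; the
function name utf8_sequence_length is made up for this mail. It
classifies a lead byte and rejects the old 5- and 6-byte lead bytes:

    // Length of a UTF-8 sequence as given by its lead byte, or 0 if
    // the byte cannot start a valid sequence. The old 5- and 6-byte
    // lead bytes 0xF8-0xFD are rejected, as are bare continuation
    // bytes. (A full validator would also reject 0xF5-0xF7, which
    // would encode code points above U+10FFFF.)
    int utf8_sequence_length(unsigned char lead)
    {
        if (lead < 0x80)           return 1; // 0xxxxxxx: US-ASCII
        if ((lead & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((lead & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((lead & 0xF8) == 0xF0) return 4; // 11110xxx
        return 0; // 10xxxxxx or 0xF8-0xFD: invalid
    }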
To summarize (a short code sketch of these lengths follows the list):
* UTF-8 uses 1-4 bytes (1 byte for US-ASCII, 2 bytes for other Latin
characters, 3 bytes for Chinese etc. and 4 bytes for rare things).
* UTF-16 uses 2 bytes for Latin, Chinese etc. and 4 bytes for rare
characters.
* UTF-32 has a fixed length of 4 bytes per character and is functionally
equivalent to UCS-4.
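In code, the byte counts per encoding look roughly like this (a sketch
under the current 4-byte limit; the helper names are invented for this
mail):

    // Bytes needed to store code point cp (cp <= U+10FFFF).
    int utf8_bytes(char32_t cp)
    {
        if (cp < 0x80)    return 1; // US-ASCII
        if (cp < 0x800)   return 2; // other Latin, Greek, Cyrillic, ...
        if (cp < 0x10000) return 3; // Chinese and most of the BMP
        return 4;                   // rare characters above U+FFFF
    }

    int utf16_bytes(char32_t cp)
    {
        return cp < 0x10000 ? 2 : 4; // surrogate pair above the BMP
    }

    int utf32_bytes(char32_t)
    {
        return 4;                    // fixed width, same as UCS-4
    }

For example, utf8_bytes(0x20AC) (the euro sign) returns 3, while
utf16_bytes(0x20AC) returns 2.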
Please keep things simple and call the encodings UTF-8, UTF-16 and UTF-32.
Joost