Abdelrazak Younes wrote:
> utf8 can use up to 6 bytes. 4 bytes would not have been enough to store 4 bytes of data plus the protocol necessary to decode utf8.

Unicode does not use the full 32 bits. There are only 2^20 + 2^16 code points (U+0000 through U+10FFFF), so the whole of Unicode actually fits in 21 bits (there is no 21-bit integer type, of course). That is why every Unicode code point can be encoded in 1-4 bytes as UTF-8.
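
To illustrate (this is my own sketch, not from the original mail): here is a minimal C++ encoder for a single code point up to U+10FFFF. The function name to_utf8 is mine, and for brevity it skips the surrogate-range check (U+D800-U+DFFF) that a real encoder must perform.

#include <iostream>
#include <string>

// Encode one Unicode code point (<= U+10FFFF) as UTF-8.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                        // up to  7 bits -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // up to 11 bits -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // up to 16 bits -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // up to 21 bits -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main()
{
    // Even the highest code point, U+10FFFF, needs only 4 bytes.
    for (unsigned char b : to_utf8(0x10FFFF))
        std::cout << std::hex << int(b) << ' ';
    std::cout << '\n';                      // prints: f4 8f bf bf
}

Each continuation byte carries 6 payload bits, so a 4-byte sequence holds 3 + 3*6 = 21 bits, which is exactly enough for the limit above.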

By the way, UCS-4 and UTF-32 can be taken to be identical, so it is better to use only the names UTF-8, UTF-16 and UTF-32. Names like "UTF-8 to UCS-4" just confuse people.

(Historically, before the Unicode specification contained this code point limit, there was indeed a difference between UCS-4 and UTF-32, and UTF-8 sequences of more than 4 bytes were a theoretical possibility. This is no longer the case.)

Hopefully this makes things clear.

Joost
