Abdelrazak Younes wrote:
> utf8 can use up to 6 bytes. 4 bytes would not have been enough to store 4 bytes of data plus the protocol necessary to decode utf8.

Unicode does not use the full 32 bits. There are only 2^20 + 2^16 code points (U+0000 through U+10FFFF), so the whole of Unicode actually fits in 21 bits (there is no 21-bit integer type, of course). That is why every Unicode code point can be encoded in 1-4 bytes as UTF-8.
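
To illustrate (this is my own sketch, not from the original mail): here is a minimal C++ encoder for a single code point up to U+10FFFF. The function name to_utf8 is mine, and for brevity it skips the surrogate-range check (U+D800-U+DFFF) that a real encoder must perform.

#include <iostream>
#include <string>

// Encode one Unicode code point (<= U+10FFFF) as UTF-8.
std::string to_utf8(char32_t cp)
{
    std::string out;
    if (cp < 0x80) {                        // up to  7 bits -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {                // up to 11 bits -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {              // up to 16 bits -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                                // up to 21 bits -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

int main()
{
    // Even the highest code point, U+10FFFF, needs only 4 bytes.
    for (unsigned char b : to_utf8(0x10FFFF))
        std::cout << std::hex << int(b) << ' ';
    std::cout << '\n';                      // prints: f4 8f bf bf
}

Each continuation byte carries 6 payload bits, so a 4-byte sequence holds 3 + 3*6 = 21 bits, which is exactly enough for the limit above.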

By the way, UCS-4 and UTF-32 can be taken to be identical, so it is better to use only the names UTF-8, UTF-16 and UTF-32. Names like "UTF-8 to UCS-4" just confuse people.

(Historically, before the Unicode specification contained this code point limit, there was indeed a difference between UCS-4 and UTF-32, and UTF-8 sequences of more than 4 bytes were a theoretical possibility. This is no longer the case.)

Hopefully this makes things clear.

Joost
