On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:
On Monday, 10 June 2013 10:51:34 AM UTC+3, user Larry Hudson wrote:
I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to
256, not above 256.
0 - 127, yes.
128 - 255 -> one byte of a multibyte code.
You mean that in UTF-8, storing 1 character takes 2 bytes?
I am still having trouble understanding this.
UTF-8 characters are encoded in different sizes, NOT in a single fixed number of
bytes.
The high _bits_ of the first byte define the number of bytes of the individual
character code.
(I'm copying this from Wikipedia...)
0xxxxxxx -> 1 byte
110xxxxx -> 2 bytes
1110xxxx -> 3 bytes
11110xxx -> 4 bytes
111110xx -> 5 bytes (original design only; invalid since RFC 3629)
1111110x -> 6 bytes (original design only; invalid since RFC 3629)
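The table can be turned into a small check on the first byte. This is only a sketch in Python 3: the function name utf8_sequence_length is made up for illustration, and it handles just the 1- to 4-byte patterns, since RFC 3629 restricts UTF-8 to at most 4 bytes. It classifies the lead byte and does not validate the rest of the sequence.

```python
def utf8_sequence_length(first_byte):
    """Return the sequence length implied by a UTF-8 lead byte (an int 0-255).

    Returns None when the byte matches none of the 1- to 4-byte patterns,
    for example a continuation byte (10xxxxxx). No further validation is done.
    """
    if first_byte & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if first_byte & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if first_byte & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if first_byte & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    return None


# Compare the length implied by the first byte with the real encoded length.
for ch in ["A", "\u03bd", "\u20ac", "\U0001F600"]:
    encoded = ch.encode("utf-8")
    print("U+%04X" % ord(ch), len(encoded), utf8_sequence_length(encoded[0]))
```

For each sample character the two numbers printed should agree, because the encoder picks the lead byte pattern that matches the number of bytes it emits.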
Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for
the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.
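To see the 0-127 boundary directly, one can simply ask Python 3 how many bytes each of the first 256 code points takes. A quick sketch (the variable names are just for illustration):

```python
# Group the first 256 code points by their UTF-8 encoded length.
sizes = {}
for code_point in range(256):
    n = len(chr(code_point).encode("utf-8"))
    sizes.setdefault(n, []).append(code_point)

for n, points in sorted(sizes.items()):
    print(n, "byte(s): code points", points[0], "to", points[-1])

# For 0-127 the UTF-8 bytes are identical to the ASCII bytes.
ascii_text = "".join(chr(cp) for cp in range(128))
print(ascii_text.encode("utf-8") == ascii_text.encode("ascii"))
```

This reports 1 byte for code points 0 to 127 and 2 bytes for 128 to 255, and the final comparison prints True.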
Since 2^8 = 256, UTF-8 would need only 1 byte to store the first 256 characters, but
instead it uses 1 byte only up to value 127 and then 2 bytes for
anything above. Why?
As I indicated above, one bit is reserved as a flag to indicate that the code is a one-byte code
and not a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
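To make the flag-bit point concrete, here is a rough sketch in Python 3 that builds the 2-byte form by hand for a value that would fit in 8 bits. The helper encode_two_byte is hypothetical, written only for this illustration. It also relies on one detail not shown in the table above: the second byte of a 2-byte sequence has the form 10xxxxxx.

```python
def encode_two_byte(code_point):
    """Hand-build the 2-byte UTF-8 form for a code point in 128..2047."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("the 2-byte form covers code points 128 to 2047 only")
    # 110xxxxx: flag bits plus the top 5 bits of the code point
    lead = 0b11000000 | (code_point >> 6)
    # 10xxxxxx: flag bits plus the low 6 bits of the code point
    continuation = 0b10000000 | (code_point & 0b00111111)
    return bytes([lead, continuation])


code_point = 233  # U+00E9, e with acute accent; the value itself fits in 8 bits
manual = encode_two_byte(code_point)
print(format(code_point, "08b"))                      # the raw 8-bit value
print(" ".join(format(b, "08b") for b in manual))     # the two UTF-8 bytes
print(manual == chr(code_point).encode("utf-8"))      # compare with the built-in encoder
```

The 8 data bits of 233 end up spread over two bytes because each byte spends its high bits on the flags, which is why nothing above 127 fits in a single byte.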