On 06/10/2013 01:11 AM, Νικόλαος Κούρας wrote:
On Monday, 10 June 2013 10:51:34 AM UTC+3, user Larry Hudson wrote:

I mean utf-8 could use 1 byte for storing the 1st 256 characters. I meant up to 
256, not above 256.

0 - 127, yes.
128 - 255 -> one byte of a multibyte code.
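
A quick way to check those two ranges yourself, assuming Python 3 (the helper name utf8_len is made up for this illustration, it is not a library function):

```python
def utf8_len(code_point):
    """Return how many bytes UTF-8 uses to encode one code point."""
    return len(chr(code_point).encode("utf-8"))

# Code points 0-127 should take one byte each, 128-255 two bytes each.
one_byte = all(utf8_len(cp) == 1 for cp in range(0, 128))
two_bytes = all(utf8_len(cp) == 2 for cp in range(128, 256))
print(one_byte, two_bytes)
```

Both checks should come out True: nothing from 128 upward fits in a single UTF-8 byte.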

You mean that in UTF-8, storing 1 character takes 2 bytes?
I am still having trouble understanding this.

UTF-8 characters are encoded in different sizes, NOT a single fixed number of 
bytes.
The high _bits_ of the first byte define the number of bytes of the individual 
character code.

(I'm copying this from Wikipedia...)
0xxxxxxx -> 1 byte
110xxxxx -> 2 bytes
1110xxxx -> 3 bytes
11110xxx -> 4 bytes
111110xx -> 5 bytes (original design only; RFC 3629 limits UTF-8 to 4 bytes)
1111110x -> 6 bytes (original design only)
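
As a rough sketch of how a decoder could read that table, here is a small Python 3 function that looks only at the high bits of the first byte. The name seq_len_from_lead is hypothetical, and the function assumes current UTF-8 (at most 4 bytes), so the 5- and 6-byte patterns are rejected. It does not do full validation (overlong forms, for example, are not checked).

```python
def seq_len_from_lead(byte):
    """Length of a UTF-8 sequence, judged from its first byte (0-255)."""
    if byte & 0b10000000 == 0b00000000:   # 0xxxxxxx
        return 1
    if byte & 0b11100000 == 0b11000000:   # 110xxxxx
        return 2
    if byte & 0b11110000 == 0b11100000:   # 1110xxxx
        return 3
    if byte & 0b11111000 == 0b11110000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; 111110xx and above are not used today.
    raise ValueError("not a valid lead byte in current UTF-8")

# Compare the prediction from the first byte with the real encoded length.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    encoded = ch.encode("utf-8")
    print(hex(ord(ch)), len(encoded), seq_len_from_lead(encoded[0]))
```

For each sample character the last two numbers on the line should agree.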

Notice that in the 1-byte version, since the high bit is always 0, only 7 bits are available for the character code, and this is the standard 0-127 ASCII (and ASCII-compatible) code set.

Since 2^8 = 256, UTF-8 would need only 1 byte to store the first 256 characters, but 
instead it's using 1 byte for values up to 127 and then 2 bytes for 
anything above.  Why?

As I indicated above, one bit is reserved as a flag to indicate that the code is a one-byte code and not a multibyte code, so only 7 bits are available for the actual 1-byte (ASCII) code.
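
To see the flag bit directly, it may help to print the encoded bytes in binary. This is only an illustration in Python 3; utf8_bits is a made-up helper name.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of one character as 8-bit binary strings."""
    return [format(b, "08b") for b in ch.encode("utf-8")]

print(utf8_bits("\x7f"))   # code point 127: ['01111111']
print(utf8_bits("\x80"))   # code point 128: ['11000010', '10000000']
print(utf8_bits("\xff"))   # code point 255: ['11000011', '10111111']
```

At 127 the high bit is 0 and the other 7 bits are all used up. From 128 on, the first byte starts with 110 (a 2-byte lead) and the second starts with 10 (a continuation byte), so the value is spread over the remaining bits of both bytes.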

--
http://mail.python.org/mailman/listinfo/python-list
