I know I have been a pain in the ass these days, but this is the last explanation I want from you, just to understand it completely.

>> Since 1 byte can hold up to 256 chars, why not utf-8 use 1-byte for
>> values up to 256?

>Because then how do you tell when you need one byte, and when you need
>two? If you read two bytes, and see 0x4C 0xFA, does that mean two
>characters, with ordinal values 0x4C and 0xFA, or one character with
>ordinal value 0x4CFA?

I mean UTF-8 could use 1 byte for storing the first 256 characters. I meant up to 256, not above 256.

>> UTF-8 and UTF-16 and UTF-32
>> I thought the number beside UTF- was to declare how many bits the
>> character set was using to store a character into the hdd, no?

>Not exactly, but close. UTF-32 is completely 32-bit (4 byte) values.
>UTF-16 mostly uses 16-bit values, but sometimes it combines two 16-bit
>values to make a surrogate pair.

Is a surrogate pair like hitting, for example, Ctrl-A, meaning a combination that consists of 2 different characters? Is this what a surrogate is, a pair of 2 chars?

>UTF-8 uses 8-bit values, but sometimes
>it combines two, three or four of them to represent a single code-point.

'a' encoded as UTF-8 needs 1 byte to be stored? (since ordinal = 97)
'α΄' encoded as UTF-8 needs 2 bytes to be stored? (since ordinal is > 127)
'a Chinese ideogram' encoded as UTF-8 needs 4 bytes to be stored? (since ordinal > 65000)

Does the number of bytes needed to store a character depend solely on the character's ordinal value in the Unicode table?

>UTF-8 solves this problem by reserving some values to mean "this byte, on
>its own", and others to mean "this byte, plus the next byte, together",
>and so forth, up to four bytes.

So some of the bits in a UTF-8 byte, instead of holding the character's ordinal value, are actually used to mark whether the bytes are separate characters or belong together? Can you give an example please? How are they separated?

>Computers are digital and work with numbers.

So character 'A' <-> 65 (as decimal, used in the charset's table) <-> 01011100 (as binary, stored on disk) <-> 0xEF (as hex, when we open the file with a hex editor). Is this how the thing works? (The above values are fictional.)
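
For what it is worth, the guesses above can be checked with a small Python 3 sketch like the one below. The helper name `describe` and the four sample characters are my own choices for illustration, not something from the replies. It prints each character's ordinal, the bits of its UTF-8 bytes, and its UTF-16 code units, so the marker bits and a surrogate pair become visible.

```python
# Python 3 sketch: inspect how a few sample characters are encoded.
# The sample characters and the helper name are illustrative assumptions.

def describe(ch):
    """Return (ordinal, UTF-8 bytes as bit strings, UTF-16 code units as hex)."""
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    bits = [format(b, "08b") for b in utf8]
    units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
    return ord(ch), bits, units

samples = [
    "A",           # U+0041, Latin capital letter A
    "\u03b1",      # U+03B1, Greek small letter alpha
    "\u4e2d",      # U+4E2D, a common Chinese ideograph
    "\U0001F600",  # U+1F600, a code point above U+FFFF
]

for ch in samples:
    cp, bits, units = describe(ch)
    print(f"U+{cp:04X}  utf-8: {' '.join(bits)} ({len(bits)} bytes)"
          f"  utf-16: {' '.join(units)}")

# How to read the UTF-8 bits:
#   0xxxxxxx  a complete one-byte character (ordinals up to U+007F)
#   110xxxxx  first byte of a 2-byte sequence (up to U+07FF)
#   1110xxxx  first byte of a 3-byte sequence (up to U+FFFF)
#   11110xxx  first byte of a 4-byte sequence (up to U+10FFFF)
#   10xxxxxx  continuation byte, never the start of a character
```

If I read this right, the byte count does follow from the ordinal alone, a one-byte character can only go up to 127 because the top bit is taken as the marker, and a surrogate pair is two 16-bit units in UTF-16 standing for one code point above U+FFFF rather than two separate characters.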