I'm guessing you're using Python 2.7 or something similar. Things are quite different in Python 3.x.

On 01/28/2012 02:47 AM, contro opinion wrote:
> as far as i know
>
> >>> u'中国'.encode('utf-8')
> '\xe4\xb8\xad\xe5\x9b\xbd'
>
> so,'\xe4\xb8\xad\xe5\x9b\xbd' is the utf-8 of '中国'

No, it is the utf-8 encoding of the unicode string u'中国'. That unicode string has two characters in it, which take two bytes each or four bytes each in memory, depending on whether your Python was compiled as a narrow (UCS-2) or wide (UCS-4) build. So it's 4 bytes or 8. The encoded version takes six bytes, because in utf-8 each of these two characters happens to take 3 bytes. In a utf-8 encoding, a character may take anywhere from one to four bytes to represent (the original design allowed sequences of up to six bytes, but the current standard stops at four).

> >>> u'中国'.encode('gbk')
> '\xd6\xd0\xb9\xfa'
> so,'\xd6\xd0\xb9\xfa' is the utf-8 of '中国'
>

(presumably the utf-8 above was just a typo on your part.)

No, it is the gbk encoding of the unicode string. In this case it takes two bytes for each character. I don't know gbk, so I don't know what range of possibilities exists.

> >>> u'中国'
> u'\u4e2d\u56fd'
>
> what is the meaning of u'\u4e2d\u56fd'?
> u'\u4e2d\u56fd' = \x4e2d\x56fd ??
>

Here you can see the exact two unicode characters. The first character has a hex representation of 4e2d, and the second has a hex representation of 56fd. It is not the same as \x4e2d\x56fd, though: \x takes exactly two hex digits, so \x4e2d would be read as \x4e (the letter N) followed by the ordinary characters 2d.

If you were on a platform that didn't have the fonts or keyboard layout for either of those characters, you could enter the string as u'\u4e2d\u56fd' and it would be exactly equivalent to entering the literal with those characters directly. For example, on my (English) keyboard, I have no easy way to enter those unicode characters; I have been copy/pasting them between windows.

Do you know how to interpret those literal strings? The u outside the quotes says the whole thing is a unicode string. That's a distinct type from a byte string, and it almost always has to be converted to a byte string before going out to the console, a file, or whatever.

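To make the byte counts above concrete, here is a minimal sketch. It assumes Python 3 rather than the 2.7 used in the quoted session, so plain quotes give a unicode string and encoded results print with a b prefix. The variable names (text, utf8_bytes, gbk_bytes, wrong) are only illustrative.

```python
# Python 3 sketch: str is the unicode type, bytes is the byte string.
text = '\u4e2d\u56fd'              # the same two characters as '中国'

# Encoding turns the two characters into a sequence of bytes.
utf8_bytes = text.encode('utf-8')  # three bytes per character here
gbk_bytes = text.encode('gbk')     # two bytes per character here

print(len(text))                   # 2 characters
print(utf8_bytes, len(utf8_bytes)) # b'\xe4\xb8\xad\xe5\x9b\xbd' 6
print(gbk_bytes, len(gbk_bytes))   # b'\xd6\xd0\xb9\xfa' 4

# Decoding with the matching codec gets the original string back.
assert utf8_bytes.decode('utf-8') == text
assert gbk_bytes.decode('gbk') == text

# \x only consumes two hex digits, so this is NOT the same string:
wrong = '\x4e2d\x56fd'
print(wrong)                       # N2dVfd
```

The same unicode string gives different byte strings depending on the codec, which is why a byte string on its own does not tell you which characters it stands for.
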
When you say print mystring, if mystring is of type unicode, the unicode characters are encoded according to some rules established by your console handler (normally the encoding reported by sys.stdout.encoding; here's where I get pretty fuzzy), which it thinks will get them to the console display correctly.

Inside the unicode string literal, you can have regular characters or escape sequences. For these two particular characters, Python's repr() function chooses to use the escape sequences. The backslash identifies it as an escape sequence. The u immediately after says that this particular escape sequence is a four-digit hex representation. Those four hex digits (0-9 and a-f) represent a 16-bit number, which is the ord() of the unicode character.

The whole concept of unicode is that it has enough code points that nearly all characters of nearly all languages can be uniquely represented. When you've got a unicode string, you can search it and substring it, and be sure that every operation deals with characters, and not some variable-length representation of characters. (The exception is a narrow build, where characters above U+FFFF occupy two positions.)

In Python 3.x, unicode is the default string type, and you have to use b'xxx' notation to explicitly ask for bytes. Some things become much simpler, and even more obvious, in that environment.

--
DaveA
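
As a rough illustration of the escape-sequence and ord() points above, here is a small sketch. It assumes Python 3, where an unprefixed literal is already a unicode string; ascii() is used because it shows the escaped form that Python 2's repr() printed. The variable names are only illustrative.

```python
# Python 3 sketch: str literals are unicode; bytes need a b'' prefix.
text = '\u4e2d\u56fd'

# The four hex digits after \u are the code point, i.e. ord() of the character.
code_points = [format(ord(ch), '04x') for ch in text]
print(code_points)                 # ['4e2d', '56fd']

# ascii() shows the escaped form of the string, quotes included.
escaped = ascii(text)
print(escaped)                     # '\u4e2d\u56fd'

# Indexing and searching a unicode string work on whole characters.
first_char = text[0]
position = text.find('\u56fd')     # 1: the second character

# The encoded form is a separate type, and indexing it yields single bytes.
data = text.encode('utf-8')
first_byte = data[0]               # 228, i.e. 0xe4, only part of a character
print(type(text).__name__, type(data).__name__)  # str bytes
```

Indexing the byte string hands back only the first of the three bytes that make up the first character, which is the variable-length detail a unicode string keeps out of your way.
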