2010/4/1 Mister Yu <eryan...@gmail.com>:
> hi experts,
>
> i m new to python, i m writing crawlers to extract data from some
> chinese websites, and i run into a encoding problem.
>
> i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> which is encoded in "gb2312",

No! Instances of type 'unicode' (i.e. strings with a leading 'u')
***aren't encoded at all***.

> but i have no idea of how to convert it
> back to utf-8

To convert u'\xd6\xd0\xce\xc4' to UTF-8, do
u'\xd6\xd0\xce\xc4'.encode('utf-8')

> to re-create this one is easy:
>
> this will work
> ============================
> >>> su = u"中文".encode('gb2312')
> >>> su
> u
> >>> print su.decode('gb2312')
> 中文 -> (same as the original string)
>
> ============================
> but this doesn't,why
> ===========================
> >>> su = u'\xd6\xd0\xce\xc4'
> >>> su
> u'\xd6\xd0\xce\xc4'
> >>> print su.decode('gb2312')

You can't decode a unicode string, it's already been decoded!
One decodes a bytestring to get a unicode string.
One **encodes** a unicode string to get a bytestring.

So the last line of your example needs an .encode(), not a .decode().
Note that su.encode('gb2312') itself raises UnicodeEncodeError here,
because u'\xd6' and the other three are Latin-1 characters that GB2312
cannot represent. Your value looks like GB2312 bytes that were decoded
as Latin-1 by mistake, so undo that step first:
print su.encode('latin-1').decode('gb2312')
Calling .encode('utf-8') on that result then gives you the UTF-8 bytes.

Only call .encode() on things of type 'unicode'.
Only call .decode() on things of type 'str'.
[When using Python 2.x that is. Python 3.x renames the types in question.]

Cheers,
Chris
--
http://blog.rebertia.com
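
The encode/decode rule from the reply can be sketched as below. This is a minimal illustration, not the poster's crawler code, and it assumes Python 3, where the 2.x type 'unicode' is called str and the 2.x type 'str' is called bytes. The helper name repair_mojibake is hypothetical. The guess that the value in the question is GB2312 bytes mis-decoded as Latin-1 is an assumption based on its code points.

```python
# Python 3 sketch: 'str' here is what Python 2 calls 'unicode',
# and 'bytes' is what Python 2 calls 'str'.

text = "\u4e2d\u6587"  # the two characters 中文, as text (not encoded at all)

# Encoding turns text into bytes.
gb_bytes = text.encode("gb2312")    # b'\xd6\xd0\xce\xc4'
utf8_bytes = text.encode("utf-8")   # b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding turns bytes back into text.
assert gb_bytes.decode("gb2312") == text

# Assumption: the value from the question is GB2312 bytes that were
# wrongly decoded as Latin-1 somewhere upstream.
su = "\xd6\xd0\xce\xc4"


def repair_mojibake(s, wrong="latin-1", right="gb2312"):
    """Hypothetical helper: undo a decode that used the wrong codec."""
    # Latin-1 maps code points 0-255 straight back to the same byte values,
    # so this recovers the original bytes before decoding them properly.
    return s.encode(wrong).decode(right)


recovered = repair_mojibake(su)          # the intended text
utf8_again = recovered.encode("utf-8")   # UTF-8 bytes for that text
```

In Python 3 the rule is enforced by the types themselves: str has no .decode() method and bytes has no .encode() method, so the mistake in the question fails immediately with an AttributeError.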