On Apr 1, 7:22 pm, Chris Rebert <c...@rebertia.com> wrote:
> 2010/4/1 Mister Yu <eryan...@gmail.com>:
> > hi experts,
> >
> > i m new to python, i m writing crawlers to extract data from some
> > chinese websites, and i run into a encoding problem.
> >
> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> > which is encoded in "gb2312",
>
> No! Instances of type 'unicode' (i.e. strings with a leading 'u')
> ***aren't encoded at all***.
>
> > but i have no idea of how to convert it
> > back to utf-8
>
> To convert u'\xd6\xd0\xce\xc4' to UTF-8, do
> u'\xd6\xd0\xce\xc4'.encode('utf-8')
>
> > to re-create this one is easy:
> >
> > this will work
> > ============================
> > >>> su = u"中文".encode('gb2312')
> > >>> su
> > u
> > >>> print su.decode('gb2312')
> > 中文 -> (same as the original string)
> >
> > ============================
> > but this doesn't,why
> > ===========================
> > >>> su = u'\xd6\xd0\xce\xc4'
> > >>> su
> > u'\xd6\xd0\xce\xc4'
> > >>> print su.decode('gb2312')
>
> You can't decode a unicode string, it's already been decoded!
>
> One decodes a bytestring to get a unicode string.
> One **encodes** a unicode string to get a bytestring.
>
> So the last line of your example should be:
> print su.encode('gb2312')
>
> Only call .encode() on things of type 'unicode'.
> Only call .decode() on things of type 'str'.
> [When using Python 2.x that is. Python 3.x renames the types in question.]
>
> Cheers,
> Chris
> --http://blog.rebertia.com
Hi, thanks for the tips. But I'm still not sure how to convert the unicode object u'\xd6\xd0\xce\xc4' back to "中文", the string it is supposed to be.

Thanks. Sorry, I'm really new to Python.

--
http://mail.python.org/mailman/listinfo/python-list
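One possible way to do the conversion asked about above, offered as a sketch rather than a definitive answer: every code point in u'\xd6\xd0\xce\xc4' is below 256, which suggests (this is an assumption about how the crawler produced it) that raw GB2312 bytes were decoded as Latin-1 somewhere along the way. Latin-1 maps each code point 0-255 to the byte with the same value, so encoding with 'latin-1' should give back the original bytes, which can then be decoded as 'gb2312'. The helper name `recover_text` is made up for illustration.

```python
# -*- coding: utf-8 -*-
# Sketch only: assumes the unicode object holds GB2312 bytes that were
# mistakenly decoded as Latin-1. recover_text is a hypothetical helper.

def recover_text(garbled, wrong_codec='latin-1', real_codec='gb2312'):
    """Undo a wrong decode: unicode -> original bytes -> correct unicode."""
    raw_bytes = garbled.encode(wrong_codec)   # bytes \xd6\xd0\xce\xc4
    return raw_bytes.decode(real_codec)       # decode with the real codec

su = u'\xd6\xd0\xce\xc4'
fixed = recover_text(su)             # unicode text for the two characters
utf8_bytes = fixed.encode('utf-8')   # bytestring suitable for UTF-8 output

print(repr(fixed))
print(repr(utf8_bytes))
```

If the assumption holds, `fixed` is the two-character unicode string 中文, and `utf8_bytes` is its UTF-8 encoding. The cleaner fix is probably upstream: decode the downloaded page bytes with 'gb2312' in the first place, so the garbled unicode object never gets created.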