newbie with a encoding question, please help
hi experts,

i'm new to python. i'm writing crawlers to extract data from some chinese websites, and i've run into an encoding problem.

i have a unicode object, which looks like this: u'\xd6\xd0\xce\xc4'. it is encoded in "gb2312", but i have no idea how to convert it back to utf-8.

to re-create this one is easy:

this will work:

>>> su = u"中文".encode('gb2312')
>>> su
u
>>> print su.decode('gb2312')
中文 -> (same as the original string)

but this doesn't, why?
===
>>> su = u'\xd6\xd0\xce\xc4'
>>> su
u'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
===

thank you
--
http://mail.python.org/mailman/listinfo/python-list
Re: newbie with a encoding question, please help
On Apr 1, 7:22 pm, Chris Rebert wrote:
> 2010/4/1 Mister Yu:
> > hi experts,
> >
> > i m new to python, i m writing crawlers to extract data from some
> > chinese websites, and i run into a encoding problem.
> >
> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> > which is encoded in "gb2312",
>
> No! Instances of type 'unicode' (i.e. strings with a leading 'u')
> ***aren't encoded at all***.
>
> > but i have no idea of how to convert it
> > back to utf-8
>
> To convert u'\xd6\xd0\xce\xc4' to UTF-8, do
> u'\xd6\xd0\xce\xc4'.encode('utf-8')
>
> > to re-create this one is easy:
> >
> > this will work
> >
> > >>> su = u"中文".encode('gb2312')
> > >>> su
> > u
> > >>> print su.decode('gb2312')
> > 中文 -> (same as the original string)
> >
> > but this doesn't,why
> > ===
> > >>> su = u'\xd6\xd0\xce\xc4'
> > >>> su
> > u'\xd6\xd0\xce\xc4'
> > >>> print su.decode('gb2312')
>
> You can't decode a unicode string, it's already been decoded!
>
> One decodes a bytestring to get a unicode string.
> One **encodes** a unicode string to get a bytestring.
>
> So the last line of your example should be:
> print su.encode('gb2312')
>
> Only call .encode() on things of type 'unicode'.
> Only call .decode() on things of type 'str'.
> [When using Python 2.x that is. Python 3.x renames the types in question.]
>
> Cheers,
> Chris
> --http://blog.rebertia.com

hi, thanks for the tips.

but i'm still not sure how to convert a unicode object ** u'\xd6\xd0\xce\xc4' ** back to "中文", the string it's supposed to be?

thanks. sorry, i'm really new to python.
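
For readers on Python 3, the encode/decode rule Chris describes can be sketched roughly as below. This is only an illustration, not code from the thread. It assumes Python 3, where the 2.x 'unicode' type is called str and the 2.x 'str' type is called bytes, and the variable names are made up for the example.

```python
# Python 3 sketch (assumption: 3.x naming, str = text, bytes = raw bytes).
text = "\u4e2d\u6587"  # the two characters 中文, written as escapes

# Encoding turns text into bytes in one specific character encoding.
gb2312_bytes = text.encode("gb2312")
utf8_bytes = text.encode("utf-8")

# Decoding turns bytes back into text; the codec has to match the one
# that produced the bytes.
roundtrip_gb2312 = gb2312_bytes.decode("gb2312")
roundtrip_utf8 = utf8_bytes.decode("utf-8")

print(repr(gb2312_bytes))
print(repr(utf8_bytes))
print(roundtrip_gb2312 == text, roundtrip_utf8 == text)
```

The same two characters come out as different byte sequences under the two codecs, which is why the bytes alone never say which encoding they are in.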
Re: newbie with a encoding question, please help
On Apr 1, 8:13 pm, Chris Rebert wrote:
> On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu wrote:
> > On Apr 1, 7:22 pm, Chris Rebert wrote:
> >> 2010/4/1 Mister Yu:
> >> > hi experts,
> >> >
> >> > i m new to python, i m writing crawlers to extract data from some
> >> > chinese websites, and i run into a encoding problem.
> >> >
> >> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> >> > which is encoded in "gb2312",
> >
> > hi, thanks for the tips.
> >
> > but i m still not very sure how to convert a unicode object **
> > u'\xd6\xd0\xce\xc4 ** back to "中文" the string it supposed to be?
>
> Ah, my apologies! I overlooked something (sorry, it's early in the
> morning where I am).
> What you have there is ***really*** screwy. It's the 2 Chinese
> characters, encoded in gb2312, and then somehow cast *directly* into a
> 'unicode' string (which ought never to be done).
>
> In answer to your original question (after some experimentation):
> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> If possible, I'd look at the code that's giving you that funky
> "string" in the first place and see if it can be fixed to give you
> either a proper bytestring or proper unicode string rather than the
> bastardized mess you're currently having to deal with.
>
> Apologies again and Cheers,
> Chris
> --http://blog.rebertia.com

Hi Chris,

thanks for the great tips! it works like a charm.

i'm using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html) to write my crawler. when it extracts data with xpath, it puts the chinese characters directly into the unicode object.

thanks again chris, and have a good april fool's day.

Cheers,
Yu
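
A rough Python 3 counterpart of Chris's workaround, for anyone hitting the same symptom on a current interpreter. It is a sketch, and it assumes the string really holds GB2312 bytes that were widened one-for-one into code points. The function name repair_gb2312_mojibake is hypothetical, not something from the thread or from any library.

```python
def repair_gb2312_mojibake(mangled):
    """Undo a bytes-cast-into-text mistake (hypothetical helper).

    Assumes every character in `mangled` is really one GB2312 byte.
    """
    # Map each code point back to the single byte it came from.
    # bytes() raises ValueError if any code point is above 255.
    raw = bytes(ord(c) for c in mangled)
    # Now decode the recovered bytes with the codec they were written in.
    return raw.decode("gb2312")


mangled = "\xd6\xd0\xce\xc4"
fixed = repair_gb2312_mojibake(mangled)
utf8_bytes = fixed.encode("utf-8")  # only needed if UTF-8 bytes are the goal
print(fixed)
```

As in the 2.x version, the helper only makes sense for strings whose code points all fit in one byte; anything else was not produced by this particular mistake.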
Re: newbie with a encoding question, please help
On Apr 1, 9:31 pm, Stefan Behnel wrote:
> Mister Yu, 01.04.2010 14:26:
> > On Apr 1, 8:13 pm, Chris Rebert wrote:
> >> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> >> unicode_string = gb2312_bytes.decode('gb2312')
> >> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> Simplifying this hack a bit:
>
> gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8')
>
> Although I have to wonder why you want a UTF-8 encoded byte string as
> output instead of Unicode.
>
> >> If possible, I'd look at the code that's giving you that funky
> >> "string" in the first place and see if it can be fixed to give you
> >> either a proper bytestring or proper unicode string rather than the
> >> bastardized mess you're currently having to deal with.
>
> > thanks for the great tips! it works like a charm.
>
> I hope you're aware that it's a big ugly hack, though. You should really
> try to fix your input instead.
>
> > i m using the Scrapy project(http://doc.scrapy.org/intro/
> > tutorial.html) to write my crawler, when it extract data with xpath,
> > it puts the chinese characters directly into the unicode object.
>
> My guess is that the HTML page you are parsing is broken and doesn't
> specify its encoding. In that case, all that scrapy can do is guess, and it
> seems to have guessed incorrectly.
>
> You should check if there is a way to tell scrapy about the expected page
> encoding, so that it can return correctly decoded unicode strings directly,
> instead of resorting to dirty hacks that may or may not work depending on
> the page you are parsing.
>
> Stefan

Hi Stefan,

i don't think the page is broken. you can take a look at the page http://www.7176.com/Sections/Genre/Comedy , it's kind of like a chinese IMDB rip-off.

from what i can see in the source code of the page header, it contains the encoding info:

类别为 剧情 的电影列表 第1页http://www.7176.com/images/ pro.css" rel=stylesheet>

maybe i should take a look at the source code of Scrapy, but i've been learning python for less than a week, so i'm not sure i can understand the source.

earlier, Chris's workaround was looking pretty good until it met a string like this:

>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

the digits don't get encoded, so it messes up the code.

any ideas?

once again, thanks for everybody's help
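
One way to read the ValueError above: the string in that last example already contains code points above 255 (u'\u4e00' and so on), so it looks like properly decoded text rather than bytes cast into a unicode object, and chr() cannot turn those code points into single bytes. A guarded version of Stefan's ISO-8859-1 trick might look like the sketch below. It assumes Python 3, the name maybe_repair is hypothetical, and the check is only a heuristic; fixing the decoding at the source, as Stefan suggests, remains the more reliable route.

```python
def maybe_repair(text, codec="gb2312"):
    """Repair bytes-cast-into-text strings, leave real text alone.

    Hypothetical helper; the check is a heuristic, not a guarantee.
    """
    try:
        # Only strings whose code points all fit in one byte can be the
        # result of the "bytes cast directly into text" mistake.
        raw = text.encode("iso-8859-1")
    except UnicodeEncodeError:
        # Contains characters above U+00FF: treat it as already decoded.
        return text
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        # The bytes are not valid in the expected codec: leave unchanged.
        return text


mangled = "\xd6\xd0\xce\xc4"
already_fine = "\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db"

print(maybe_repair(mangled))
print(maybe_repair(already_fine) == already_fine)
```

The caveat is that genuine Latin-1 text whose bytes happen to form valid GB2312 sequences would be "repaired" by mistake, so a check like this is only reasonable when the input is known to come from GB2312 pages.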