On Apr 1, 9:31 pm, Stefan Behnel <stefan...@behnel.de> wrote:
> Mister Yu, 01.04.2010 14:26:
> > On Apr 1, 8:13 pm, Chris Rebert wrote:
> >> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> >> unicode_string = gb2312_bytes.decode('gb2312')
> >> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> Simplifying this hack a bit:
>
> gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8')
>
> Although I have to wonder why you want a UTF-8 encoded byte string as
> output instead of Unicode.
>
> >> If possible, I'd look at the code that's giving you that funky
> >> "string" in the first place and see if it can be fixed to give you
> >> either a proper bytestring or proper unicode string rather than the
> >> bastardized mess you're currently having to deal with.
>
> > thanks for the great tips! it works like a charm.
>
> I hope you're aware that it's a big ugly hack, though. You should really
> try to fix your input instead.
>
> > i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
> > to write my crawler; when it extracts data with xpath, it puts the
> > chinese characters directly into the unicode object.
>
> My guess is that the HTML page you are parsing is broken and doesn't
> specify its encoding. In that case, all that scrapy can do is guess, and
> it seems to have guessed incorrectly.
>
> You should check if there is a way to tell scrapy about the expected page
> encoding, so that it can return correctly decoded unicode strings
> directly, instead of resorting to dirty hacks that may or may not work
> depending on the page you are parsing.
>
> Stefan
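[Editor's note: Stefan's round trip, restated as a sketch in Python 3 terms (the thread uses Python 2, where str and unicode are distinct; in Python 3 every str is Unicode, so the mojibake arrives as a str of Latin-1-range code points):]

```python
# Mojibake: GB2312 bytes that were mis-decoded as Latin-1/ISO-8859-1,
# leaving a Unicode string whose "characters" are really raw bytes.
mojibake = '\xd6\xd0\xce\xc4'

# Re-encode as Latin-1 to recover the original bytes unchanged...
gb2312_bytes = mojibake.encode('iso-8859-1')    # b'\xd6\xd0\xce\xc4'

# ...then decode with the encoding the bytes were actually written in.
unicode_string = gb2312_bytes.decode('gb2312')  # '中文'

# Finally, encode to UTF-8 only if a byte string is really what's needed.
utf8_bytes = unicode_string.encode('utf-8')

print(unicode_string)
```

This works because ISO-8859-1 maps code points 0–255 one-to-one onto bytes 0–255, so the encode step reverses the bad decode exactly.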
Hi Stefan, I don't think the page is broken. You can take a look at it
yourself: http://www.7176.com/Sections/Genre/Comedy -- it's kind of a
Chinese IMDB rip-off. From what I can see in the source code, the page
header does contain the encoding info:

<HTML><head><meta http-equiv="Content-Type" content="text/html;
charset=gb2312" /><meta http-equiv="Content-Language" content="zh-CN" />
<meta content="all" name="robots" />
<meta name="author" content="admin(at)7176.com" />
<meta name="Copyright" content="www.7176.com" />
<meta content="类别为 剧情 的电影列表 第1页" name="keywords" />
<TITLE>类别为 剧情 的电影列表 第1页</TITLE>
<LINK href="http://www.7176.com/images/pro.css" rel=stylesheet></HEAD>

Maybe I should take a look at the source code of Scrapy, but I've been a
Python newbie for no more than a week, so I'm not sure I can understand it.

Chris's earlier workaround was looking pretty good until it hit a string
like this:

>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

This string seems to be proper Unicode already -- the code points are real
CJK characters above 255, so chr(ord(c)) blows up on them. Any ideas?

Once again, thanks for everybody's help!
-- 
http://mail.python.org/mailman/listinfo/python-list
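[Editor's note: the string su in the traceback above is already correctly decoded Unicode, so it only needs a plain UTF-8 encode; the Latin-1 round trip applies solely to mojibake. A sketch in Python 3 terms, with `fix_if_mojibake` being an illustrative helper, not part of Scrapy or any library:]

```python
# su is proper Unicode (code points like U+4E00 are real CJK characters,
# not raw GB2312 bytes smuggled into the 0-255 range), so this just works:
su = '一二三四 12345 一二三四'
utf8_bytes = su.encode('utf-8')

# The Latin-1 round-trip hack is only valid when every code point fits in
# a byte -- i.e. when the string is bytes mis-decoded as Latin-1.
def fix_if_mojibake(s, real_encoding='gb2312'):
    """Heuristic: round-trip apparent mojibake, pass real Unicode through."""
    if all(ord(c) < 256 for c in s):
        # Looks like mis-decoded bytes: undo the bad Latin-1 decode,
        # then decode with the encoding the bytes actually used.
        return s.encode('latin-1').decode(real_encoding)
    return s  # already proper Unicode; leave it alone

print(fix_if_mojibake('\xd6\xd0\xce\xc4'))  # recovers the mojibake case
print(fix_if_mojibake(su))                  # passes real Unicode through
```

The "all code points < 256" test is a rough heuristic (pure-ASCII text passes harmlessly, since ASCII bytes decode identically under GB2312), not a reliable general-purpose detector.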