newbie with a encoding question, please help

2010-04-01 Thread Mister Yu
hi experts,

i m new to python, i m writing crawlers to extract data from some
chinese websites, and i run into a encoding problem.

i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
which is encoded in "gb2312", but i have no idea of how to convert it
back to utf-8

to re-create this one is easy:

this will work

>>> su = u"中文".encode('gb2312')
>>> su
'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
中文    -> (same as the original string)


but this doesn't,why
===
>>> su = u'\xd6\xd0\xce\xc4'
>>> su
u'\xd6\xd0\xce\xc4'
>>> print su.decode('gb2312')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position
0-3: ordinal not in range(128)
===

thank you
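
A minimal sketch of why the second session fails, written for Python 3 as an assumption (the session above is Python 2). In Python 2, calling .decode() on a unicode object first encodes it implicitly with the default 'ascii' codec, and that hidden step is what raises UnicodeEncodeError. The snippet below performs that step explicitly:

```python
# Python 3 sketch (assumption: the thread itself uses Python 2).
# Python 2's unicode.decode() first encodes implicitly with 'ascii';
# here that hidden step is written out explicitly.
su = '\xd6\xd0\xce\xc4'  # four code points below 256, not GB2312 bytes

error_name = None
error_range = None
try:
    su.encode('ascii')
except UnicodeEncodeError as exc:
    # None of the four characters is in the ASCII range.
    error_name = type(exc).__name__
    error_range = (exc.start, exc.end)

print(error_name, error_range)
```

The error comes from the encode step, which is why the traceback mentions "can't encode" even though the code asked for a decode.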
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: newbie with a encoding question, please help

2010-04-01 Thread Mister Yu
On Apr 1, 7:22 pm, Chris Rebert wrote:
> 2010/4/1 Mister Yu:
>
> > hi experts,
>
> > i m new to python, i m writing crawlers to extract data from some
> > chinese websites, and i run into a encoding problem.
>
> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> > which is encoded in "gb2312",
>
> No! Instances of type 'unicode' (i.e. strings with a leading 'u')
> ***aren't encoded at all***.
>
> > but i have no idea of how to convert it
> > back to utf-8
>
> To convert u'\xd6\xd0\xce\xc4' to UTF-8, do 
> u'\xd6\xd0\xce\xc4'.encode('utf-8')
>
>
>
> > to re-create this one is easy:
>
> > this will work
> > 
> >>>> su = u"中文".encode('gb2312')
> >>>> su
> > '\xd6\xd0\xce\xc4'
> >>>> print su.decode('gb2312')
> > 中文    -> (same as the original string)
>
> > 
> > but this doesn't,why
> > ===
> >>>> su = u'\xd6\xd0\xce\xc4'
> >>>> su
> > u'\xd6\xd0\xce\xc4'
> >>>> print su.decode('gb2312')
>
> You can't decode a unicode string, it's already been decoded!
>
> One decodes a bytestring to get a unicode string.
> One **encodes** a unicode string to get a bytestring.
>
> So the last line of your example should be:
> print su.encode('gb2312')
>
> Only call .encode() on things of type 'unicode'.
> Only call .decode() on things of type 'str'.
> [When using Python 2.x that is. Python 3.x renames the types in question.]
>
> Cheers,
> Chris
> --http://blog.rebertia.com

hi, thanks for the tips.

but i m still not very sure how to convert a unicode object  **
u'\xd6\xd0\xce\xc4' ** back to "中文", the string it is supposed to be?

thanks.

sorry i m really new to python.
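
For reference, the encode/decode directions from the reply quoted above can be sketched as follows. This is Python 3 syntax, which is an assumption on top of the thread: there the 'unicode' type is called str and the byte-oriented 'str' type is called bytes.

```python
# Python 3 sketch: text is str, encoded data is bytes.
text = '中文'                           # text (Unicode code points)

gb_bytes = text.encode('gb2312')        # text -> bytes, GB2312 encoding
utf8_bytes = text.encode('utf-8')       # text -> bytes, UTF-8 encoding

round_trip = gb_bytes.decode('gb2312')  # bytes -> text again

print(gb_bytes)
print(utf8_bytes)
print(round_trip == text)
```

The same two characters give different byte sequences under each encoding, and only bytes can be decoded.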


Re: newbie with a encoding question, please help

2010-04-01 Thread Mister Yu
On Apr 1, 8:13 pm, Chris Rebert wrote:
> On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu wrote:
> > On Apr 1, 7:22 pm, Chris Rebert wrote:
> >> 2010/4/1 Mister Yu:
> >> > hi experts,
>
> >> > i m new to python, i m writing crawlers to extract data from some
> >> > chinese websites, and i run into a encoding problem.
>
> >> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
> >> > which is encoded in "gb2312",
> 
> > hi, thanks for the tips.
>
> > but i m still not very sure how to convert a unicode object  **
> > u'\xd6\xd0\xce\xc4' ** back to "中文", the string it is supposed to be?
>
> Ah, my apologies! I overlooked something (sorry, it's early in the
> morning where I am).
> What you have there is ***really*** screwy. It's the 2 Chinese
> characters, encoded in gb2312, and then somehow cast *directly* into a
> 'unicode' string (which ought never to be done).
>
> In answer to your original question (after some experimentation):
> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> unicode_string = gb2312_bytes.decode('gb2312')
> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> If possible, I'd look at the code that's giving you that funky
> "string" in the first place and see if it can be fixed to give you
> either a proper bytestring or proper unicode string rather than the
> bastardized mess you're currently having to deal with.
>
> Apologies again and Cheers,
> Chris
> --http://blog.rebertia.com

Hi Chris,

thanks for the great tips! it works like a charm.

i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
to write my crawler. when it extracts data with xpath, it puts the
chinese characters directly into the unicode object.

thanks again chris, and have a good april fool day.

Cheers,
Yu
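
A Python 3 rendering of the workaround quoted above, offered as a sketch rather than a recommended fix. It assumes every character in the garbled string has a code point below 256, so each one can be mapped back to the byte of the same value.

```python
# Python 3 sketch of the quoted workaround (the thread uses Python 2).
garbled = '\xd6\xd0\xce\xc4'   # GB2312 bytes mistakenly stored as code points

# Map every code point back to the byte with the same value.
# bytes() raises ValueError if any code point is above 255.
gb2312_bytes = bytes(ord(c) for c in garbled)

unicode_string = gb2312_bytes.decode('gb2312')   # the real text
utf8_bytes = unicode_string.encode('utf-8')      # UTF-8, as requested

print(unicode_string)
```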


Re: newbie with a encoding question, please help

2010-04-01 Thread Mister Yu
On Apr 1, 9:31 pm, Stefan Behnel wrote:
> Mister Yu, 01.04.2010 14:26:
>
> > On Apr 1, 8:13 pm, Chris Rebert wrote:
> >> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
> >> unicode_string = gb2312_bytes.decode('gb2312')
> >> utf8_bytes = unicode_string.encode('utf-8') #as you wanted
>
> Simplifying this hack a bit:
>
>  gb2312_bytes = u'\xd6\xd0\xce\xc4'.encode('ISO-8859-1')
>  unicode_string = gb2312_bytes.decode('gb2312')
>  utf8_bytes = unicode_string.encode('utf-8')
>
> Although I have to wonder why you want a UTF-8 encoded byte string as
> output instead of Unicode.
>
> >> If possible, I'd look at the code that's giving you that funky
> >> "string" in the first place and see if it can be fixed to give you
> >> either a proper bytestring or proper unicode string rather than the
> >> bastardized mess you're currently having to deal with.
>
> > thanks for the great tips! it works like a charm.
>
> I hope you're aware that it's a big ugly hack, though. You should really
> try to fix your input instead.
>
> > i m using the Scrapy project (http://doc.scrapy.org/intro/tutorial.html)
> > to write my crawler. when it extracts data with xpath, it puts the
> > chinese characters directly into the unicode object.
>
> My guess is that the HTML page you are parsing is broken and doesn't
> specify its encoding. In that case, all that scrapy can do is guess, and it
> seems to have guessed incorrectly.
>
> You should check if there is a way to tell scrapy about the expected page
> encoding, so that it can return correctly decoded unicode strings directly,
> instead of resorting to dirty hacks that may or may not work depending on
> the page you are parsing.
>
> Stefan

Hi Stefan,

i don't think the page is broken somehow. you can take a look at
the page http://www.7176.com/Sections/Genre/Comedy , it's kinda like
a chinese IMDB rip-off

from what i can see from the source code of the page header, it
contains the coding info:
 
<title>类别为 剧情 的电影列表 第1页</title><link href="http://www.7176.com/images/pro.css" rel=stylesheet>
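
If a page does declare its encoding, one option is to decode the raw bytes with that encoding before extracting anything. The sketch below uses only the standard library and Python 3. It is not Scrapy's API, and the sample bytes and the helper name sniff_charset are hypothetical, made up for illustration rather than copied from the real page.

```python
import re

# Hypothetical page bytes with a GB2312 charset declaration.
raw_page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=gb2312">'
            b'<title>\xd6\xd0\xce\xc4</title></head></html>')

def sniff_charset(raw, default='utf-8'):
    """Return the charset named near the top of the page, or the default."""
    match = re.search(rb'charset=["\']?([A-Za-z0-9_-]+)', raw[:2048])
    return match.group(1).decode('ascii').lower() if match else default

encoding = sniff_charset(raw_page)
page_text = raw_page.decode(encoding)   # bytes -> text, done exactly once
title = re.search(r'<title>(.*?)</title>', page_text).group(1)

print(encoding, title)
```

Decoding once at the boundary means later code only handles real text, so no repair step is needed.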

maybe i should take a look at the source code of Scrapy, but i m just
not more than a week's newbie of python. not sure if i can understand
the source.

Chris's earlier workaround was working pretty well until it met a
string like this:
>>> su = u'一二三四 12345 一二三四'
>>> su
u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'
>>> gb2312_bytes = ''.join([chr(ord(c)) for c in u'\u4e00\u4e8c\u4e09\u56db 12345 \u4e00\u4e8c\u4e09\u56db'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: chr() arg not in range(256)

the digits don't get encoded, so it messes up the code.

any ideas?

once again, thanks for everybody's help
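
One possible way to handle both kinds of input is sketched below in Python 3, under the assumption that some extracted strings are already correct Unicode while others hold GB2312 bytes stored as code points. The ValueError above appears because characters such as u'\u4e00' have code points above 255, so they are already real text and need no repair. The helper name repair_gb2312_mojibake is hypothetical, and the check is a heuristic rather than a guaranteed fix.

```python
def repair_gb2312_mojibake(text):
    """Hypothetical helper: undo GB2312 bytes stored as code points.

    Strings that are already proper Unicode are returned unchanged.
    """
    try:
        # ISO-8859-1 maps code points 0-255 to the same byte values;
        # anything above 255 cannot be garbled bytes, so leave it alone.
        raw = text.encode('iso-8859-1')
    except UnicodeEncodeError:
        return text
    try:
        return raw.decode('gb2312')
    except UnicodeDecodeError:
        return text   # not valid GB2312 either; leave it alone

print(repair_gb2312_mojibake('\xd6\xd0\xce\xc4'))
print(repair_gb2312_mojibake('一二三四 12345 一二三四'))
```

Genuine Latin-1 text whose bytes happen to form valid GB2312 pairs would be converted by mistake, so fixing the decoding at the source remains the safer route.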
