Re: newbie with a encoding question, please help

Chris Rebert Thu, 01 Apr 2010 05:16:42 -0700

On Thu, Apr 1, 2010 at 4:38 AM, Mister Yu <[email protected]> wrote:
> On Apr 1, 7:22 pm, Chris Rebert <[email protected]> wrote:
>> 2010/4/1 Mister Yu <[email protected]>:
>> > hi experts,
>>
>> > i m new to python, i m writing crawlers to extract data from some
>> > chinese websites, and i run into a encoding problem.
>>
>> > i have a unicode object, which looks like this u'\xd6\xd0\xce\xc4'
>> > which is encoded in "gb2312",
<snip>
> hi, thanks for the tips.
>
> but i m still not very sure how to convert a unicode object  **
> u'\xd6\xd0\xce\xc4' ** back to "中文", the string it is supposed to be?


Ah, my apologies! I overlooked something (sorry, it's early in the
morning where I am).
What you have there is ***really*** screwy. It's the 2 Chinese
characters, encoded in gb2312, with each of the 4 resulting bytes then
somehow cast *directly* to the code point of the same value in a
'unicode' string (which ought never to be done).

In answer to your original question (after some experimentation):
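# Python 2: turn each code point (all < 256 here) back into the byte it came from
gb2312_bytes = ''.join([chr(ord(c)) for c in u'\xd6\xd0\xce\xc4'])
unicode_string = gb2312_bytes.decode('gb2312')  # u'\u4e2d\u6587'
utf8_bytes = unicode_string.encode('utf-8')  # as you wanted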
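
A shorter way to do the same thing, as a rough sketch: since every
code point in the broken string is below 256, encoding it as latin-1
should hand back the original bytes one-for-one, and this spelling also
works on Python 3. The helper name fix_mojibake is made up for
illustration, and it assumes the underlying bytes really are gb2312.

```python
def fix_mojibake(text, real_encoding="gb2312"):
    """Recover text whose bytes were widened one-for-one into code points.

    fix_mojibake is a hypothetical helper name, not a library function.
    """
    # latin-1 maps code points 0-255 straight to bytes 0-255, so this
    # recovers the original byte sequence (raises if any code point > 255)
    raw = text.encode("latin-1")
    # now decode those bytes with the encoding they were actually written in
    return raw.decode(real_encoding)


fixed = fix_mojibake(u"\xd6\xd0\xce\xc4")
utf8_bytes = fixed.encode("utf-8")
```

If the string contains anything above U+00FF, the encode step raises
UnicodeEncodeError, which is a useful sign that the data was mangled in
some other way.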

If possible, I'd look at the code that's giving you that funky
"string" in the first place and see if it can be fixed to give you
either a proper bytestring or proper unicode string rather than the
bastardized mess you're currently having to deal with.
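
I can't see the crawler code, so this is only a guess at what the fix
upstream might look like: keep the page body as raw bytes and decode it
once with the page's real charset. The page_bytes value below is a
stand-in for whatever the crawler actually reads, not something taken
from your program.

```python
# Stand-in for the raw body a crawler would read from a gb2312 page
# (assumed data, chosen to match the 4 bytes from this thread)
page_bytes = b"\xd6\xd0\xce\xc4"

# Wrong: widening each byte to the code point of the same value
# reproduces the broken string from the question
broken = page_bytes.decode("latin-1")

# Right: decode once, using the charset the page was written in
text = page_bytes.decode("gb2312")
```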

Apologies again and Cheers,
Chris
--
http://blog.rebertia.com
-- 
http://mail.python.org/mailman/listinfo/python-list
