Re: Character encoding... latin1 to utf8?

Karen Tracey Sat, 06 Dec 2008 10:37:05 -0800

On Sat, Dec 6, 2008 at 11:10 AM, Rob Hudson <[EMAIL PROTECTED]> wrote:


> [snip debug info]
>
> Now instead of \x95 I get \u2022 (which is a bullet).
>
> From here I'm not sure what the best way to proceed is... do I want
> the \u2022 version instead, in which case, should I not pass in
> unicode=True and manually decode each column?


What you've got in your DB is actually cp1252 (although MySQL calls it
latin1) data.  The values assigned to 0x95 and 0x92 in cp1252 are bullet and
curly apostrophe.  What you want in your unicode strings are the \u2022 and
\u2019 versions, since these are the correct code point assignments for
bullet and curly apostrophe in unicode.  (Unicode \x95 is the 'message
waiting' control character and \x92 is the 'private use two' control character.)
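
As a quick check of the code points mentioned above, here is a minimal
sketch using only the standard library. It is written for Python 3 (an
assumption; the session elsewhere in this message is Python 2), and the
helper name describe() is made up for illustration.

```python
import unicodedata


def describe(byte_value):
    """Show how a single byte decodes under latin1 and under cp1252."""
    raw = bytes([byte_value])
    as_latin1 = raw.decode('latin1')   # byte value becomes the same code point
    as_cp1252 = raw.decode('cp1252')   # byte value gets its Windows-1252 meaning
    return {
        'latin1': 'U+%04X' % ord(as_latin1),
        # Category 'Cc' marks a control character.
        'latin1_category': unicodedata.category(as_latin1),
        'cp1252': 'U+%04X' % ord(as_cp1252),
        'cp1252_name': unicodedata.name(as_cp1252),
    }


for b in (0x95, 0x92):
    print(hex(b), describe(b))
```

The latin1 side reports a control-character category for both bytes, while
the cp1252 side reports the bullet and the right single quotation mark.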

In case you care, the reason it does not work when you specify unicode=True
to MySQLdb is, I believe, a combination of two factors: first, the Python
codec MySQLdb chooses to decode the data coming from MySQL, and second, how
that codec behaves in the face of technically invalid data.

First, MySQLdb apparently decides to use the latin1 Python codec to decode
the data coming from MySQL into a unicode string.  On the face of it this
seems like a reasonable choice, since after all MySQL reports that the data
is 'latin1'.  However, if you read the MySQL docs, what they call 'latin1' is
really 'cp1252'
(http://dev.mysql.com/doc/refman/5.0/en/charset-we-sets.html):

MySQL's latin1 is the same as the Windows cp1252 character set. This means
it is the same as the official ISO 8859-1 or IANA (Internet Assigned Numbers
Authority) latin1, except that IANA latin1 treats the code points between
0x80 and 0x9f as "undefined," whereas cp1252, and therefore MySQL's latin1,
assign characters for those positions. For example, 0x80 is the Euro sign.

So, MySQL allows bytes in 'latin1' strings that are technically 'not
assigned', and assumes they have their cp1252-assigned meanings.  MySQLdb
uses the Python latin1 codec for data MySQL reports to be latin1 (though I
think MySQLdb would have done better to choose cp1252 here, given that MySQL
clearly documents that it really means cp1252 when it says latin1). The Python
latin1 codec, however, does not assume 'unassigned' latin1 code points have
their cp1252-assigned values.  Rather it assumes they have their
unicode-assigned values, and passes them through unscathed and without
error.  So your cp1252 bullets turn into unicode message waiting control
characters because MySQL assumes the unassigned latin1 \x95 byte has its
cp1252-assigned meaning while Python assumes it has its unicode-assigned
meaning.
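
To make the difference between the two codecs concrete, the following
sketch walks the 0x80-0x9f range and compares what each codec produces.
It assumes Python 3, and the function name compare_codecs() is
hypothetical. Bytes that the cp1252 codec declines to decode are collected
separately rather than treated as errors.

```python
def compare_codecs():
    """Compare latin1 and cp1252 decoding for every byte in 0x80..0x9f."""
    differing = {}   # byte value -> (latin1 result, cp1252 result)
    undefined = []   # byte values the cp1252 codec refuses to decode
    for value in range(0x80, 0xA0):
        raw = bytes([value])
        via_latin1 = raw.decode('latin1')   # never fails: byte N -> U+00NN
        try:
            via_cp1252 = raw.decode('cp1252')
        except UnicodeDecodeError:
            undefined.append(value)
            continue
        if via_latin1 != via_cp1252:
            differing[value] = (via_latin1, via_cp1252)
    return differing, undefined


differing, undefined = compare_codecs()
for value, (via_latin1, via_cp1252) in sorted(differing.items()):
    print('0x%02x  latin1 -> U+%04X   cp1252 -> U+%04X'
          % (value, ord(via_latin1), ord(via_cp1252)))
print('no cp1252 assignment:', [hex(v) for v in undefined])
```

Outside this range the two codecs agree, which is why the problem only
shows up for characters such as bullets, curly quotes and the Euro sign.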

Given your data is really cp1252, you need to use the cp1252 codec to decode
it.  You can see how this works better than the latin1 codec in a Python
shell:

>>> x = 'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
>>> ulatin1 = x.decode('latin1')
>>> ulatin1
u'Bullet ->\x95<- and curly apostrophe ->\x92<- in a cp1252 bytestring'
>>> print ulatin1
Bullet ->•<- and curly apostrophe ->'<- in a cp1252 bytestring
>>> ucp1252 = x.decode('cp1252')
>>> ucp1252
u'Bullet ->\u2022<- and curly apostrophe ->\u2019<- in a cp1252 bytestring'
>>> print ucp1252
Bullet ->•<- and curly apostrophe ->’<- in a cp1252 bytestring
>>>
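
If the strings have already been decoded with the latin1 codec (as happens
with unicode=True), the same idea can be applied after the fact: find the
characters in the U+0080..U+009F range, turn each back into its original
byte, and decode that byte as cp1252. This is only a sketch, written for
Python 3 as an assumption; the names C1_CONTROLS and repair_cp1252() are
invented for illustration, and characters with no cp1252 assignment are
left untouched.

```python
import re

# Characters U+0080..U+009F: what cp1252 bytes 0x80..0x9f turn into
# when they are decoded with the latin1 codec.
C1_CONTROLS = re.compile('[\u0080-\u009f]')


def repair_cp1252(text):
    """Replace latin1-decoded C1 controls with their cp1252 characters."""
    def fix(match):
        original = match.group(0)
        raw = original.encode('latin1')      # back to the original byte
        try:
            return raw.decode('cp1252')      # reinterpret it as cp1252
        except UnicodeDecodeError:
            return original                  # no cp1252 meaning: leave as-is
    return C1_CONTROLS.sub(fix, text)


broken = 'Bullet ->\x95<- and curly apostrophe ->\x92<-'
print(repr(repair_cp1252(broken)))
```

Text outside that range, including ordinary accented latin1 characters,
passes through unchanged, so the function can be run over every column
value without first checking whether it needs fixing.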


> I'm partly thinking that since this is a one-time operation (actually,
> it's a many one-time operation until we're ready to switch over to the
> new site), I could scan for any "\x" characters and manually replace
> them.  There are likely only a handful as in the above.  But how does
> one scan and replace these so the output is correct?
>

You could also just convert the character set used on the MySQL side:

http://dev.mysql.com/doc/refman/5.0/en/charset-conversion.html

Presumably, since MySQL knows it really means cp1252 for the data it calls
latin1, it would convert properly to utf-8 when you told it to.  You'd
sidestep the issues you've hit with 'latin1' meaning different things to
different pieces of software.
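
For the MySQL-side route, the page linked above describes the
ALTER TABLE ... CONVERT TO CHARACTER SET statement. The sketch below
(Python 3, as an assumption) only builds the statement text for a list of
tables so it can be reviewed before being run against a backup copy. The
function name and the table names are hypothetical, not taken from this
thread.

```python
def convert_statement(table, charset='utf8'):
    """Build an ALTER TABLE statement that converts one table's character set."""
    if not charset.isalnum():
        # The charset is interpolated unquoted, so reject anything unusual.
        raise ValueError('unexpected character set name: %r' % charset)
    # Quote the identifier with backticks, doubling any embedded backtick.
    quoted = '`' + table.replace('`', '``') + '`'
    return 'ALTER TABLE %s CONVERT TO CHARACTER SET %s' % (quoted, charset)


# Hypothetical table names, for illustration only.
for table in ('blog_entry', 'blog_comment'):
    print(convert_statement(table) + ';')
```

Printing the statements instead of executing them keeps the conversion a
deliberate, inspectable step.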

Karen
