Robin Haswell wrote:
> Okay I'm getting really frustrated with Python's Unicode handling, I'm
> trying everything I can think of and I can't escape Unicode(En|De)codeError
> no matter what I try.

Have you read any of the documentation about Python's Unicode support? E.g.,
http://effbot.org/zone/unicode-objects.htm

> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May 6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'

How did this happen? It's supposed to be 'ascii' and not user-settable.

> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> ©
> >>> str = u"Apple"
> >>> print str
> Apple
> >>> str + char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte
> >>> a = str+char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte

The values in htmlentitydefs.entitydefs are encoded in latin-1 (or are numeric
entities which you still have to parse). So decode using the latin-1 codec.

-- 
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth."
  -- Umberto Eco
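
P.S. To make the latin-1 suggestion concrete, here is a minimal sketch. It hard-codes the single byte 0xA9 from the traceback instead of importing htmlentitydefs, so it does not depend on that module; the names raw, char, and result are just illustrative. The b"" prefix assumes Python 2.6 or later; on 2.4 you would write the literal as "\xa9".

```python
# The byte from the traceback (0xa9), written literally. This stands in for
# the value of htmlentitydefs.entitydefs["copy"] in the session quoted above.
raw = b"\xa9"

# On its own, 0xA9 is not a valid UTF-8 sequence, so decoding it as UTF-8
# fails. This is the same failure the implicit conversion in
# `str + char` runs into when the default encoding is utf-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Decoding explicitly as latin-1 maps the byte to U+00A9 (COPYRIGHT SIGN).
char = raw.decode("latin-1")

# Now both operands are unicode objects and concatenation works.
result = u"Apple" + char
```

The point is to decode byte strings to unicode explicitly, with the codec you know they are in, before mixing them with unicode objects. Entries that are numeric entities would still need to be parsed separately.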