Okay, I'm getting really frustrated with Python's Unicode handling. I've tried everything I can think of and I can't escape Unicode(En|De)codeError no matter what I do.
Could someone explain to me what I'm doing wrong here, so I can hope to throw light on the myriad of similar problems I'm having? Thanks :-)

Python 2.4.1 (#2, May  6 2005, 11:22:24)
[GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.getdefaultencoding()
'utf-8'
>>> import htmlentitydefs
>>> char = htmlentitydefs.entitydefs["copy"]  # this is an HTML © - a copyright symbol
>>> print char
©
>>> str = u"Apple"
>>> print str
Apple
>>> str + char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
>>> a = str+char
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: unexpected code byte
>>>

Basically my app is a search engine - I'm grabbing content from pages using HTMLParser and storing it in a database, but I'm running into these problems all over the shop (from decoding the entities to calling str.lower()). I don't know what encoding my pages are coming in as; I'm happy enough to accept that they're either UTF-8 or latin-1 with entities.

Any help would be great. I just hope that I have a brainwave over the weekend, because I've lost two days to Unicode errors now. It's even worse that I've written the same app in PHP before with none of these problems - and PHP4 doesn't even support Unicode.

Cheers
-Rob
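
A minimal sketch of the "either UTF-8 or latin-1 with entities" handling described above, in case it helps make the question concrete. It assumes Python 3, where bytes and text are separate types, rather than the 2.4 interpreter in the session; in Python 2 the same idea applies, with str.decode() producing a unicode object. The helper names to_text and entity_to_text are hypothetical, not part of any library. The byte 0xa9 on its own is not valid UTF-8, which is why the concatenation in the session fails; it is the copyright sign only when read as latin-1.

```python
from html.entities import name2codepoint  # Python 3 home of htmlentitydefs


def to_text(raw):
    """Decode page bytes, trying UTF-8 first and falling back to Latin-1.

    to_text is a hypothetical helper name used only for this sketch.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this cannot fail.
        return raw.decode("latin-1")


def entity_to_text(name):
    """Return the character for a named HTML entity such as "copy"."""
    return chr(name2codepoint[name])


latin1_copy = b"\xa9"      # copyright sign as a single Latin-1 byte
utf8_copy = b"\xc2\xa9"    # the same character encoded as UTF-8

# Decode first, then concatenate: both operands are text at this point.
combined = "Apple" + to_text(latin1_copy)
```

The point of the sketch is to decode bytes to text once, at the boundary where the page content comes in, and to work only with text after that (concatenation, lower(), and so on). The fallback is a guess, not detection: a latin-1 page whose bytes happen to form valid UTF-8 would be decoded as UTF-8.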