On 2006-04-07, Robin Haswell <[EMAIL PROTECTED]> wrote:
> Okay I'm getting really frustrated with Python's Unicode handling, I'm
> trying everything I can think of an I can't escape Unicode(En|De)codeError
> no matter what I try.
>
> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May 6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'
> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> ©
> >>> str = u"Apple"
> >>> print str
> Apple
> >>> str + char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte
> >>> a = str+char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte

Try this:

    import htmlentitydefs

    # entitydefs values are Latin-1 byte strings, not unicode objects
    char = htmlentitydefs.entitydefs["copy"]
    char = unicode(char, "Latin1")

    str = u"Apple"
    print str
    print str + char

htmlentitydefs.entitydefs is "A dictionary mapping XHTML 1.0 entity
definitions to their replacement text in ISO Latin-1", so you get "char"
back as a Latin-1 byte string. When you add that to u"Apple", Python first
tries to decode it with your default encoding (utf-8 in your session), and
a lone 0xa9 byte isn't valid UTF-8, hence the UnicodeDecodeError.

So instead we use the builtin function unicode to decode it explicitly
into a unicode string (which, as I understand it, doesn't have an encoding;
it's just Unicode characters). That can be added to u"Apple" and printed out.

It prints out OK on a UTF-8 terminal, but you can print it in other
encodings using encode:

    print (str + char).encode("Latin1")

for example.

For your search engine you should look at server headers, meta tags, BOMs,
and guesswork, in roughly that order, to determine the encoding of the
source document. Convert it all to unicode (using the builtin function
unicode) and use that to build your indexes etc., then encode the results
on the way out in whatever encoding you need (probably UTF-8).

HTH.
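
In case it helps, here's a rough sketch of that "headers, meta tags, BOMs, guesswork" order. It's written for current Python 3 rather than the 2.4 in this thread, and the names sniff_encoding and to_unicode are made up for illustration, as are the regexes and the 2048-byte lookahead. The final fallback (try UTF-8, else Latin-1) is just one possible guess, not a proper detector.

```python
import codecs
import re

# BOM signatures, UTF-32 first so it is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Illustrative patterns only; real-world HTML is messier than this.
_HEADER_RE = re.compile(r"charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)
_META_RE = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?([\w.:-]+)", re.IGNORECASE)


def _known(name):
    """Return name if Python has a codec for it, else None."""
    try:
        codecs.lookup(name)
        return name
    except LookupError:
        return None


def sniff_encoding(data, content_type=None):
    """Guess the encoding of raw bytes (hypothetical helper)."""
    # 1. Server header, e.g. "text/html; charset=ISO-8859-1".
    if content_type:
        m = _HEADER_RE.search(content_type)
        if m and _known(m.group(1)):
            return m.group(1)
    # 2. A <meta ... charset=...> tag near the top of the document.
    m = _META_RE.search(data[:2048])
    if m:
        name = m.group(1).decode("ascii")
        if _known(name):
            return name
    # 3. Byte order mark.
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    # 4. Guesswork: valid UTF-8 is rarely an accident; Latin-1 never fails.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def to_unicode(data, content_type=None):
    """Decode raw bytes to text using the sniffed encoding."""
    return data.decode(sniff_encoding(data, content_type), errors="replace")
```

On Python 2.4 the decode step would be unicode(data, encoding, "replace") instead of data.decode(...). Either way the idea is the same: decode once at the edge, work in unicode internally, and encode only when writing results out.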