Robin Haswell wrote:
> Okay I'm getting really frustrated with Python's Unicode handling, I'm
> trying everything I can think of an I can't escape Unicode(En|De)codeError
> no matter what I try.

If you follow a few relatively simple rules, the days of Unicode errors will be over. Let's take a look!

> Could someone explain to me what I'm doing wrong here, so I can hope to
> throw light on the myriad of similar problems I'm having? Thanks :-)
>
> Python 2.4.1 (#2, May 6 2005, 11:22:24)
> [GCC 3.3.6 (Debian 1:3.3.6-2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import sys
> >>> sys.getdefaultencoding()
> 'utf-8'

Note that this only specifies the encoding that Python assumes for plain strings when such strings are used to create Unicode objects. For some applications this is sufficient, but where you may be dealing with many different character sets (or encodings), having a default encoding will not be sufficient. This has an impact below and in your wider problem.

> >>> import htmlentitydefs
> >>> char = htmlentitydefs.entitydefs["copy"] # this is an HTML © - a copyright symbol
> >>> print char
> ©

It's better here to use repr(char) to see exactly what kind of object it is (or just give the name of the variable at the prompt). For me, it's a plain string, despite htmlentitydefs defining each name in terms of its "Unicode codepoint". Moreover, for me the plain string uses the "Latin-1" (or more correctly iso-8859-1) character set, and I imagine that you get the same result.

> >>> str = u"Apple"
> >>> print str
> Apple
> >>> str + char
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0:
> unexpected code byte

Here, Python attempts to make a Unicode object from char, using the default encoding (which is utf-8), and finds that char is a plain string containing a value that is not valid utf-8, specifically a single iso-8859-1 character value. It consequently complains.
This is quite unfortunate since you probably expected Python to give you the entity definition either as a Unicode object or a plain string of your chosen encoding. Having never used htmlentitydefs before, I can only imagine that it provides plain strings containing iso-8859-1 values in order to support "legacy" HTML processing (given that semi-modern HTML favours &#xx; entities, and XHTML uses genuine character sequences in the stated encoding), and that getting anything other than such strings might not be particularly useful.

Anyway, what you'd do here is this:

  str + unicode(char, "iso-8859-1")

Rule #1: if you have plain strings and you want them as Unicode, you must somewhere state what encoding those strings are in, preferably as you convert them to Unicode objects.

Here, we can't rely on the default encoding being correct and must explicitly state a different encoding. Generally, stating the encoding is the right thing to do, rather than assuming some default setting that may differ across environments. Somehow, my default encoding is "ascii" not "utf-8", so your code would fail on my system by relying on the default encoding.

[...]

> Basically my app is a search engine - I'm grabbing content from pages
> using HTMLParser and storing it in a database but I'm running in to these
> problems all over the shop (from decoding the entities to calling
> str.lower()) - I don't know what encoding my pages are coming in as, I'm
> just happy enough to accept that they're either UTF-8 or latin-1 with
> entities.

Rule #2: get your content as Unicode as soon as possible, then work with it in Unicode.

Once you've made your content Unicode, you shouldn't get UnicodeDecodeError all over the place, and the only time you then risk a UnicodeEncodeError is when you convert your content back to plain strings, typically for serialisation purposes.

Rule #3: get acquainted with what kind of encodings apply to the incoming data.
If you are prepared to assume that the data is either utf-8 or iso-8859-1, first try making Unicode objects from the data stating that utf-8 is the encoding employed, and only if that fails should you consider it as iso-8859-1. A utf-8 string can quite happily be interpreted (incorrectly) as a bunch of iso-8859-1 characters but not vice versa; thus, you have a primitive means of validation.

> Any help would be great, I just hope that I have a brainwave over the
> weekend because I've lost two days to Unicode errors now. It's even worse
> that I've written the same app in PHP before with none of these problems -
> and PHP4 doesn't even support Unicode.

Perhaps that's why you never saw any such problems, but have you looked at the quality of your data?

Paul
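
To make rules #1 to #3 concrete, here is a minimal sketch rather than a definitive implementation. The helper name to_unicode and the sample byte values are made up for illustration. It is spelled with data.decode(encoding) and b"..." literals so that it also runs on later Python versions, where a plain string is a bytes object; on Python 2.4 the equivalent call would be unicode(data, encoding), with ordinary string literals.

```python
# -*- coding: utf-8 -*-

def to_unicode(data, encodings=("utf-8", "iso-8859-1")):
    """Decode a plain (byte) string, trying each candidate encoding in order.

    Hypothetical helper: utf-8 is tried first because it acts as a
    primitive validity check; iso-8859-1 is the fallback.
    """
    for encoding in encodings:
        try:
            # Rule #1: always state the encoding explicitly.
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("data is not valid in any of: %r" % (encodings,))


# Rule #3: the same copyright sign arriving in two different encodings.
copy_utf8 = to_unicode(b"\xc2\xa9")    # valid utf-8, decoded as such
copy_latin1 = to_unicode(b"\xa9")      # invalid utf-8, falls back to iso-8859-1

# Rule #2: from here on, work only with Unicode objects.
text = u"Apple" + copy_latin1
lowered = text.lower()

# Convert back to a plain string only at the boundary (e.g. for storage).
stored = text.encode("utf-8")
```

The order of the candidates matters: in Python the iso-8859-1 codec accepts every possible byte value, so if it were tried first the utf-8 branch would never be reached and utf-8 input would be silently misread.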