On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <[EMAIL PROTECTED]> wrote:
>
> > Don't use old 8-bit encodings. Use UTF-8.
>
> Yes, I'll try. But is a problem when I only want to read, not that I'm trying
> to write or create the content.
> To blame I suppose is Microsoft's commercial success. They won't adhere to
> standards if that doesn't make sense for their business.
>
> I'll change the approach trying to filter the contents with htmllib and
> mapping on my own those troubling characters.
> Anyway this has been a very instructive dive into unicode for me, I've got
> things cleared up now.
>
> Thanks to everyone for the great help.
>
There are a number of byte values (150 being one of them) that cp1252 uses
for printable characters but that ISO-8859-1 reserves for control
characters. Those bytes will pretty much never appear in ISO-8859-1
documents. If you're expecting documents of both types coming in, test for
the presence of those bytes, and assume cp1252 for any document that
contains one. Something like:

    for c in control_chars:
        if c in encoded_text:
            unicode_text = encoded_text.decode('cp1252')
            break
    else:
        unicode_text = encoded_text.decode('latin-1')

Note that the else matches the for, not the if: it runs only when the loop
finishes without hitting the break.

You can figure out the characters to match on by looking at the Wikipedia
pages for the encodings.

Cheers,
Cliff

--
http://mail.python.org/mailman/listinfo/python-list
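
For anyone trying the detection idea above on Python 3, here is a minimal, self-contained sketch. It makes a few assumptions that are not in the original message: the helper name decode_legacy is made up for illustration, the input is a bytes object, the set of bytes to test is taken to be the whole 0x80-0x9F range, and errors="replace" is used because cp1252 leaves a handful of bytes in that range undefined.

```python
# Bytes 0x80-0x9F are C1 control characters in ISO-8859-1, but cp1252 uses
# most of them for punctuation (0x96, decimal 150, is an en dash).
# Assumption: treating the whole range as the "control_chars" to test for.
C1_RANGE = frozenset(range(0x80, 0xA0))


def decode_legacy(encoded_text: bytes) -> str:
    """Guess between cp1252 and latin-1 (hypothetical helper, not a library API)."""
    # Iterating over bytes yields ints, so a set intersection finds any
    # byte in the C1 range.
    if C1_RANGE.intersection(encoded_text):
        # cp1252 leaves 0x81, 0x8D, 0x8F, 0x90 and 0x9D undefined; a strict
        # decode would raise on them, so substitute U+FFFD instead.
        return encoded_text.decode("cp1252", errors="replace")
    return encoded_text.decode("latin-1")


if __name__ == "__main__":
    print(decode_legacy(b"1990\x962008"))  # contains byte 150, decoded as cp1252
    print(decode_legacy(b"caf\xe9"))       # no C1 bytes, decoded as latin-1
```

The set intersection replaces the for/else loop purely for brevity; the decision rule is the same. It remains a heuristic: a document with no bytes in that range decodes identically under both encodings, so the guess only matters when one is present.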