On Fri, 2008-04-18 at 10:28 +0100, [EMAIL PROTECTED] wrote:
> On Thu, 17 Apr 2008 20:57:21 -0700 (PDT)
> hdante <[EMAIL PROTECTED]> wrote:
> 
> >  Don't use old 8-bit encodings. Use UTF-8.
> 
> Yes, I'll try. But it's a problem when I only want to read the content, 
> not when I'm trying to write or create it.
> To blame, I suppose, is Microsoft's commercial success. They won't adhere to 
> standards if that doesn't make sense for their business.
> 
> I'll change my approach and try to filter the contents with htmllib, 
> mapping those troubling characters on my own.
> Anyway, this has been a very instructive dive into Unicode for me; I've got 
> things cleared up now.
> 
> Thanks to everyone for the great help.
> 

There are a number of byte values (150, i.e. 0x96, being one of them) that
cp1252 uses for printable characters but that ISO-8859-1 reserves for C1
control characters. Those control characters will pretty much never appear
in real ISO-8859-1 documents. So if you're expecting documents of both types
coming in, test for the presence of those bytes and assume cp1252 for any
document that contains them.

Something like:

# Bytes 0x80-0x9F are C1 controls in ISO-8859-1 but printable characters
# in cp1252 (the five bytes cp1252 leaves unassigned are skipped here).
control_chars = [chr(i) for i in range(0x80, 0xA0)
                 if i not in (0x81, 0x8D, 0x8F, 0x90, 0x9D)]

for c in control_chars:
    if c in encoded_text:
        unicode_text = encoded_text.decode('cp1252')
        break
else:
    unicode_text = encoded_text.decode('latin-1')

Note that the else matches the for, not the if.
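
Wrapped up as a reusable function, the same idea looks roughly like this
(just a sketch; guess_decode() and the sample strings are my own illustrative
names, in the same Python 2 style as the snippet above):

CONTROL_CHARS = [chr(i) for i in range(0x80, 0xA0)
                 if i not in (0x81, 0x8D, 0x8F, 0x90, 0x9D)]

def guess_decode(encoded_text):
    # Decode as cp1252 if any of the tell-tale bytes are present,
    # otherwise fall back to latin-1.
    for c in CONTROL_CHARS:
        if c in encoded_text:
            return encoded_text.decode('cp1252')
    return encoded_text.decode('latin-1')

print repr(guess_decode('up\x96to\x96date'))   # u'up\u2013to\u2013date'
print repr(guess_decode('caf\xe9'))            # u'caf\xe9'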

You can figure out which characters to match on by looking at the
Wikipedia pages for the two encodings.
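
If you'd rather not transcribe the tables by hand, here is a rough sketch
that derives the list from the cp1252 codec itself, using only the standard
library (the variable names are just illustrative):

import unicodedata

discriminating = []
for i in range(0x80, 0xA0):
    try:
        u = chr(i).decode('cp1252')
    except UnicodeDecodeError:
        # 0x81, 0x8D, 0x8F, 0x90 and 0x9D are unassigned in cp1252
        continue
    if unicodedata.category(u) != 'Cc':   # ignore anything still a control code
        discriminating.append(chr(i))

print ' '.join('%02X' % ord(c) for c in discriminating)

The resulting list can be dropped in as control_chars in the loop above.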

Cheers,
Cliff

