Hi Mike,

> I read an HTML document from a third-party site. It is supposed to be
> in UTF-8, but unfortunately from time to time it's not.

There will be a host of more lightweight solutions, but you can opt to sanitize incoming HTML with HTML Tidy (Python binding available). It will replace invalid UTF-8 bytes with U+FFFD. It will not guess a better encoding to use.

If you are sure you don't have HTML sloppiness to correct but only the occasional wrong byte, even decoding (with fallback) and re-encoding using the standard codecs module will do.

Regards,
Peter

-- 
http://mail.python.org/mailman/listinfo/python-list
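
P.S. A minimal sketch of the decode-and-re-encode route mentioned above, assuming Python 3 and that "fallback" means the built-in "replace" error handler. The function name scrub_utf8 and the sample bytes are made up for illustration. Like Tidy, it substitutes U+FFFD for bad bytes and does not try to guess a better encoding.

```python
def scrub_utf8(raw: bytes) -> bytes:
    """Return raw as valid UTF-8, with undecodable bytes replaced.

    scrub_utf8 is a hypothetical helper name, not part of any library.
    """
    # errors="replace" substitutes U+FFFD for each invalid byte sequence
    # instead of raising UnicodeDecodeError.
    text = raw.decode("utf-8", errors="replace")
    # Re-encoding cannot fail: the decoded text holds only valid code points.
    return text.encode("utf-8")


# Sample input: a stray Latin-1 byte (0xE9) next to a correctly encoded one.
page = b"<p>caf\xe9 and caf\xc3\xa9</p>"
cleaned = scrub_utf8(page)
```

If the stray bytes usually come from one known legacy encoding, you could instead catch UnicodeDecodeError and retry the decode with that encoding, but that is a guess about your data, not something the codec can tell you.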