Here's what I'm trying to do:
- scrape some HTML content from various sources

The issue I'm running into:
- some of the sources have incorrectly encoded characters, for example cp1252 curly quotes that were likely the result of the author copying and pasting content from Word

I've searched and read for many hours, but have not found a solution for handling the case where the page author does not use the character encoding that they have specified.

Things I have tried include encode()/decode() and replacement lookup tables (e.g. something like http://groups-beta.google.com/group/comp.lang.python/browse_thread/thread/116158ad706dc7c1/11991de6ced3406b?q=python+html+parser+cp1252&_done=%2Fgroups%3Fq%3Dpython+html+parser+cp1252%26qt_s%3DSearch+Groups%26&_doneTitle=Back+to+Search&&d#11991de6ced3406b ). However, I am still unable to convert the characters to something meaningful. In the case of the lookup table, this failed because all of the improperly encoded characters came back as ? rather than in their original encoding.

I'm using urllib and htmllib to open, read, and parse the HTML fragments, with Python 2.3 on OS X 10.3.

Any ideas or pointers would be greatly appreciated.

-Dylan Schiemann
http://www.dylanschiemann.com/
--
http://mail.python.org/mailman/listinfo/python-list
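
For illustration, here is a minimal sketch of one way the mismatch described above might be handled: decode the raw bytes with the declared encoding, and fall back to cp1252 only for the bytes that fail. It is written for Python 3 rather than the Python 2.3 mentioned in the post, so it is an assumption that it can be adapted. The names decode_html and cp1252_fallback are hypothetical, and the sketch assumes the raw bytes are still available, i.e. that nothing earlier in the pipeline has already replaced them with ?.

```python
import codecs


def _cp1252_fallback(err):
    """Error handler: reinterpret undecodable bytes as cp1252."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # cp1252 leaves a few byte values undefined; "replace" turns
    # those into U+FFFD instead of raising a second error.
    return bad.decode("cp1252", "replace"), err.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("cp1252_fallback", _cp1252_fallback)


def decode_html(raw, declared="utf-8"):
    """Decode raw page bytes, tolerating stray cp1252 characters.

    raw      -- the bytes exactly as read from the network
    declared -- the encoding the page claims to use
    """
    # Pages labelled ISO-8859-1 very often contain cp1252 bytes in
    # the 0x80-0x9F range, so treat that label as cp1252.
    if codecs.lookup(declared).name == "iso8859-1":
        declared = "cp1252"
    return raw.decode(declared, "cp1252_fallback")


if __name__ == "__main__":
    # Curly quotes pasted from Word into a page declared as Latin-1.
    print(decode_html(b"\x93quoted\x94", "iso-8859-1"))
```

The important design choice is to work on the undecoded bytes: once a lossy step has turned the offending characters into ?, a lookup table has nothing left to map.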