Hey, I have a problem with character encoding in LXML. Here's how it goes:
I read an HTML document from a third-party site. It is supposed to be in UTF-8, but unfortunately, from time to time it is not. I parse the document like this:

    html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

    xpath_nodes = html_doc.xpath('/html/body/something')

The xpath_nodes list is guaranteed to contain exactly one element, so I read its content:

    xpath_nodes[0].text

This is where I get an exception. It comes from the text property of the Element object. The problem is that the text contains a non-UTF-8 character. LXML seems to use strict decoding, and I can't find a way to make it ignore the error.

Is there anything I can do to retrieve the text without getting an exception?

Regards,
Mike
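
One possible workaround, sketched under assumptions rather than offered as the definitive fix: if the document is available as raw bytes before parsing, it can be decoded leniently with Python's own codec machinery, and the resulting text string handed to lxml, so that lxml never sees the invalid bytes. The helper name decode_lenient and the sample bytes below are hypothetical and only for illustration. The lxml part is guarded because it assumes lxml is installed.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Decode bytes, replacing undecodable sequences with U+FFFD.

    Hypothetical helper: errors="replace" keeps a visible marker where
    the bad bytes were; errors="ignore" would drop them silently.
    """
    return raw.decode(encoding, errors="replace")


# Made-up sample: 0xE9 is Latin-1 for an accented "e", not valid UTF-8 here.
raw = b"<html><body><something>caf\xe9 ok</something></body></html>"
text = decode_lenient(raw)

try:
    from lxml import html  # assumption: lxml is installed
except ImportError:
    html = None

if html is not None:
    # Parse the already-decoded string instead of the raw bytes.
    doc = html.fromstring(text)
    nodes = doc.xpath("/html/body/something")
    if nodes:
        print(nodes[0].text)
```

The trade-off is that the damaged characters are lost: each invalid sequence becomes a replacement character instead of raising an exception. Note also that lxml may refuse an already-decoded string if the document begins with an XML declaration that names an encoding, in which case the declaration would have to be stripped first.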