encoding in lxml

jasiu85 Mon, 03 Nov 2008 03:45:42 -0800

Hey,

I have a problem with character encoding in lxml. Here's how it goes:


I read an HTML document from a third-party site. It is supposed to be
in UTF-8, but unfortunately from time to time it's not. I parse the
document like this:

from lxml.etree import HTML
html_doc = HTML(string_with_document)

Then I retrieve some info from the document with XPath:

xpath_nodes = html_doc.xpath('/html/body/something')

Now I'm guaranteed that the xpath_nodes list contains only one
element. So I read its content:

xpath_nodes[0].text

And I get an exception here. The exception comes from the text
property of the Element object. The problem is that the text contains a
non-UTF-8 character. lxml seems to be using strict decoding and I can't
find a way to make it ignore the error. Is there anything I can do to
retrieve the text without getting an exception?
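
To make the question concrete, here is a minimal sketch of the kind of
workaround I have in mind. It is not something taken from the lxml docs.
It assumes Python 3, that the document arrives as raw bytes, and that
lxml will parse an already-decoded string. The helper names
decode_leniently and first_text are made up for illustration.

```python
def decode_leniently(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, replacing undecodable sequences with U+FFFD
    instead of raising UnicodeDecodeError."""
    return raw.decode(encoding, errors="replace")


def first_text(raw: bytes, path: str):
    """Parse the leniently decoded HTML and return the text of the
    first node matching the XPath expression, or None if nothing matches."""
    # Third-party dependency; imported here so decode_leniently stays
    # usable without lxml installed.
    from lxml import etree

    # Assumption: handing lxml a str means it no longer has to decode
    # the (possibly broken) bytes itself.
    doc = etree.HTML(decode_leniently(raw))
    nodes = doc.xpath(path)
    return nodes[0].text if nodes else None


if __name__ == "__main__":
    # b"\xe9" is Latin-1 for "e with acute" and is not valid UTF-8 here.
    raw = b"<html><body><something>caf\xe9</something></body></html>"
    print(first_text(raw, "/html/body/something"))
```

The idea is to move the lenient decoding out of lxml and into
bytes.decode, where errors="replace" is available. The price is that
the bad bytes end up as replacement characters in the text.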

Regards,

Mike
--
http://mail.python.org/mailman/listinfo/python-list
