Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <[EMAIL PROTECTED]> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file. I want to retrieve all of the text
>>> inside a certain tag that I find with XPath. The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.
>> import lxml.html as h
>> tree = h.parse("somefile.html")
>> text = tree.xpath("string( some/[EMAIL PROTECTED] )")
>>
>> http://codespeak.net/lxml
>>
>> Stefan
>
> I actually had trouble getting this to work. I guess only new version
> of lxml have the html module, and I couldn't get it installed. lxml
> does look pretty cool, though.

Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

    import lxml.etree as et
    parser = et.HTMLParser()
    tree = et.parse("somefile.html", parser)
    text = tree.xpath("string( some/[EMAIL PROTECTED] )")

lxml.html is just a dedicated package that makes HTML handling more
convenient. You don't need it to parse HTML or to use the general XML API
on the resulting tree.

Stefan