Re: Parsing HTML?

Stefan Behnel Sat, 26 Apr 2008 15:03:26 -0700

Benjamin wrote:
> On Apr 6, 11:03 pm, Stefan Behnel <[EMAIL PROTECTED]> wrote:
>> Benjamin wrote:
>>> I'm trying to parse an HTML file.  I want to retrieve all of the text
>>> inside a certain tag that I find with XPath.  The DOM seems to make
>>> this available with the innerHTML element, but I haven't found a way
>>> to do it in Python.
>>     import lxml.html as h
>>     tree = h.parse("somefile.html")
>>     text = tree.xpath("string( some/[EMAIL PROTECTED] )")
>>
>> http://codespeak.net/lxml
>>
>> Stefan
> 
> I actually had trouble getting this to work.  I guess only new versions
> of lxml have the html module, and I couldn't get it installed.  lxml
> does look pretty cool, though.


Yes, the above code requires lxml 2.x. However, older versions should allow
you to do this:

     import lxml.etree as et
     parser = et.HTMLParser()   # HTML parser that ships with lxml.etree
     tree = et.parse("somefile.html", parser)
     text = tree.xpath("string( some/[EMAIL PROTECTED] )")

lxml.html is just a dedicated package that makes HTML handling beautiful. It's
not required for parsing HTML or for doing general XML stuff with the parsed
tree.

Stefan
--
http://mail.python.org/mailman/listinfo/python-list
