On Wed, 23 Jan 2008 10:40:14 -0200, Alnilam <[EMAIL PROTECTED]> wrote:

> Skipping past html validation, and html to xhtml 'cleaning', and
> instead starting with the assumption that I have files that are valid
> XHTML, can anyone give me a good example of how I would use _ htmllib,
> HTMLParser, or ElementTree _ to parse out the text of one specific
> childNode, similar to the examples that I provided above using regex?

The diveintopython page is not valid XHTML (but it's valid HTML).
Assuming it's properly converted:

py> from cStringIO import StringIO
py> import xml.etree.ElementTree as ET
py> tree = ET.parse(StringIO(page))
py> elem = tree.findall('.//p')[4]
py>
py> # from the online ElementTree docs
py> # http://www.effbot.org/zone/element-bits-and-pieces.htm
py> def gettext(elem):
...     text = elem.text or ""
...     for e in elem:
...         text += gettext(e)
...         if e.tail:
...             text += e.tail
...     return text
...
py> print gettext(elem)
The complete text is available online. You can read the revision history to see what's new. Updated 20 May 2004

-- 
Gabriel Genellina
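
For anyone trying this on Python 3, here is a minimal self-contained sketch of the same approach. The `page` string is a made-up sample, not the diveintopython page, and it deliberately carries no XHTML namespace; `io.StringIO` stands in for `cStringIO`. It also compares the recursive `gettext` with `Element.itertext()`, which should give the same result.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample document (an assumption for illustration only).
page = """<html>
  <body>
    <p>First paragraph.</p>
    <p>The <b>complete</b> text is <a href="#">available <i>online</i></a>.</p>
  </body>
</html>"""


def gettext(elem):
    """Return the text of elem and all its descendants, in document order."""
    text = elem.text or ""
    for e in elem:
        text += gettext(e)      # text inside the child element
        if e.tail:
            text += e.tail      # text following the child's closing tag
    return text


tree = ET.parse(io.StringIO(page))
elem = tree.findall(".//p")[1]          # second <p> in the document

text = gettext(elem)
builtin = "".join(elem.itertext())      # standard-library equivalent

print(text)
```

If the file declares the XHTML namespace (xmlns="http://www.w3.org/1999/xhtml"), the tag has to be written in qualified form, e.g. './/{http://www.w3.org/1999/xhtml}p', otherwise findall returns an empty list.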