Stefan Behnel wrote:
> Steven Bethard wrote:
>> If you want to parse invalid HTML, I strongly encourage you to look into
>> BeautifulSoup. Here's the updated code:
>>
>> import ElementSoup  # http://effbot.org/zone/element-soup.htm
>> import cStringIO
>>
>> tree = ElementSoup.parse(cStringIO.StringIO(page2))
>> for a_node in tree.getiterator('a'):
>>     url = a_node.get('href')
>>     if url is not None:
>>         print url
>> [snip]
>
> Here's an lxml version:
>
> from lxml import etree as et  # http://codespeak.net/lxml
> html = et.HTML(page2)
> for href in html.xpath("//a/@href[string()]"):
>     print href
>
> Doesn't count as a 15-liner, though, even if you add the above HTML code
> to it.
Definitely better than the HTMLParser code. =)

Personally, I still prefer the xpath-less version, but that's only because
I can never remember what all the line noise characters in xpath mean. ;-)

STeVe
--
http://mail.python.org/mailman/listinfo/python-list
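[Editor's note: for readers curious what the "xpath-less" style above looks like without third-party packages, here is a minimal sketch using the stdlib's xml.etree.ElementTree, whose Element API both ElementSoup and lxml mirror. The `page2` document below is a made-up sample, since the thread never shows the real one, and ElementTree's parser only accepts well-formed markup; for invalid HTML you would still reach for BeautifulSoup/ElementSoup or lxml as the posters suggest. Note that `iter()` is the modern spelling of the `getiterator()` call used above.]

```python
import xml.etree.ElementTree as et

# Hypothetical stand-in for the thread's undefined `page2` variable.
page2 = """<html><body>
<a href="http://example.com/one">one</a>
<a>anchor without an href</a>
<a href="http://example.com/two">two</a>
</body></html>"""

tree = et.fromstring(page2)
urls = []
for a_node in tree.iter('a'):    # iter() replaces the old getiterator()
    url = a_node.get('href')     # returns None if the attribute is absent
    if url is not None:
        urls.append(url)
print(urls)
```

Same shape as the ElementSoup loop: walk every `a` element, skip the ones with no `href` attribute.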