Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <[EMAIL PROTECTED]> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>>
>> "clumsier"???
>> Try to parse this with your program:
>>
>> page2 = '''
>> <html><head><title>URLs</title></head>
>> <body>
>> <ul>
>> <li><a href="http://domain1/page1">some page1</a></li>
>> <li><a href="http://domain2/page2">some page2</a></li>
>> </body></html>
>> '''
>
> If you want to parse invalid HTML, I strongly encourage you to look into
> BeautifulSoup. Here's the updated code:
>
>     import ElementSoup # http://effbot.org/zone/element-soup.htm
>     import cStringIO
>
>     tree = ElementSoup.parse(cStringIO.StringIO(page2))
>     for a_node in tree.getiterator('a'):
>         url = a_node.get('href')
>         if url is not None:
>             print url
>
>>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>>> rather have no example than an outdated one.
>>
>> This example is by no means "outdated".
>
> Given the simplicity of the ElementSoup code above, I'd still contend
> that using HTMLParser here shows too complex an answer to too simple a
> problem.

Here's an lxml version:

    from lxml import etree as et   # http://codespeak.net/lxml

    html = et.HTML(page2)
    for href in html.xpath("//a/@href[string()]"):
        print href

Doesn't count as a 15-liner, though, even if you add the above HTML
code to it.

Stefan
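
For comparison, here is a rough sketch of what the HTMLParser approach debated above might look like with only the standard library. It is written for Python 3 (the code in this thread is Python 2), and the names LinkCollector and extract_links are made up for illustration; they are not from the wiki example under discussion. It collects every non-empty href on an <a> tag from the same page2 string, including its unclosed <ul>.

```python
from html.parser import HTMLParser

# Same sample document as quoted above; note the <ul> is never closed.
page2 = '''
<html><head><title>URLs</title></head>
<body>
<ul>
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
</body></html>
'''


class LinkCollector(HTMLParser):
    """Hypothetical helper: gathers href values from <a> start tags."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_links(html_text):
    """Feed the text through LinkCollector and return the hrefs in document order."""
    parser = LinkCollector()
    parser.feed(html_text)
    parser.close()
    return parser.urls


if __name__ == "__main__":
    for url in extract_links(page2):
        print(url)
```

This is noticeably longer than the ElementSoup and lxml loops, which is the trade-off the thread is arguing about; what it buys is that nothing beyond the standard library is needed.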