Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <[EMAIL PROTECTED]> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>>
>> "clumsier"???
>> Try to parse this with your program:
>>
>> page2 = '''
>> <html><head><title>URLs</title></head>
>> <body>
>> <ul>
>> <li><a href="http://domain1/page1">some page1</a></li>
>> <li><a href="http://domain2/page2">some page2</a></li>
>> </body></html>
>> '''
>
> If you want to parse invalid HTML, I strongly encourage you to look into
> BeautifulSoup. Here's the updated code:
>
>     import ElementSoup # http://effbot.org/zone/element-soup.htm
>     import cStringIO
>
>     tree = ElementSoup.parse(cStringIO.StringIO(page2))
>     for a_node in tree.getiterator('a'):
>         url = a_node.get('href')
>         if url is not None:
>             print url

I should also have pointed out that the above ElementSoup code can also
parse the following text::

    <html><head><title>URLs</title></head>
    <body>
    <ul>
    <li<a href="http://domain1/page1">some page1</a></li>
    <li><a href="http://domain2/page2">some page2</a></li>
    </body></html>

where the HTMLParser code raises an HTMLParseError.

STeVe
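
For anyone reading this on Python 3, where cStringIO and the print statement are gone, here is a rough standard-library sketch of the same link extraction. It is not the ElementSoup approach from this thread: it swaps in the event-based html.parser module, which does not build a tree and so does not care that the <ul> in page2 is never closed. The class name LinkCollector is my own, purely for illustration. I have only run it against page2 as quoted above; how html.parser tokenizes the ``<li<a`` variant may differ, so I make no claim about that case.

```python
from html.parser import HTMLParser

# page2 as quoted earlier in the thread (note the unclosed <ul>)
page2 = '''
<html><head><title>URLs</title></head>
<body>
<ul>
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
</body></html>
'''


class LinkCollector(HTMLParser):  # hypothetical helper name, not from the thread
    """Collect the href of every <a> start tag as the parser sees it."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names arrive lowercased
        if tag == 'a':
            href = dict(attrs).get('href')
            if href is not None:
                self.urls.append(href)


collector = LinkCollector()
collector.feed(page2)
collector.close()
for url in collector.urls:
    print(url)
```

Because the parser only reports start tags as it encounters them, missing end tags are simply never noticed. If you need an actual tree to navigate, a tree-building parser such as BeautifulSoup is still the better fit.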