--- Steven Bethard <[EMAIL PROTECTED]> wrote: > Stefan Behnel wrote: > > Steven Bethard wrote: > >> If you want to parse invalid HTML, I strongly > encourage you to look into > >> BeautifulSoup. Here's the updated code: > >> > >> import ElementSoup # > http://effbot.org/zone/element-soup.htm > >> import cStringIO > >> > >> tree = > ElementSoup.parse(cStringIO.StringIO(page2)) > >> for a_node in tree.getiterator('a'): > >> url = a_node.get('href') > >> if url is not None: > >> print url > >> > [snip] > > > > Here's an lxml version: > > > > from lxml import etree as et # > http://codespeak.net/lxml > > html = et.HTML(page2) > > for href in html.xpath("//a/@href[string()]"): > > print href > > > > Doesn't count as a 15-liner, though, even if you > add the above HTML code to it. > > Definitely better than the HTMLParser code. =) > Personally, I still > prefer the xpath-less version, but that's only > because I can never > remember what all the line noise characters in xpath > mean. ;-) >
I think there might be other people who will balk at the xpath syntax, simply due to their unfamiliarity with it. And, on the other hand, if you really like the xpath syntax, then the program really becomes more of an endorsement for xpath's clean syntax than for Python's. To the extent that Python enables you to implement an xpath solution cleanly, that's great, but then you have the problem that lxml is not batteries included. I do hope we can find something to put on the page, but I'm the wrong person to decide on it, since I don't really do any rigorous HTML screen scraping in my current coding. (I still think it's a common use case, even though I don't do it myself.) I suggested earlier that maybe we post multiple solutions. That makes me a little nervous, to the extent that it shows that the Python community has a hard time coming to consensus on tools sometimes. This is not a completely unfair knock on Python, although I think the reason multiple solutions tend to emerge for this type of thing is precisely due to the simplicity and power of the language itself. So I don't know. What about trying to agree on an XML parsing example instead? Thoughts? ____________________________________________________________________________________ Pinpoint customers who are looking for what you sell. http://searchmarketing.yahoo.com/ -- http://mail.python.org/mailman/listinfo/python-list