Steven Bethard wrote:
> Rob Wolfe wrote:
>> Steven Bethard <[EMAIL PROTECTED]> writes:
>>> I'd hate to steer a potential new Python developer to a clumsier
>> "clumsier"???
>> Try to parse this with your program:
>> page2 = '''
>>      <html><head><title>URLs</title></head>
>>      <body>
>>      <ul>
>>      <li><a href="http://domain1/page1">some page1</a></li>
>>      <li><a href="http://domain2/page2">some page2</a></li>
>>      </body></html>
>>      '''
> If you want to parse invalid HTML, I strongly encourage you to look into
> BeautifulSoup. Here's the updated code:
>     import ElementSoup
>     import cStringIO
>     tree = ElementSoup.parse(cStringIO.StringIO(page2))
>     for a_node in tree.getiterator('a'):
>         url = a_node.get('href')
>         if url is not None:
>             print url
>>> I know that the wiki page is supposed to be Python 2.4 only, but I'd
>>> rather have no example than an outdated one.
>> This example is by no means "outdated".
> Given the simplicity of the ElementSoup code above, I'd still contend
> that using HTMLParser here shows too complex an answer to too simple a
> problem.

Here's an lxml version:

  from lxml import etree as et
  html = et.HTML(page2)
  for href in html.xpath("//a/@href[string()]"):
      print href

Doesn't count as a 15-liner, though, even if you add the above HTML code to it.
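
For comparison, here is roughly what the HTMLParser route discussed above involves when you stick to the standard library. This is only a sketch, written in Python 3 syntax (where the module lives at html.parser rather than HTMLParser); the LinkCollector class and the extract_links helper are names made up for illustration and are not taken from the wiki page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect non-empty href values from <a> tags (illustrative name)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for
        # attributes written without a value, e.g. <a href>.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(markup):
    """Return the href values found in markup, in document order."""
    parser = LinkCollector()
    parser.feed(markup)
    parser.close()
    return parser.links


# The invalid page from the quoted message: the <ul> is never closed.
page2 = '''
<html><head><title>URLs</title></head>
<body>
<ul>
<li><a href="http://domain1/page1">some page1</a></li>
<li><a href="http://domain2/page2">some page2</a></li>
</body></html>
'''

for url in extract_links(page2):
    print(url)
```

Since this parser only reports start tags as it meets them and never builds a tree, the missing </ul> makes no difference to it. The price is the extra class and bookkeeping, which is the complexity being argued about above.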

