"Johnny Lee" <[EMAIL PROTECTED]> writes: > Fredrik Lundh wrote: [...] > To the HTMLParser, there is another problem (take my code for example): > > import urllib > import formatter > parser = htmllib.HTMLParser(formatter.NullFormatter()) > parser.feed(urllib.urlopen(baseUrl).read()) > parser.close() > for url in parser.anchorlist: > if url[0:7] == "http://": > print url > > when the baseUrl="http://www.nba.com", there will raise an > HTMLParseError because of a line of code "<! Copyright IBM Corporation, > 2001, 2002 !>". I found that this line of code is inside <script> tags, > maybe it's because of this?
No, it's because they're using a broken HTML comment (it should be
"<!--comment-->").  BeautifulSoup is more tolerant:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
    for el in bs.fetch('a'):
        print el['href']

Or you could pre-process the HTML using mxTidy and carry on using module
htmllib (a rough sketch of that route is at the end of this message).

Hmm, are you the same Johnny Lee who contributed the MSIE cookie support
to LWP?


John
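
The mxTidy route, in rough outline.  This is only a sketch: the
"from mx import Tidy" path and the (nerrors, nwarnings, output, errors)
tuple returned by Tidy.tidy() are from memory, so check the eGenix
documentation before relying on them; baseUrl is the same placeholder
as in your code.

    import urllib
    import htmllib
    import formatter
    from mx import Tidy   # eGenix mx.Tidy; import path assumed, see above

    baseUrl = "http://www.nba.com/"

    # Run the raw page through Tidy first; it rewrites or drops bogus
    # constructs like "<! Copyright ... !>", so htmllib only ever sees
    # well-formed markup.
    raw = urllib.urlopen(baseUrl).read()
    nerrors, nwarnings, cleaned, errordata = Tidy.tidy(raw)

    # Then parse exactly as before.
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(cleaned)
    parser.close()

    for url in parser.anchorlist:
        if url[0:7] == "http://":
            print url

The point of doing it this way rather than switching parsers is that the
rest of your htmllib-based code stays untouched.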