"Johnny Lee" <[EMAIL PROTECTED]> writes:
> Fredrik Lundh wrote:
[...]
> To the HTMLParser, there is another problem (take my code for example):
>
> import urllib
> import htmllib
> import formatter
>
> parser = htmllib.HTMLParser(formatter.NullFormatter())
> parser.feed(urllib.urlopen(baseUrl).read())
> parser.close()
> for url in parser.anchorlist:
>     if url[0:7] == "http://":
>         print url
>
> When baseUrl = "http://www.nba.com", an HTMLParseError is raised
> because of the line "<! Copyright IBM Corporation, 2001, 2002 !>".
> I found that this line is inside <script> tags; maybe that's the
> cause?
No, it's because they're using a broken HTML comment (it should be
"<!-- comment -->").  BeautifulSoup is more tolerant:
import urllib2
from BeautifulSoup import BeautifulSoup

bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
for el in bs.fetch('a'):
    print el['href']
Or you could pre-process the HTML using mxTidy, and carry on using
module htmllib.
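Roughly like this (untested, from memory -- I'm assuming mx.Tidy's
tidy() returns the cleaned markup in a (nerrors, nwarnings, output,
errors) tuple, so check the mxTidy docs for the exact signature):

import urllib
import htmllib, formatter
from mx.Tidy import tidy

raw = urllib.urlopen("http://www.nba.com/").read()
# Tidy repairs the broken "<! ... !>" comment before htmllib sees it.
nerrors, nwarnings, cleaned, errordata = tidy(raw)

parser = htmllib.HTMLParser(formatter.NullFormatter())
parser.feed(cleaned)
parser.close()
for url in parser.anchorlist:
    if url[:7] == "http://":
        print url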
Hmm, are you the same Johnny Lee who contributed the MSIE cookie
support to LWP?
John