"Johnny Lee" <[EMAIL PROTECTED]> writes: > Fredrik Lundh wrote: [...] > To the HTMLParser, there is another problem (take my code for example): > > import urllib > import formatter > parser = htmllib.HTMLParser(formatter.NullFormatter()) > parser.feed(urllib.urlopen(baseUrl).read()) > parser.close() > for url in parser.anchorlist: > if url[0:7] == "http://": > print url > > when the baseUrl="http://www.nba.com", there will raise an > HTMLParseError because of a line of code "<! Copyright IBM Corporation, > 2001, 2002 !>". I found that this line of code is inside <script> tags, > maybe it's because of this?
No, it's because they're using a broken HTML comment (it should be
"<!--comment-->").  BeautifulSoup is more tolerant:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    bs = BeautifulSoup(urllib2.urlopen('http://www.nba.com/').read())
    for el in bs.fetch('a'):
        print el['href']

Or you could pre-process the HTML using mxTidy and carry on using module
htmllib (a rough sketch of that route is at the end of this message).

Hmm, are you the same Johnny Lee who contributed the MSIE cookie support
to LWP?


John
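
The mxTidy route, in rough outline.  This is only a sketch: the
"from mx import Tidy" path and the (nerrors, nwarnings, output, errors)
tuple returned by Tidy.tidy() are from memory, so check the eGenix
documentation before relying on them; baseUrl is the same placeholder
as in your code.

    import urllib
    import htmllib
    import formatter
    from mx import Tidy   # eGenix mx.Tidy; import path assumed, see above

    baseUrl = "http://www.nba.com/"

    # Run the raw page through Tidy first; it rewrites or drops bogus
    # constructs like "<! Copyright ... !>", so htmllib only ever sees
    # well-formed markup.
    raw = urllib.urlopen(baseUrl).read()
    nerrors, nwarnings, cleaned, errordata = Tidy.tidy(raw)

    # Then parse exactly as before.
    parser = htmllib.HTMLParser(formatter.NullFormatter())
    parser.feed(cleaned)
    parser.close()

    for url in parser.anchorlist:
        if url[0:7] == "http://":
            print url

The point of doing it this way rather than switching parsers is that the
rest of your htmllib-based code stays untouched.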