On May 22, 6:22 pm, [EMAIL PROTECTED] wrote:
> Still getting very odd errors though, this being the latest:
>
> Traceback (most recent call last):
>   File "spider.py", line 38, in <module>
> [...snip...]
>     raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
> httplib.InvalidURL: nonnumeric port: ''

Okay. What I did was put some output in your Spider.parse method:

    def parse(self, page):
        try:
            print 'http://' + page
            self.feed(urlopen('http://' + page).read())
        except HTTPError:
            print 'Error getting page source'

And here's the output:

    >python spider.py
    What site would you like to scan? http://www.google.com
    http://www.google.com
    http://http://images.google.com.au/imghp?hl=en&tab=wi

The links you're finding on each page already have the protocol
specified, so parse ends up requesting 'http://http://...'. httplib
then takes 'http:' as the host, finds nothing after the colon, and
raises the "nonnumeric port: ''" error. I'd remove the 'http://'
addition from parse, and just add it to 'site' in the main section:

    if __name__ == '__main__':
        s = Spider()
        site = raw_input("What site would you like to scan? http://")
        site = 'http://' + site
        s.crawl(site)

> Also could you explain why I needed to add that
> HTMLParser.__init__(self) line? Does it matter that I have overwritten
> the __init__ function of spider?

You haven't lost HTMLParser.__init__ by defining Spider.__init__;
you've overridden it, which just means Python won't call it for you.
What you're doing every time you create a Spider object is first
getting HTMLParser to initialise it as it would any other HTMLParser
object - which is what adds the .rawdata attribute to each HTMLParser
instance - *and then* doing the Spider-specific initialisation you
need.

Here's an abbreviated copy of the actual HTMLParser class featuring
only its __init__ and reset methods:

    class HTMLParser(markupbase.ParserBase):

        def __init__(self):
            """Initialize and reset this instance."""
            self.reset()

        def reset(self):
            """Reset this instance.  Loses all unprocessed data."""
            self.rawdata = ''
            self.lasttag = '???'
            self.interesting = interesting_normal
            markupbase.ParserBase.reset(self)

When you initialise an instance of HTMLParser, its __init__ calls the
reset method, which sets rawdata to an empty string, creating the
attribute on the instance if it doesn't already exist. So when you
call HTMLParser.__init__(self) in Spider.__init__(), it executes the
reset method on the Spider instance, which it inherits from
HTMLParser...

Are you familiar with object oriented design at all?
If you're not, let me know and I'll track down some decent intro
docs. Inheritance is a pretty fundamental concept but I don't think
I'm doing it justice.
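
In case a concrete run helps, here's a minimal sketch of the difference that HTMLParser.__init__(self) line makes. It is not your spider.py: BrokenSpider and WorkingSpider are names I've made up for illustration, and the link-collecting handler is only a guess at the sort of thing your Spider does. I've used the print() form and a guarded import so it should run on either Python 2 or 3.

```python
try:
    from html.parser import HTMLParser   # Python 3
except ImportError:
    from HTMLParser import HTMLParser    # Python 2


class BrokenSpider(HTMLParser):
    """Hypothetical subclass that forgets to call the base initialiser."""

    def __init__(self):
        self.links = []                  # Spider-specific state only


class WorkingSpider(HTMLParser):
    """Hypothetical subclass that initialises the HTMLParser part first."""

    def __init__(self):
        HTMLParser.__init__(self)        # runs reset(), which sets rawdata
        self.links = []                  # then the Spider-specific state

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag fed to the parser.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)


broken = BrokenSpider()
working = WorkingSpider()

# reset() never ran for broken, so the attribute was never created.
print(hasattr(broken, 'rawdata'))
print(hasattr(working, 'rawdata'))

try:
    broken.feed('<a href="http://www.python.org/">Python</a>')
except AttributeError as err:
    print('BrokenSpider failed: %s' % err)

working.feed('<a href="http://www.python.org/">Python</a>')
print(working.links)
```

The only difference between the two classes' __init__ methods is that one line, and feed() relies on the attributes that reset() creates.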