On May 22, 9:59 am, alex23 <[EMAIL PROTECTED]> wrote:
> On May 22, 6:22 pm, [EMAIL PROTECTED] wrote:
>
> > Still getting very odd errors though, this being the latest:
>
> > Traceback (most recent call last):
> >   File "spider.py", line 38, in <module>
> >     [...snip...]
> >     raise InvalidURL("nonnumeric port: '%s'" % host[i+1:])
> > httplib.InvalidURL: nonnumeric port: ''
>
> Okay. What I did was put some output in your Spider.parse method:
>
>     def parse(self, page):
>         try:
>             print 'http://' + page
>             self.feed(urlopen('http://' + page).read())
>         except HTTPError:
>             print 'Error getting page source'
>
> And here's the output:
>
>     >python spider.py
>     What site would you like to scan?http://www.google.com
>     http://www.google.com
>     http://http://images.google.com.au/imghp?hl=en&tab=wi
>
> The links you're finding on each page already have the protocol
> specified. I'd remove the 'http://' addition from parse, and just add
> it to 'site' in the main section.
>
>     if __name__ == '__main__':
>         s = Spider()
>         site = raw_input("What site would you like to scan? http://")
>         site = 'http://' + site
>         s.crawl(site)
>
> > Also could you explain why I needed to add that
> > HTMLParser.__init__(self) line? Does it matter that I have overwritten
> > the __init__ function of spider?
>
> You haven't overwritten Spider.__init__. What you're doing every time
> you create a Spider object is first get HTMLParser to initialise it as
> it would any other HTMLParser object - which is what adds the .rawdata
> attribute to each HTMLParser instance - *and then* doing the Spider-
> specific initialisation you need.
>
> Here's an abbreviated copy of the actual HTMLParser class featuring
> only its __init__ and reset methods:
>
>     class HTMLParser(markupbase.ParserBase):
>         def __init__(self):
>             """Initialize and reset this instance."""
>             self.reset()
>
>         def reset(self):
>             """Reset this instance. Loses all unprocessed data."""
>             self.rawdata = ''
>             self.lasttag = '???'
>             self.interesting = interesting_normal
>             markupbase.ParserBase.reset(self)
>
> When you initialise an instance of HTMLParser, it calls its reset
> method, which sets rawdata to an empty string, or adds it to the
> instance if it doesn't already exist. So when you call
> HTMLParser.__init__(self) in Spider.__init__(), it executes the reset
> method on the Spider instance, which it inherits from HTMLParser...
>
> Are you familiar with object oriented design at all? If you're not,
> let me know and I'll track down some decent intro docs. Inheritance is
> a pretty fundamental concept but I don't think I'm doing it justice.
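
To make the quoted explanation concrete, here is a minimal sketch of the same pattern. It assumes Python 3, where the class lives in html.parser rather than the Python 2 HTMLParser module used in the thread. The links attribute, the handle_starttag body and the BrokenSpider class are hypothetical, added only for illustration; they are not the original spider.py.

```python
from html.parser import HTMLParser


class Spider(HTMLParser):
    """Hypothetical stand-in for the Spider class discussed above."""

    def __init__(self):
        # Let the parent class set up its own state first. Its __init__
        # calls reset(), which creates attributes such as rawdata.
        HTMLParser.__init__(self)
        # Then do the Spider-specific initialisation.
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Collect the href of every <a> tag fed to the parser.
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.links.append(value)


class BrokenSpider(HTMLParser):
    """Same idea, but the parent __init__ is never called."""

    def __init__(self):
        self.links = []


spider = Spider()
spider.feed('<a href="http://images.google.com.au/imghp">Images</a>')
print(spider.links)

broken = BrokenSpider()
# The first instance has rawdata because HTMLParser.__init__ ran;
# the second one does not, so feeding it data would fail.
print(hasattr(spider, 'rawdata'), hasattr(broken, 'rawdata'))
```

Writing super().__init__() instead of HTMLParser.__init__(self) should do the same job here; the explicit form is kept to match the line discussed in the thread.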
Nope, this is my first experience with object-oriented programming. I've only been learning Python for a few weeks, but it seemed simple enough to inspire me to be a bit ambitious. If you could hook me up with some good docs, that would be great. I was about to buy a book on Python, specifically an OO-based one, but I'll look at these docs first. I understand most of the concepts of inheritance; I've just never used them before.

Thanks
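
On the doubled 'http://' in the quoted output: one way to avoid it is to add the scheme only when it is missing, and to resolve each link found on a page against that page's URL. The sketch below assumes Python 3's urllib.parse; the helper names with_scheme and absolute_url are hypothetical, not part of spider.py, and the '/intl/en/about.html' path is a made-up example.

```python
from urllib.parse import urljoin


def with_scheme(site):
    """Prepend 'http://' only when the user did not type a scheme."""
    if site.startswith(('http://', 'https://')):
        return site
    return 'http://' + site


def absolute_url(base, link):
    """Return link as an absolute URL, resolved against base if needed."""
    # urljoin leaves a link that already has a scheme and host untouched,
    # and resolves a relative link against the page it was found on.
    return urljoin(base, link)


start = with_scheme('www.google.com')
print(start)
print(absolute_url(start, 'http://images.google.com.au/imghp?hl=en&tab=wi'))
print(absolute_url(start, '/intl/en/about.html'))
```

With something like this, parse could be handed complete URLs and would not need to prepend anything itself.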