I am reading "Python for Dummies" and found the following example of a web crawler that I thought was interesting. The first time I typed in the program and executed it, I didn't understand it well enough to debug it, so I just skipped it. A few days later I realized that it had failed after only a few seconds, and I wanted to know whether that was a shortcoming of Python, a typo on my part, or an inherent problem with the script, so I retyped it and started trying to figure out what went wrong.
Please keep in mind that I am very new to coding, so I have tried RTFM without much success. I have a basic understanding of what the application is doing, but I want to understand WHY it is doing it, or what the rationale is for doing it, not necessarily how it does it. In any case, here is the gist of the app:

1 - a new spider is created
2 - it takes a single argument, which is a web address (http://www.google.com)
3 - the spider pulls a copy of the page source
4 - the spider parses it for links, and if a link is on the same domain and has not already been parsed, it appends the link to the list of pages to be parsed

Being new, I have a couple of questions that I am hoping someone can answer with some degree of detail.

----------------------------------------------------------
f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed(html)
parser.close()
return parser.anchorlist
----------------------------------------------------------

I get the idea that we're allocating some memory that looks like a file so formatter.DumbWriter can manipulate it. The result is passed to formatter.AbstractFormatter, which does something else to the HTML code. That result is "f", which is then passed to htmllib.HTMLParser so it can parse the HTML for links. I guess I don't understand in any great detail why this is happening. I know someone is going to say that I should RTFM, so here is the gist of the documentation:

formatter.DumbWriter: "This class is suitable for reflowing a sequence of paragraphs."

formatter.AbstractFormatter: "The standard formatter. This implementation has demonstrated wide applicability to many writers, and may be used directly in most circumstances. It has been used to implement a full-featured World Wide Web browser." <-- huh?

So, what are DumbWriter and AbstractFormatter doing with this HTML, and why does it need to happen before parser.feed() gets hold of it?

My next question is about the "anchorlist" attribute: I can't find any documentation that explains where it comes from. Here is the only reference to it that I can find anywhere in the Python documentation:

----------------------
anchor_bgn(href, name, type)

This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the HREF attribute for <A> tags) within the document. The list of hyperlinks is available as the data attribute anchorlist.
----------------------

So, how does an average developer figure out that the parser exposes its list of hyperlinks in an attribute called anchorlist? Is this something that you just "figure out", or is there some book I should be reading that documents all of the attributes of a particular class? It just seems a bit obscure, and certainly not something I would have figured out on my own. Does this make me a poor developer who should find another hobby? I just need to know if there is something wrong with me or if this is a reasonable question to ask.
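The closest I have come to answering that on my own is poking at the parser object in the interactive interpreter. I am assuming this is how people stumble onto attributes like anchorlist (the tiny page in the snippet is just something I made up to test with), but please correct me if there is a better way:

----------------------------------------------------------
# Poking around interactively to see what attributes the parser carries.
import htmllib, formatter
from cStringIO import StringIO

f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
parser = htmllib.HTMLParser(f)
parser.feed('<html><body><a href="http://www.google.com">g</a></body></html>')
parser.close()

print dir(parser)          # 'anchorlist' shows up in this list of attribute names
print parser.anchorlist    # ['http://www.google.com']
help(htmllib.HTMLParser)   # the anchor_bgn() docstring quoted above appears here
----------------------------------------------------------

That at least proves the attribute exists, but it still feels like guesswork rather than something I could have read about up front.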
The last question I have is about debugging. The spider is capable of parsing links until it reaches "html = get_page("http://www.google.com/jobs/fortune")", which returns the contents of a PDF document and assigns it to html, which is later passed to parser.feed(html), which crashes.

I'm smart enough to know that whenever you take in some input, you should do some basic type checking to make sure that whatever you are trying to manipulate (especially if it originates from outside of your application) won't cause your application to crash. If you're expecting an ASCII character, make sure you're not getting an object or a string of text. How would an experienced Python developer check the contents of "html" to make sure it really is a blob of HTML code and not something else? I should note an obvious catch-22: how do I check the HTML in such a way that the check itself can't crash the app? I thought about:

try:
    parser.feed(html)
except parser.HTMLParseError:
    parser.close()

... but I'm not sure if that is right or not. The app still crashes, so obviously I'm doing something wrong.

Here is the full app for your review. Thank you for any help you can provide! I greatly appreciate it!

#!/usr/bin/python

#these modules do most of the work
import sys
import urllib2
import urlparse
import htmllib, formatter
from cStringIO import StringIO

def log_stdout(msg):
    """Print msg to the screen."""
    print msg

def get_page(url, log):
    """Retrieve URL and return contents, log errors."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """Return a list of links in HTML."""
    #We're using the parser just to get the hrefs
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    parser.feed(html)
    parser.close()
    return parser.anchorlist

class Spider:
    """
    The heart of this program, finds all links within a web site.

    run() contains the main loop.
    process_page() retrieves each page and finds the links.
    """

    def __init__(self, startURL, log=None):
        #this method sets initial values
        self.URLs = set()            #create a set
        self.URLs.add(startURL)      #add the start url to the set
        self.include = startURL
        self._links_to_process = [startURL]
        if log is None:
            #use log_stdout function if no log provided
            self.log = log_stdout
        else:
            self.log = log

    def run(self):
        #process list of URLs one at a time
        while self._links_to_process:
            url = self._links_to_process.pop()
            self.log("Retrieving: " + url)
            self.process_page(url)

    def url_in_site(self, link):
        #checks whether the link starts with the base URL
        return link.startswith(self.include)

    def process_page(self, url):
        #retrieves page and finds links in it
        html = get_page(url, self.log)
        for link in find_links(html):
            #handle relative links
            link = urlparse.urljoin(url, link)
            self.log("Checking: " + link)
            #make sure this is a new URL within current site
            if link not in self.URLs and self.url_in_site(link):
                self.URLs.add(link)
                self._links_to_process.append(link)

if __name__ == '__main__':
    #this code runs when script is started from command line
    startURL = sys.argv[1]
    spider = Spider(startURL)
    spider.run()
    for URL in sorted(spider.URLs):
        print URL
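For what it's worth, here is the kind of check I was imagining, as a rough sketch rather than anything I trust. I am assuming that looking at the Content-Type header is a legitimate way to tell HTML apart from a PDF, and that htmllib.HTMLParseError is the exception a failed parse actually raises; both of those are guesses on my part:

----------------------------------------------------------
# Rough sketch of the two checks I was imagining (both are guesses on my part):
# 1) ask the server what kind of content it sent back, and skip anything
#    that is not HTML so PDF bytes never reach the parser at all
# 2) wrap parser.feed() so a page that will not parse is simply skipped
import urllib2
import htmllib, formatter
from cStringIO import StringIO

def get_page(url, log):
    """Retrieve URL and return its body only if the server says it is HTML."""
    try:
        page = urllib2.urlopen(url)
    except urllib2.URLError:
        log("Error retrieving: " + url)
        return ''
    content_type = page.info().gettype()   # e.g. 'text/html' or 'application/pdf'
    if content_type != 'text/html':
        log("Skipping non-HTML content (" + content_type + "): " + url)
        page.close()
        return ''
    body = page.read()
    page.close()
    return body

def find_links(html):
    """Return a list of links, or an empty list if the page will not parse."""
    f = formatter.AbstractFormatter(formatter.DumbWriter(StringIO()))
    parser = htmllib.HTMLParser(f)
    try:
        parser.feed(html)
        parser.close()
    except htmllib.HTMLParseError:
        return []
    return parser.anchorlist
----------------------------------------------------------

Would an experienced developer do it roughly this way, or is there a more idiomatic check I am missing?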