I am writing a program that involves visiting several hundred webpages and extracting specific information from their contents. I've written a modest 'test' example here that uses a multi-threaded approach to fetch the urls with urllib2. The actual program will involve fairly elaborate scraping and parsing (I'm using Beautiful Soup for that), but the example shown here is simplified and just confirms the url of the site visited.
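For context, the real parsing step will look roughly like this (a simplified sketch with Beautiful Soup 3; the tags here are just placeholders, not my actual targets):

    from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3 (Python 2)

    def parse_page(html):
        """Pull a couple of example fields out of a fetched page."""
        soup = BeautifulSoup(html)
        title = soup.find('title')
        if title:
            print title.string
        # collect every outgoing link on the page
        for a in soup.findAll('a', href=True):
            print a['href']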
Here's the problem: the script simply crashes after getting a couple of urls, and it takes a long time to run (slower than a non-threaded version that I wrote and ran). Can anyone figure out what I am doing wrong? I am new to both threading and urllib2, so it's possible that the SNAFU is quite obvious. The urls are stored in a text file that I read from; they are all valid, so there's no problem there. Here's the code:

    #!/usr/bin/python

    import urllib2
    import threading

    class MyThread(threading.Thread):
        """subclass threading.Thread to create Thread instances"""
        def __init__(self, func, args):
            threading.Thread.__init__(self)
            self.func = func
            self.args = args

        def run(self):
            self.func(*self.args)  # apply() is deprecated; call directly

    def get_info_from_url(url):
        """
        A dummy version of the function simply visits urls and prints
        the url of the page.
        """
        try:
            page = urllib2.urlopen(url)
        # HTTPError is a subclass of URLError, so it must be caught
        # first or the URLError clause swallows it (and e.reason may
        # not exist on an HTTPError, which kills the thread)
        except urllib2.HTTPError, e:
            print "**** error ****", e.code
        except urllib2.URLError, e:
            print "**** error ****", e.reason
        else:
            ulock.acquire()
            print page.geturl()  # obviously, do something more useful here, eventually
            page.close()
            ulock.release()

    ulock = threading.Lock()
    num_links = 10
    threads = []  # store threads here
    urls = []     # store urls here

    fh = open("links.txt", "r")
    for line in fh:
        urls.append(line.strip())
    fh.close()

    # collect threads
    for i in range(num_links):
        t = MyThread(get_info_from_url, (urls[i],))
        threads.append(t)

    # start the threads
    for i in range(num_links):
        threads[i].start()

    # wait for all threads to finish
    for i in range(num_links):
        threads[i].join()

    print "all done"
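In case it matters, I also sketched an alternative that uses a fixed pool of worker threads pulling urls from a Queue, instead of one thread per url. I haven't settled on it, but I gather it's the usual pattern for bounding the number of simultaneous connections:

    import Queue
    import threading
    import urllib2

    def worker(q):
        """Pull urls off the queue until the main thread exits."""
        while True:
            url = q.get()
            try:
                page = urllib2.urlopen(url)
                print page.geturl()
                page.close()
            except urllib2.URLError, e:
                print "**** error ****", url, e
            q.task_done()

    q = Queue.Queue()
    for i in range(5):  # a handful of workers, not one thread per url
        t = threading.Thread(target=worker, args=(q,))
        t.setDaemon(True)  # let the process exit even if workers block
        t.start()

    for url in urls:
        q.put(url)
    q.join()  # block until every queued url has been processed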