Hi,

The while loop in the main thread uses all the CPU time. It would be better to put a short sleep in each iteration, or to use some synchronization method between the threads.
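A minimal sketch of the synchronization option, in case it helps. It assumes Python 3 module names (`queue` instead of the `Queue` module in your script), and `process`, `run` and `NUM_WORKERS` are placeholder names I made up to stand in for the download code. Each worker calls `task_done()` and the main thread blocks in `join()` instead of polling `qsize()`:

```python
import queue
import threading

NUM_WORKERS = 4  # hypothetical value, only for this sketch


def process(item):
    # Placeholder for the real work (fetching one project page).
    return item * 2


def run(ids, num_workers=NUM_WORKERS):
    tasks = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            item = tasks.get()      # blocks while the queue is empty
            try:
                value = process(item)
                with lock:
                    results.append(value)
            finally:
                tasks.task_done()   # tell join() this item is finished

    for _ in range(num_workers):
        threading.Thread(target=worker, daemon=True).start()

    for item in ids:
        tasks.put(item)

    tasks.join()  # waits until every queued item has been marked done
    return sorted(results)


if __name__ == "__main__":
    print(run(range(1, 6)))
```

Unlike a check of `qsize()`, `join()` also waits for items that a worker has already taken from the queue but has not finished yet, so the fixed three-second sleep at the end is not needed.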
About your questions, IMO:

> --- Are my setup and use of threads, the queue, and "while True" loop
> correct or conventional?

There may be other ways to do it, but this one is good. Another question is whether iterating over all 240000 numbers is the most efficient way to retrieve all the projects.

> --- Should the program sleep sometimes, to be nice to the SourceForge
> servers, and so they don't think this is a denial-of-service attack?

You are already limiting the number of connections with your concurrent threads. I don't believe SourceForge's capacity is smaller than the number of requests your concurrent threads make.

> --- Someone told me that popen is not thread-safe, and to use
> mechanize. I installed it and followed an example on the web site.
> There wasn't a good description of it on the web site, or I didn't
> find it. Could someone explain what mechanize does?

I don't know, but if you are not sure about it you can use urllib2.

> --- How do I choose the number of threads? I am using a MacBook Pro
> 2.4GHz Intel Core 2 Duo with 4 GB 667 MHz DDR2 SDRAM, running OS
> 10.5.3.

By default, pthreads on Linux reserve 8 MB for each thread's stack; I don't know about your MacBook. I think between 64 and 128 threads is fine.

> Thank you.
>
> Winston
>
>
>
> #!/usr/bin/env python
>
> # Winston C. Yang
> # Created 2008-06-14
>
> from __future__ import with_statement
>
> import mechanize
> import os
> import Queue
> import re
> import sys
> import threading
> import time
>
> lock = threading.RLock()
>
> # Make the dot match even a newline.
> error_pattern = re.compile(".*\n<!--pageid login -->\n.*", re.DOTALL)
>
> def now():
>     return time.strftime("%Y-%m-%d %H:%M:%S")
>
> def worker():
>
>     while True:
>
>         try:
>             id = queue.get()
>         except Queue.Empty:
>             continue
>
>         request = mechanize.Request("http://sourceforge.net/project/"\
>                                     "memberlist.php?group_id=%d" % id)
>         response = mechanize.urlopen(request)
>         text = response.read()
>
>         valid_id = not error_pattern.match(text)
>
>         if valid_id:
>             f = open("%d.csv" % id, "w+")
>             f.write(text)
>             f.close()
>
>         with lock:
>             print "\t".join((str(id), now(), "+" if valid_id else "-"))
>
> def fatal_error():
>     print "usage: python application start_id end_id"
>     print
>     print "Get the usernames associated with each SourceForge project with"
>     print "ID between start_id and end_id, inclusive."
>     print
>     print "start_id and end_id must be positive integers and satisfy"
>     print "start_id <= end_id."
>     sys.exit(1)
>
> if __name__ == "__main__":
>
>     if len(sys.argv) == 3:
>
>         try:
>             start_id = int(sys.argv[1])
>
>             if start_id <= 0:
>                 raise Exception
>
>             end_id = int(sys.argv[2])
>
>             if end_id < start_id:
>                 raise Exception
>         except:
>             fatal_error()
>     else:
>         fatal_error()
>
>     # Print the start time.
>     start_time = now()
>     print start_time
>
>     # Create a directory whose name contains the start time.
>     dir = start_time.replace(" ", "_").replace(":", "_")
>     os.mkdir(dir)
>     os.chdir(dir)
>
>     queue = Queue.Queue(0)
>
>     for i in xrange(32):
>         t = threading.Thread(target=worker, name="worker %d" % (i + 1))
>         t.setDaemon(True)
>         t.start()
>
>     for id in xrange(start_id, end_id + 1):
>         queue.put(id)
>
>     # When the queue has size zero, exit in three seconds.
>     while True:
>         if queue.qsize() == 0:
>             time.sleep(3)
>             break
>
>     print now()
> --
> http://mail.python.org/mailman/listinfo/python-list
>

--
Pau Freixes
Linux GNU/User
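P.S. On the stack memory per thread: a small sketch, assuming Python 3 and a platform that allows the thread stack size to be changed. `start_workers` is a made-up helper name and 256 KiB is only an illustrative value, not a tuned one. `threading.stack_size()` sets the stack size for threads started after the call:

```python
import threading


def start_workers(target, count, stack_bytes=256 * 1024):
    """Start `count` daemon threads with a reduced stack size."""
    # stack_size(n) affects threads started afterwards and returns
    # the previous setting (0 means the platform default).
    previous = threading.stack_size(stack_bytes)
    try:
        threads = [threading.Thread(target=target, daemon=True)
                   for _ in range(count)]
        for t in threads:
            t.start()
    finally:
        # Restore the earlier setting for any later threads.
        threading.stack_size(previous)
    return threads
```

With a smaller stack, 64 to 128 worker threads need much less memory than with an 8 MB default, but a value that is too small can crash deeply nested code, so it is worth testing on your machine.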