rzimerman wrote:
> I'm hoping to write a program that will read any number of urls from
> stdin (1 per line), download them, and process them. So far my script
> (below) works well for small numbers of urls. However, it does not
> scale to more than 200 urls or so, because it issues HTTP requests for
> all of the urls simultaneously, and terminates after 25 seconds.
> Ideally, I'd like this script to download at most 50 pages in parallel,
> and to time out if and only if any HTTP request is not answered in 3
> seconds. What changes do I need to make?
>
Take a look at
http://svn.twistedmatrix.com/cvs/trunk/doc/core/examples/stdiodemo.py?view=markup&rev=15456

and read
http://twistedmatrix.com/documents/current/api/twisted.web.client.HTTPClientFactory.html

You can pass a timeout to the HTTPClientFactory constructor (getPage
forwards its keyword arguments there).

To download at most 50 pages in parallel you can use a download queue.
Here is a quick example, ABSOLUTELY NOT TESTED:

from twisted.internet.defer import Deferred, DeferredList
from twisted.web.client import getPage


class DownloadQueue(object):
    SIZE = 50

    def __init__(self):
        self.requests = []   # queued requests (url, timeout, deferred)
        self.deferreds = []  # requests currently in flight

    def addRequest(self, url, timeout):
        if len(self.deferreds) >= self.SIZE:
            # wait for completion of all previous requests
            DeferredList(self.deferreds).addCallback(self._callback)
            self.deferreds = []

            # queue the request
            deferred = Deferred()
            self.requests.append((url, timeout, deferred))
            return deferred
        else:
            # execute the request now
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)
            return deferred

    def _callback(self, results):
        # take at most SIZE queued requests and start them
        if len(self.requests) > self.SIZE:
            queue = self.requests[:self.SIZE]
            self.requests = self.requests[self.SIZE:]
        else:
            queue = self.requests[:]
            self.requests = []

        for (url, timeout, deferredHelper) in queue:
            deferred = getPage(url, timeout=timeout)
            self.deferreds.append(deferred)
            deferred.chainDeferred(deferredHelper)


Regards
Manlio Perillo
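
P.S. If it helps, here is a rough, equally untested sketch of how the queue
might be driven from your script: read the urls from stdin up front, hand
them to a DownloadQueue with a 3 second timeout, and stop the reactor once
every download has either succeeded or failed. handlePage and handleError
are just placeholders for your own processing, not part of Twisted.

import sys
from twisted.internet import reactor
from twisted.internet.defer import DeferredList

def handlePage(html, url):
    # process the downloaded page here
    print url, len(html)

def handleError(failure, url):
    # a timeout shows up here as a failure
    print url, failure.getErrorMessage()

def main():
    urls = [line.strip() for line in sys.stdin if line.strip()]
    queue = DownloadQueue()

    deferreds = []
    for url in urls:
        d = queue.addRequest(url, timeout=3)
        d.addCallback(handlePage, url)
        d.addErrback(handleError, url)
        deferreds.append(d)

    # shut down once every download has either succeeded or failed
    DeferredList(deferreds).addCallback(lambda _: reactor.stop())
    reactor.run()

if __name__ == '__main__':
    main()

With SIZE = 50 and timeout=3 this should keep at most 50 requests in
flight and fail any request that is not answered within 3 seconds, which
is what you asked for.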