On Jan 7, 5:38 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:
> Jorgen Grahn wrote:
> > On Thu, 2010-01-07, Marco Salden wrote:
> >> On Jan 6, 5:36 am, Philip Semanchuk <phi...@semanchuk.com> wrote:
> >>> On Jan 5, 2010, at 11:26 PM, aditya shukla wrote:
>
> >>>> Hello people,
> >>>> I have 5 directories corresponding to 5 different urls. I want to
> >>>> download images from those urls and place them in the respective
> >>>> directories. I have to extract the contents and download them
> >>>> simultaneously. I can extract the contents and do them one by one.
> >>>> My question is, for doing it simultaneously, do I have to use
> >>>> threads?
>
> >>> No. You could spawn 5 copies of wget (or curl or a Python program
> >>> that you've written). Whether or not that will perform better or be
> >>> easier to code, debug and maintain depends on the other aspects of
> >>> your program(s).
>
> >>> bye
> >>> Philip
>
> >> Yep, the easier and more straightforward the approach, the better:
> >> threads are always (programmers')-error-prone by nature.
> >> But my question would be: does it REALLY need to be simultaneous?
> >> The CPU/OS only has more overhead doing this in parallel with
> >> processes. Measuring sequential processing and then trying to
> >> optimize (e.g. for user response or whatever) would be my preferred
> >> way to go. Less = More.
>
> > Normally when you do HTTP in parallel over several TCP sockets, it
> > has nothing to do with CPU overhead. You just don't want every GET to
> > be delayed just because the server(s) are lazy responding to the
> > first few ones; or you might want to read the text of a web page and
> > the CSS before a few huge pictures have been downloaded.
>
> > His "I have to [do them] simultaneously" makes me want to ask "Why?".
>
> > If he's expecting *many* pictures, I doubt that the parallel download
> > will buy him much. Reusing the same TCP socket for all of them is
> > more likely to help, especially if the pictures aren't tiny. One
> > long-lived TCP connection is much more efficient than dozens of
> > short-lived ones.
>
> > Personally, I'd popen() wget and let it do the job for me.
>
> From my own experience:
>
> I wanted to download a number of webpages.
>
> I noticed that there was a significant delay before a site would reply,
> and an especially long delay for one of them, so I used a number of
> threads, each one reading a URL from a queue, performing the download,
> and then reading the next URL, until there were none left (actually,
> until it read the sentinel None, which it put back for the other
> threads).
>
> The result?
>
> A shorter total download time, because it could be downloading one
> webpage while waiting for another to reply.
>
> (Of course, I had to make sure that I didn't have too many threads,
> because that might've put too many demands on the website, not a nice
> thing to do!)
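As a rough illustration, a minimal Python 3 sketch of the queue-plus-sentinel
worker pattern MRAB describes; the URLs, thread count and output filenames are
placeholders, not anything from the original posts:

    import queue
    import threading
    import urllib.request

    # Placeholder URLs; in the original problem these would be the image
    # URLs for the 5 directories.
    URLS = [
        "http://example.com/a.jpg",
        "http://example.com/b.jpg",
        "http://example.com/c.jpg",
    ]
    NUM_THREADS = 5

    url_queue = queue.Queue()
    for url in URLS:
        url_queue.put(url)
    url_queue.put(None)           # sentinel: "nothing left to download"

    def worker():
        while True:
            url = url_queue.get()
            if url is None:
                url_queue.put(None)   # put the sentinel back for the other threads
                break
            filename = url.rsplit("/", 1)[-1]
            urllib.request.urlretrieve(url, filename)   # blocking download

    threads = [threading.Thread(target=worker) for _ in range(NUM_THREADS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

Philip's alternative needs no threads in your own code at all: spawn one wget
per URL, e.g. subprocess.Popen(["wget", "-P", directory, url]), and wait for
the child processes to finish.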
A fair few of my scripts require multiple uploads and downloads, and I always
use threads to do so.

I was using an API which was quite badly designed: I got a list of UserIds
from one API call and then had to query another API method to get info on each
of the UserIds from the first call. I could have used Twisted, but in the end
I just made a simple thread pool (30 threads and an in/out Queue).

The result? A *massive* speedup, even with the extra complication of waiting
until all the threads are done and then grouping the results together from the
output Queue.

Since then I always use native threads.

Tom
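A rough sketch of the kind of in/out-Queue thread pool Tom describes, with the
results gathered from the output Queue at the end; get_user_ids and
get_user_info are hypothetical stand-ins for the two API calls:

    import queue
    import threading

    NUM_WORKERS = 30

    def get_user_ids():
        # Hypothetical first API call: returns the list of UserIds.
        return ["u1", "u2", "u3"]

    def get_user_info(user_id):
        # Hypothetical second API call: returns details for one UserId.
        return {"id": user_id}

    in_q = queue.Queue()
    out_q = queue.Queue()

    for uid in get_user_ids():
        in_q.put(uid)

    def worker():
        while True:
            try:
                uid = in_q.get_nowait()   # queue was filled before the threads started
            except queue.Empty:
                return
            out_q.put(get_user_info(uid))

    threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                          # wait until all the threads are done...

    results = []
    while not out_q.empty():              # ...then group the results together
        results.append(out_q.get())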