Thanks for your reply. Obviously you make several good points about Beautiful Soup and Queue. But here's the problem: even if I do nothing whatsoever with the threads beyond just visiting the urls with urllib2, the program chokes. If I replace
    else:
        ulock.acquire()
        print page.geturl()  # obviously, do something more useful here, eventually
        page.close()
        ulock.release()

with

    else:
        pass

then urllib2 starts raising URLErrors after the first 3-5 urls have been visited. Do you have any sense of what in the threads is corrupting urllib2's behavior?

Many thanks,

Robean

On May 1, 12:27 am, Paul Rubin <http://phr...@nospam.invalid> wrote:
> robean <st1...@gmail.com> writes:
> > reach the urls with urllib2. The actual program will involve fairly
> > elaborate scraping and parsing (I'm using Beautiful Soup for that) but
> > the example shown here is simplified and just confirms the url of the
> > site visited.
>
> Keep in mind Beautiful Soup is pretty slow, so if you're doing a lot
> of pages and have multiple cpu's, you probably want parallel processes
> rather than threads.
>
> > wrong? I am new to both threading and urllib2, so its possible that
> > the SNAFU is quite obvious..
> ...
> > ulock = threading.Lock()
>
> Without looking at the code for more than a few seconds, using an
> explicit lock like that is generally not a good sign. The usual
> Python style is to send all inter-thread communications through
> Queues. You'd dump all your url's into a queue and have a bunch of
> worker threads getting items off the queue and processing them. This
> really avoids a lot of lock-related headache. The price is that you
> sometimes use more threads than strictly necessary. Unless it's a LOT
> of extra threads, it's usually not worth the hassle of messing with
> locks.
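In the meantime, here is a rough sketch of the Queue-based worker approach you describe, just to check that I've understood it. The url list, the worker count, and the worker() name are placeholders of my own, and the error handling is purely illustrative:

    import threading
    import urllib2
    from Queue import Queue, Empty

    # Placeholder urls -- the real list goes here.
    urls = ["http://www.python.org", "http://www.example.com"]

    url_queue = Queue()
    for u in urls:
        url_queue.put(u)

    def worker():
        # Each worker pulls urls off the shared queue until it is drained.
        while True:
            try:
                url = url_queue.get_nowait()
            except Empty:
                return
            try:
                page = urllib2.urlopen(url)
                print page.geturl()  # the real scraping/parsing would go here
                page.close()
            except urllib2.URLError, e:
                print "failed: %s (%s)" % (url, e)

    threads = [threading.Thread(target=worker) for _ in range(5)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

If I follow your point correctly, the Queue does its own locking internally, so the explicit ulock drops out of the picture entirely.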