Rem, what OS are you trying this on? Windows XP SP2 has a limit of around 40 TCP connections per second...
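If that limit is what you're hitting, the usual workaround is to cap how many sockets the crawler has open at once rather than letting every worker thread dial out at will. A minimal sketch of the idea (the fetch() name and the limit are mine, not anything from Spider.py):

import threading
import urllib2

MAX_CONNECTIONS = 8                      # stay well under the OS ceiling
_slots = threading.Semaphore(MAX_CONNECTIONS)

def fetch(url):
    """Download a page while holding one of the limited connection slots."""
    _slots.acquire()
    try:
        return urllib2.urlopen(url).read()
    finally:
        _slots.release()

Each worker calls fetch() instead of urlopen() directly, so at most MAX_CONNECTIONS sockets are ever open at the same time.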
Remarkable wrote:
> Hello all
>
> I am trying to write a reliable web-crawler. I tried to write my own
> using recursion and found I quickly hit the "too many open sockets"
> problem. So I looked for a threaded version that I could easily extend.
>
> The simplest/most reliable I found was called Spider.py (see attached).
>
> At this stage I want a spider that I can point at a site, let it do
> its thing, and reliably get a callback of sorts... including the html
> (for me to parse), the url of the page in question (so I can log it)
> and the urls found on that page (so I can strip out any ones I really
> don't want and add them to the "seen" list).
>
> Now, this is my question.
>
> The code above ALMOST works fine. The crawler crawls, I get the data I
> need, BUT... every now and again the code just pauses; I hit control-C
> and it reports an error as if it has hit an exception, and then carries
> on!!! I like the fact that my spider_usage.py file has the minimum
> amount of spider stuff in it... really just a main() and a handle()
> handler.
>
> How does this happen... is a thread being killed and then a new one
> made, or what? I suspect it may have something to do with sockets
> timing out, but I have no idea...
>
> By the way, on small sites (100s of pages) it never gets to the stall;
> it's on larger sites such as Amazon that it "fails".
>
> This is my other question.
>
> It would be great to know, when the code is stalled, whether it is
> doing anything... is there any way to even print a full stop to the
> screen?
>
> This is my last question.
>
> Given Python's suitability for this sort of thing (isn't Google written
> in it?) I can't believe that there isn't a kick-ass crawler already out
> there...
>
> regards
>
> tom
>
> http://www.theotherblog.com/Articles/2006/08/04/python-web-crawler-spider/
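On the stalling: by default urllib2 reads have no timeout, so a server that stops responding will block the worker thread forever, which is exactly the kind of silent pause you describe (and your Ctrl-C is most likely being caught by an except clause inside the spider, which is why it "carries on"). Setting a global socket timeout turns the hang into an exception you can log and skip. A sketch of what I mean, assuming you can wrap whatever call does the download (fetch_page() is just an illustrative name):

import socket
import urllib2

socket.setdefaulttimeout(30)        # seconds; applies to every new socket

def fetch_page(url):
    try:
        return urllib2.urlopen(url).read()
    except (socket.timeout, urllib2.URLError), e:
        # log and move on instead of hanging the worker thread
        print "skipping %s: %s" % (url, e)
        return None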
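On seeing whether a stalled crawl is still alive: the cheapest heartbeat is a daemon thread that prints a dot every few seconds, and if you are on Python 2.5 you can also dump a stack trace for every live thread to see exactly where they are blocked. A standalone sketch, not tied to Spider.py:

import sys
import time
import threading
import traceback

def start_heartbeat(interval=5):
    """Print a full stop every few seconds so a long crawl shows signs of life."""
    def beat():
        while True:
            sys.stderr.write('.')
            sys.stderr.flush()
            time.sleep(interval)
    t = threading.Thread(target=beat)
    t.setDaemon(True)          # don't keep the process alive on exit
    t.start()

def dump_threads():
    """Print a traceback for every live thread (needs Python 2.5's sys._current_frames)."""
    for thread_id, frame in sys._current_frames().items():
        print >> sys.stderr, "\n--- thread %d ---" % thread_id
        traceback.print_stack(frame)

Call start_heartbeat() at the top of main(); when the dots stop, you know the workers really are wedged, and dump_threads() will tell you where.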