My current project is to gather stats from a lot of different documents.
We're not indexing, just gathering quite specific stats for each document.
We gather 12 different stats from each document.

Our requirements have changed somewhat: originally the job worked on
documents from our own servers, but now it needs to fetch documents from
quite a large variety of external sources.

My approach up to now has been to have the map function take each filepath
(or now URL) in turn, fetch the document, calculate the stats, and output
those stats.
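For illustration, the per-document step might look something like this. This is a minimal sketch, not the poster's actual code: the real job gathers 12 stats, and `computeStats` with its two toy stats (character and word counts) is an assumed stand-in.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the per-document stats step. The real job
// computes 12 different stats; two toy ones stand in here.
public class DocStats {

    static Map<String, Long> computeStats(String doc) {
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("chars", (long) doc.length());
        stats.put("words", (long) doc.trim().split("\\s+").length);
        return stats;
    }

    public static void main(String[] args) {
        // In the real mapper this would be: fetch(url) -> computeStats -> emit
        System.out.println(computeStats("a small test document"));
    }
}
```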

My new problem is that some of the hosts we now visit object to their IP
being hit multiple times in a row.

Is it possible to check a URL against a list of recently visited IPs and,
if it was hit recently, either wait for a certain amount of time or push
it back onto the input queue so it gets processed later?
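The "push it back onto the queue" option can be sketched in plain Java. All the names here are assumptions (a real MapReduce job would spread this across tasks rather than run one loop): track the last visit time per host, and if a host was hit too recently, re-enqueue the URL at the back instead of fetching it now.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: process a URL queue politely by re-enqueueing
// URLs whose host was fetched too recently.
public class PoliteQueue {
    static final long MIN_GAP_MS = 200; // assumed per-host politeness delay

    // Drain the queue, returning URLs in the order they were actually fetched.
    static List<String> drain(Deque<String> queue) throws InterruptedException {
        Map<String, Long> lastVisit = new HashMap<>();
        List<String> fetched = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String host = URI.create(url).getHost();
            long now = System.currentTimeMillis();
            Long prev = lastVisit.get(host);
            if (prev != null && now - prev < MIN_GAP_MS) {
                queue.add(url);   // too soon: push back for later
                Thread.sleep(10); // avoid busy-spinning when every host is blocked
                continue;
            }
            lastVisit.put(host, now);
            fetched.add(url);     // fetchAndComputeStats(url) in real code
        }
        return fetched;
    }

    public static void main(String[] args) throws InterruptedException {
        Deque<String> queue = new ArrayDeque<>(List.of(
            "http://a.example/1", "http://a.example/2", "http://b.example/1"));
        System.out.println(drain(queue));
    }
}
```

Note that the second `a.example` URL gets deferred past the `b.example` one, which is exactly the reordering the question asks about.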

Or is there a better way?

Your use case is very similar to what we've been doing with Bixo. See http://bixo.101tec.com, and also http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf

Short answer is that we group URLs by paid-level domain in a map (actually using a Cascading GroupBy operation), and use per-domain queues with multi-threaded fetchers to efficiently load pages in a reduce (a Cascading Buffer operation).
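The grouping half of that can be sketched in plain Java (this is not actual Bixo or Cascading code; grouping by host stands in for the GroupBy on paid-level domain, and each resulting list plays the role of one per-domain queue that a reduce-side Buffer would drain with its own politeness delay):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of the map-side grouping: bucket URLs by host so that
// all requests to one host land in the same queue, the way a shuffle
// would send them to the same reducer.
public class GroupByDomain {

    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(URI.create(url).getHost(),
                                   k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://b.example/1", "http://a.example/2");
        // Each entry = one host, one fetch queue.
        groupByHost(urls).forEach(
            (host, queue) -> System.out.println(host + " -> " + queue));
    }
}
```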

-- Ken
--
Ken Krugler
+1 530-210-6378
