My current project is to gather stats from a lot of different documents.
We're not indexing, just gathering quite specific stats for each document.
We gather 12 different stats from each document.

Our requirements have changed somewhat: originally the job worked on
documents from our own servers, but now it needs to fetch documents from
quite a large variety of external sources.

My approach up to now has been to have the map function take each filepath
(or now URL) in turn, fetch the document, calculate the stats, and output
those stats.
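For illustration, the per-document step might look something like this. This is a minimal sketch, not the poster's actual code: the real job gathers 12 stats, and `computeStats` with its two toy stats (character and word counts) is an assumed stand-in.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the per-document stats step. The real job
// computes 12 different stats; two toy ones stand in here.
public class DocStats {

    static Map<String, Long> computeStats(String doc) {
        Map<String, Long> stats = new LinkedHashMap<>();
        stats.put("chars", (long) doc.length());
        stats.put("words", (long) doc.trim().split("\\s+").length);
        return stats;
    }

    public static void main(String[] args) {
        // In the real mapper this would be: fetch(url) -> computeStats -> emit
        System.out.println(computeStats("a small test document"));
    }
}
```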

My new problem is that some of the hosts we now visit object to their IP
being hit multiple times in a row.

Is it possible to check a URL against a list of recently visited IPs and,
if it was hit recently, either wait for a certain amount of time or push
it back onto the input queue so it gets processed later?
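The "push it back onto the queue" option can be sketched in plain Java. All the names here are assumptions (a real MapReduce job would spread this across tasks rather than run one loop): track the last visit time per host, and if a host was hit too recently, re-enqueue the URL at the back instead of fetching it now.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: process a URL queue politely by re-enqueueing
// URLs whose host was fetched too recently.
public class PoliteQueue {
    static final long MIN_GAP_MS = 200; // assumed per-host politeness delay

    // Drain the queue, returning URLs in the order they were actually fetched.
    static List<String> drain(Deque<String> queue) throws InterruptedException {
        Map<String, Long> lastVisit = new HashMap<>();
        List<String> fetched = new ArrayList<>();
        while (!queue.isEmpty()) {
            String url = queue.poll();
            String host = URI.create(url).getHost();
            long now = System.currentTimeMillis();
            Long prev = lastVisit.get(host);
            if (prev != null && now - prev < MIN_GAP_MS) {
                queue.add(url);   // too soon: push back for later
                Thread.sleep(10); // avoid busy-spinning when every host is blocked
                continue;
            }
            lastVisit.put(host, now);
            fetched.add(url);     // fetchAndComputeStats(url) in real code
        }
        return fetched;
    }

    public static void main(String[] args) throws InterruptedException {
        Deque<String> queue = new ArrayDeque<>(List.of(
            "http://a.example/1", "http://a.example/2", "http://b.example/1"));
        System.out.println(drain(queue));
    }
}
```

Note that the second `a.example` URL gets deferred past the `b.example` one, which is exactly the reordering the question asks about.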

Or is there a better way?

Your use case is very similar to what we've been doing with Bixo. See http://bixo.101tec.com, and also http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf

Short answer is that we group URLs by paid-level domain in a map (actually using a Cascading GroupBy operation), and use per-domain queues with multi-threaded fetchers to efficiently load pages in a reduce (a Cascading Buffer operation).
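The grouping half of that can be sketched in plain Java (this is not actual Bixo or Cascading code; grouping by host stands in for the GroupBy on paid-level domain, and each resulting list plays the role of one per-domain queue that a reduce-side Buffer would drain with its own politeness delay):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hedged sketch of the map-side grouping: bucket URLs by host so that
// all requests to one host land in the same queue, the way a shuffle
// would send them to the same reducer.
public class GroupByDomain {

    static Map<String, List<String>> groupByHost(List<String> urls) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String url : urls) {
            groups.computeIfAbsent(URI.create(url).getHost(),
                                   k -> new ArrayList<>()).add(url);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
            "http://a.example/1", "http://b.example/1", "http://a.example/2");
        // Each entry = one host, one fetch queue.
        groupByHost(urls).forEach(
            (host, queue) -> System.out.println(host + " -> " + queue));
    }
}
```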

-- Ken
--
Ken Krugler
+1 530-210-6378
