Have you had a look at Nutch (http://lucene.apache.org/nutch/)? It has solved this kind of problem.
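
If you would rather stay with plain MapReduce, the usual idea is to group URLs by host so that one task handles each host, and then space out the requests to that host. Here is a rough sketch of that idea, not Nutch's actual code. The class HostThrottle, its method names, and the use of the host name as a stand-in for the IP are all my own assumptions for illustration.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical helper: tracks when each host may next be contacted. */
class HostThrottle {
    private final long minDelayMillis;
    // host -> earliest time (ms) at which the next request may start
    private final Map<String, Long> nextAllowed = new HashMap<>();

    HostThrottle(long minDelayMillis) {
        this.minDelayMillis = minDelayMillis;
    }

    /** Lower-cased host part of the URL (assumption: used instead of the IP). */
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("No host in URL: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    /**
     * Reserves the next fetch slot for the URL's host and returns how many
     * milliseconds the caller should wait before fetching.
     */
    synchronized long reserve(String url, long nowMillis) {
        String host = hostOf(url);
        long earliest = Math.max(nowMillis, nextAllowed.getOrDefault(host, 0L));
        nextAllowed.put(host, earliest + minDelayMillis);
        return earliest - nowMillis;
    }

    /** Stable bucket for a host, so all of its URLs land in the same task. */
    static int partitionFor(String url, int numPartitions) {
        return (hostOf(url).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

The map step would emit the host as the key, partitionFor would back a custom Partitioner, and the task doing the fetching would sleep for whatever reserve returns before each request. Waiting inside the task is simpler than pushing the URL back onto the input, which MapReduce does not really support within a single job.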

Cheers,
Tom

On Wed, May 27, 2009 at 9:58 AM, John Clarke <clarke...@gmail.com> wrote:
> My current project is to gather stats from a lot of different documents.
> We are not indexing, just getting quite specific stats for each document.
> We gather 12 different stats from each document.
>
> Our requirements have changed somewhat now, originally it was working on
> documents from our own servers but now it needs to fetch other ones from
> quite a large variety of sources.
>
> My approach up to now was to have the map function simply take each filepath
> (or now URL) in turn, fetch the document, calculate the stats and output
> those stats.
>
> My new problem is some of the locations we are now visiting don't like their
> IP being hit multiple times in a row.
>
> Is it possible to check a URL against a visited list of IPs and if recently
> visited either wait for a certain amount of time or push it back onto the
> input stack so it will be processed later in the queue?
>
> Or is there a better way?
>
> Thanks,
> John