Random ordering of the URLs helps. Per-thread delays based on how recently a domain was hit also help.
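
A rough sketch of what I mean, in Python for brevity. `DomainThrottle`, `fetch_all`, and the 5-second `min_delay` are made-up names and values for illustration, not anything from Hadoop or Bixo. It also assumes the URL's hostname is a good enough stand-in for the server's IP.

```python
import random
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Tracks when each host was last hit and reports how long to wait."""

    def __init__(self, min_delay=5.0, clock=time.monotonic):
        self.min_delay = min_delay  # assumed politeness interval, in seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_hit = {}          # host -> time of the most recent fetch

    def wait_time(self, url):
        """Seconds to wait before `url` may be fetched (0.0 if it is due now)."""
        last = self.last_hit.get(urlparse(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def mark_fetched(self, url):
        """Record that the host of `url` was just hit."""
        self.last_hit[urlparse(url).hostname] = self.clock()


def fetch_all(urls, fetch, throttle, sleep=time.sleep, rng=random):
    """Fetch every URL in random order, pausing only for recently hit hosts."""
    pending = list(urls)
    rng.shuffle(pending)  # random order spreads out URLs from the same host
    results = {}
    for url in pending:
        delay = throttle.wait_time(url)
        if delay > 0:
            sleep(delay)  # delay based on how recently this host was hit
        results[url] = fetch(url)
        throttle.mark_fetched(url)
    return results
```

Each thread (or map task) would keep its own `DomainThrottle`, so this only limits the hit rate from that one worker. It does not coordinate between tasks running on different nodes.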

On Wed, May 27, 2009 at 6:47 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

>> My current project is to gather stats from a lot of different documents.
>> We are not indexing, just getting quite specific stats for each document.
>> We gather 12 different stats from each document.
>>
>> Our requirements have changed somewhat now, originally it was working on
>> documents from our own servers but now it needs to fetch other ones from
>> quite a large variety of sources.
>>
>> My approach up to now was to have the map function simply take each
>> filepath (or now URL) in turn, fetch the document, calculate the stats
>> and output those stats.
>>
>> My new problem is some of the locations we are now visiting don't like
>> their IP being hit multiple times in a row.
>>
>> Is it possible to check a URL against a visited list of IPs and if
>> recently visited either wait for a certain amount of time or push it back
>> onto the input stack so it will be processed later in the queue?
>>
>> Or is there a better way?
>>
>
> Your use case is very similar to what we've been doing with Bixo. See
> http://bixo.101tec.com, and also
> http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf
>
> Short answer is that we group URLs by paid-level domain in a map (actually
> using a Cascading GroupBy operation), and use per-domain queues with
> multi-threaded fetchers to efficiently load pages in a reduce (a Cascading
> Buffer operation).
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378
>

--
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals