Re: avoid custom crawler getting blocked

jason hadoop Wed, 27 May 2009 07:07:51 -0700

Random ordering helps; per-thread delays based on domain recency also
help.
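
As a rough illustration of those two ideas, here is a minimal Python sketch, not code from any particular crawler. The names DomainThrottle, wait_needed, mark_fetched and crawl_order are made up for this example, and the 5 second delay used below is an arbitrary assumption. Each fetcher thread would keep its own throttle instance, so this only spaces out requests within one thread, not across a whole cluster.

```python
import random
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Hypothetical helper: remembers when each host was last fetched."""

    def __init__(self, min_delay_seconds, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock          # injectable so the logic can be tested
        self.last_fetch = {}        # host -> time of the most recent fetch

    def wait_needed(self, url):
        """Seconds to wait before this URL's host may be fetched again."""
        last = self.last_fetch.get(urlparse(url).hostname)
        if last is None:
            return 0.0
        return max(0.0, self.min_delay - (self.clock() - last))

    def mark_fetched(self, url):
        """Record that this URL's host was fetched just now."""
        self.last_fetch[urlparse(url).hostname] = self.clock()


def crawl_order(urls, seed=None):
    """Return the URLs in random order so one host is less likely to cluster."""
    shuffled = list(urls)
    random.Random(seed).shuffle(shuffled)
    return shuffled
```

A fetch loop would then look something like this, with fetch() standing in for whatever download call is already in use:

```python
throttle = DomainThrottle(min_delay_seconds=5.0)   # assumed delay
for url in crawl_order(["http://a.example/1", "http://b.example/1"]):
    time.sleep(throttle.wait_needed(url))
    # fetch(url) would go here
    throttle.mark_fetched(url)
```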


On Wed, May 27, 2009 at 6:47 AM, Ken Krugler <kkrugler_li...@transpac.com> wrote:

>> My current project is to gather stats from a lot of different documents.
>> We are not indexing, just gathering quite specific stats for each
>> document.
>> We gather 12 different stats from each document.
>>
>> Our requirements have changed somewhat: originally it worked on
>> documents from our own servers, but now it needs to fetch others from
>> quite a large variety of sources.
>>
>> My approach up to now was to have the map function simply take each
>> filepath (or now URL) in turn, fetch the document, calculate the stats
>> and output those stats.
>>
>> My new problem is that some of the locations we are now visiting don't
>> like their IP being hit multiple times in a row.
>>
>> Is it possible to check a URL against a list of visited IPs and, if it
>> was recently visited, either wait for a certain amount of time or push
>> it back onto the input stack so it will be processed later in the queue?
>>
>> Or is there a better way?
>>
>
> Your use case is very similar to what we've been doing with Bixo. See
> http://bixo.101tec.com, and also
> http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf
>
> Short answer is that we group URLs by paid-level domain in a map (actually
> using a Cascading GroupBy operation), and use per-domain queues with
> multi-threaded fetchers to efficiently load pages in a reduce (a Cascading
> Buffer operation).
>
> -- Ken
> --
> Ken Krugler
> +1 530-210-6378
>
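
To make the grouping idea above a bit more concrete, here is a small Python sketch. It is not Bixo's code and does not use Cascading; it only shows the general shape of grouping URLs by domain and draining per-domain queues in turn. The function names are invented for this example, and domain_key is a deliberately naive stand-in for a real paid-level domain lookup (it just keeps the last two host labels, which is wrong for suffixes such as co.uk).

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


def domain_key(url):
    """Naive approximation of a paid-level domain: the last two host labels."""
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def group_by_domain(urls):
    """Build one FIFO queue of URLs per domain key."""
    queues = defaultdict(deque)
    for url in urls:
        queues[domain_key(url)].append(url)
    return queues


def round_robin(queues):
    """Yield one URL per domain per pass. Consumes the queues as it goes."""
    pending = deque(queues.items())
    while pending:
        key, queue = pending.popleft()
        yield queue.popleft()
        if queue:                       # domain still has work: requeue it
            pending.append((key, queue))
```

Once only one domain has URLs left, round_robin will return them back to back, so some per-domain delay is still needed on top of the interleaving.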



-- 
Alpha Chapters of my book on Hadoop are available
http://www.apress.com/book/view/9781430219422
www.prohadoopbook.com a community for Hadoop Professionals
