Hi. We are finally in the beta stage with our crawler and have tested it with a few hundred thousand urls. However, it performs worse than when we run it on a single local machine without connecting to a Hadoop JobTracker. Each crawl job is quite similar to a Nutch Fetcher job: it spawns X threads which all read from the same RecordReader and fetch the urls they are handed, roughly like the sketch below. However, I am not able to utilize all nine of our machines at the same time, which would really be preferable since this is an externally IO-bound job (remote servers).
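Here is a simplified sketch of what each crawl job looks like; the class name, the property name and the fetch() helper are placeholders rather than our real code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// One map task spawns a pool of fetcher threads that all pull urls
// from the same RecordReader, Nutch-Fetcher style.
public class CrawlMapRunner implements MapRunnable<LongWritable, Text, Text, Text> {

  private int numThreads;

  public void configure(JobConf job) {
    // fetcher threads per map task
    numThreads = job.getInt("crawler.threads.per.task", 10);
  }

  public void run(final RecordReader<LongWritable, Text> input,
                  final OutputCollector<Text, Text> output,
                  final Reporter reporter) throws IOException {
    Thread[] threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      threads[i] = new Thread() {
        public void run() {
          LongWritable offset = new LongWritable();
          Text url = new Text();
          try {
            while (true) {
              // the RecordReader is shared by all threads, so reads are synchronized
              synchronized (input) {
                if (!input.next(offset, url)) {
                  return; // no more urls in this split
                }
              }
              String content = fetch(url.toString()); // remote IO, the slow part
              synchronized (output) {
                output.collect(new Text(url), new Text(content));
              }
              reporter.progress(); // tell the framework we are still alive
            }
          } catch (IOException e) {
            reporter.setStatus("fetch failed: " + e.getMessage());
          }
        }
      };
      threads[i].start();
    }
    // wait for all fetcher threads before the map task finishes
    for (int i = 0; i < numThreads; i++) {
      try {
        threads[i].join();
      } catch (InterruptedException e) {
        // keep waiting for the remaining threads
      }
    }
  }

  private String fetch(String url) throws IOException {
    return ""; // placeholder for the actual HTTP fetch
  }
}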
How can I, with a crawl list of just 9 urls (stupidly small, I know), make sure that every machine is used at least once? With a crawl list of 900, how can I make sure at least 100 of them are being crawled at the same time, spread across all machines? And so on for much bigger crawl lists (which is why we need Hadoop in the first place).

Just as I was writing this I launched a job where I manually set numMapTasks to 9, and it seems to be fruitful, quite a fast crawl actually :) However, I wonder if this is how I should think for all MapRunnables. The next job we run is PersistOutLinks, which goes through a massive list of source->target links and saves them in a DB. That list is at least a hundred times larger than the Fetcher list. Is it still smart to hardcode the value 9 as numMapTasks for that MapRunnable job, or should I create some form of InputFormat.getSplits() based on the crawl/outlink list sizes? Of course numMapTasks is not literally hardcoded; it is injected into the Configuration from a properties file, as in the sketch below.
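This is roughly how the fetch job is submitted today, again with placeholder property names and paths:

import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Submits the fetch job; numMapTasks is read from a properties file
// rather than derived from the size of the crawl list.
public class FetchJobRunner {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.load(new FileInputStream("crawler.properties"));

    JobConf job = new JobConf(FetchJobRunner.class);
    job.setJobName("fetch");
    job.setInputFormat(TextInputFormat.class);    // one url per line
    job.setMapRunnerClass(CrawlMapRunner.class);  // the MapRunnable sketched above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);                     // map-only job

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Today this is effectively the number of machines (9). The question is
    // whether the same value makes sense for the much larger PersistOutLinks
    // job, or whether a custom InputFormat.getSplits() should derive the
    // number of splits from the size of the url/outlink list instead.
    int numMapTasks = Integer.parseInt(props.getProperty("crawler.num.map.tasks", "9"));
    job.setNumMapTasks(numMapTasks);

    JobClient.runJob(job);
  }
}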
Kindly
//Marcus

--
Marcus Herou
CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
http://blogg.tailsweep.com/