Hi. We are finally in the beta stage with our crawler and have tested it with a few hundred thousand urls. However, it performs worse than when we run it on a single local machine without connecting to a Hadoop JobTracker. Each crawl job is quite similar to a Nutch Fetcher job: it spawns X threads which all read from the same RecordReader and fetch the urls they are handed, roughly like the sketch below. However, I am not able to utilize all nine of our machines at the same time, which would really be preferable since this is an externally IO-bound job (remote servers).
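Here is a simplified sketch of what each crawl job looks like; the class name, the property name and the fetch() helper are placeholders rather than our real code:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

// One map task spawns a pool of fetcher threads that all pull urls
// from the same RecordReader, Nutch-Fetcher style.
public class CrawlMapRunner implements MapRunnable<LongWritable, Text, Text, Text> {

  private int numThreads;

  public void configure(JobConf job) {
    // fetcher threads per map task
    numThreads = job.getInt("crawler.threads.per.task", 10);
  }

  public void run(final RecordReader<LongWritable, Text> input,
                  final OutputCollector<Text, Text> output,
                  final Reporter reporter) throws IOException {
    Thread[] threads = new Thread[numThreads];
    for (int i = 0; i < numThreads; i++) {
      threads[i] = new Thread() {
        public void run() {
          LongWritable offset = new LongWritable();
          Text url = new Text();
          try {
            while (true) {
              // the RecordReader is shared by all threads, so reads are synchronized
              synchronized (input) {
                if (!input.next(offset, url)) {
                  return; // no more urls in this split
                }
              }
              String content = fetch(url.toString()); // remote IO, the slow part
              synchronized (output) {
                output.collect(new Text(url), new Text(content));
              }
              reporter.progress(); // tell the framework we are still alive
            }
          } catch (IOException e) {
            reporter.setStatus("fetch failed: " + e.getMessage());
          }
        }
      };
      threads[i].start();
    }
    // wait for all fetcher threads before the map task finishes
    for (int i = 0; i < numThreads; i++) {
      try {
        threads[i].join();
      } catch (InterruptedException e) {
        // keep waiting for the remaining threads
      }
    }
  }

  private String fetch(String url) throws IOException {
    return ""; // placeholder for the actual HTTP fetch
  }
}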
How can I, with a crawl list of just 9 urls (stupidly small, I know), make sure that every machine is used at least once? With a crawl list of 900, how can I make sure at least 100 of them are being crawled at the same time, spread across all machines? And so on for much bigger crawl lists (which is why we need Hadoop in the first place).

Just as I was writing this I launched a job where I manually set numMapTasks to 9, and it seems to be fruitful, quite a fast crawl actually :) However, I wonder if this is how I should think for all MapRunnables. The next job we run is PersistOutLinks, which goes through a massive list of source->target links and saves them in a DB. That list is at least a hundred times larger than the Fetcher list. Is it still smart to hardcode the value 9 as numMapTasks for that MapRunnable job, or should I create some form of InputFormat.getSplits() based on the crawl/outlink list sizes? Of course numMapTasks is not literally hardcoded; it is injected into the Configuration from a properties file, as in the sketch below.
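This is roughly how the fetch job is submitted today, again with placeholder property names and paths:

import java.io.FileInputStream;
import java.util.Properties;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Submits the fetch job; numMapTasks is read from a properties file
// rather than derived from the size of the crawl list.
public class FetchJobRunner {

  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.load(new FileInputStream("crawler.properties"));

    JobConf job = new JobConf(FetchJobRunner.class);
    job.setJobName("fetch");
    job.setInputFormat(TextInputFormat.class);    // one url per line
    job.setMapRunnerClass(CrawlMapRunner.class);  // the MapRunnable sketched above
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(0);                     // map-only job

    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Today this is effectively the number of machines (9). The question is
    // whether the same value makes sense for the much larger PersistOutLinks
    // job, or whether a custom InputFormat.getSplits() should derive the
    // number of splits from the size of the url/outlink list instead.
    int numMapTasks = Integer.parseInt(props.getProperty("crawler.num.map.tasks", "9"));
    job.setNumMapTasks(numMapTasks);

    JobClient.runJob(job);
  }
}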
Kindly
//Marcus

--
Marcus Herou
CTO and co-founder Tailsweep AB
+46702561312
[email protected]
http://www.tailsweep.com/
http://blogg.tailsweep.com/