Hi Josh,
It sounded like you had a conceptual wire crossed, and I'm glad to help
out. The neat thing about Hadoop mappers is that, since each one is given a
replicated HDFS block to munch on, the job scheduler has <replication
factor> choices of node on which to run it. That means mappers can almost
always read their input from local storage.
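If you ever want to see those choices for yourself, a little client like the
sketch below lists the datanodes holding a replica of each block of a file;
those hosts are exactly the candidates the scheduler can pick from when it
places that block's mapper. The path is just a made-up example, and the API
details can vary a bit between Hadoop versions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Made-up input path, for illustration only.
    Path input = new Path("/user/josh/vectors.seq");
    FileStatus status = fs.getFileStatus(input);

    // One BlockLocation per HDFS block; each one lists the datanodes
    // holding a replica of that block.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.println("block at offset " + block.getOffset() + ":");
      for (String host : block.getHosts()) {
        // Any of these hosts is a candidate node for the block's mapper.
        System.out.println("  replica on " + host);
      }
    }
    fs.close();
  }
}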
On another note, I notice you are processing what look to be large
quantities of vector data. If you have any interest in clustering that
data, you might want to look at the Mahout project
(http://lucene.apache.org/mahout/). We have a number of Hadoop-ready
clustering algorithms, including a new non-parametric Dirichlet Process
Clustering implementation that I committed recently. We are pulling it
all together for a 0.1 release, and I would be very interested in helping
you apply these algorithms.
Jeff
Patterson, Josh wrote:
Jeff,
OK, that makes more sense. I was under the mistaken impression that it was
creating and destroying a mapper for each input record; I don't know why I had
that in my head. My design suddenly became a lot clearer, and this provides a
much cleaner abstraction. Thanks for your help!
Josh Patterson
TVA
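To make the task lifecycle concrete: the framework creates one mapper task per
input split (typically one HDFS block) and calls map() once for every record in
that split, reusing the same mapper instance throughout. A skeleton looks
roughly like the sketch below; the class name and key/value types are just
placeholders, not your actual job, and details differ a little between the old
and new Hadoop APIs.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void setup(Context context) {
    // Runs once per task, before any records from the split are processed.
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per record in the split; the same mapper instance is
    // reused for every record, not created and destroyed each time.
    context.write(new Text("record"), value);
  }

  @Override
  protected void cleanup(Context context) {
    // Runs once per task, after the last record of the split.
  }
}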