Hi Josh,
It seemed like you had a conceptual wire crossed, and I'm glad to help out. The neat thing about Hadoop mappers is that, since each one is given a replicated HDFS block to munch on, the job scheduler has <replication factor> candidate nodes where it can run that mapper. In practice this means mappers are almost always reading from local storage.
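
Just to make the lifecycle concrete, here is a minimal sketch (using the newer org.apache.hadoop.mapreduce API; the class name and the comma-separated input format are made up for illustration, not taken from your code): one mapper instance handles an entire split, setup() and cleanup() run once per split, and map() is called once per record, so there is no per-record task startup or teardown.

import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One mapper instance is created per input split (normally one HDFS block);
// setup() and cleanup() run once per split, and map() runs once per record.
public class VectorSumMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  private long recordsSeen = 0;  // per-split state, survives across map() calls

  @Override
  protected void setup(Context context) {
    // Runs once, before the first record of the split.
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    // Runs once per record, reading from the (usually node-local) block.
    recordsSeen++;
    double sum = 0.0;
    for (String field : line.toString().split(",")) {
      sum += Double.parseDouble(field.trim());
    }
    context.write(new Text("sum"), new DoubleWritable(sum));
  }

  @Override
  protected void cleanup(Context context) {
    // Runs once, after the last record of the split.
  }
}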

On another note, I notice you are processing what looks to be large quantities of vector data. If you have any interest in clustering this data, you might want to look at the Mahout project (http://lucene.apache.org/mahout/). We have a number of Hadoop-ready clustering algorithms, including a new non-parametric Dirichlet Process Clustering implementation that I committed recently. We are pulling it all together for a 0.1 release, and I would be very interested in helping you apply these algorithms.

Jeff


Patterson, Josh wrote:
Jeff,
ok, that makes more sense. I was under the mistaken impression that it was creating
and destroying mappers for each input record; I don't know why I had that in my
head. My design suddenly became a lot clearer, and this provides a much cleaner
abstraction. Thanks for your help!

Josh Patterson
TVA

