Hi Josh,
It sounded like you had a conceptual wire crossed, and I'm glad to help
out. The neat thing about Hadoop mappers is that, since each one is given a
replicated HDFS block to munch on, the job scheduler has <replication
factor> choices of node on which to run it. That means mappers can almost
always read their input from local storage.
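If you ever want to see those choices for yourself, a little client like the
sketch below lists the datanodes holding a replica of each block of a file;
those hosts are exactly the candidates the scheduler can pick from when it
places that block's mapper. The path is just a made-up example, and the API
details can vary a bit between Hadoop versions.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockHosts {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Made-up input path, for illustration only.
    Path input = new Path("/user/josh/vectors.seq");
    FileStatus status = fs.getFileStatus(input);

    // One BlockLocation per HDFS block; each one lists the datanodes
    // holding a replica of that block.
    BlockLocation[] blocks =
        fs.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation block : blocks) {
      System.out.println("block at offset " + block.getOffset() + ":");
      for (String host : block.getHosts()) {
        // Any of these hosts is a candidate node for the block's mapper.
        System.out.println("  replica on " + host);
      }
    }
    fs.close();
  }
}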
On another note, I notice you are processing what look to be large
quantities of vector data. If you have any interest in clustering that
data, you might want to look at the Mahout project
(http://lucene.apache.org/mahout/). We have a number of Hadoop-ready
clustering algorithms, including a new non-parametric Dirichlet Process
Clustering implementation that I committed recently. We are pulling it
all together for a 0.1 release, and I would be very interested in helping
you apply these algorithms.
Jeff
Patterson, Josh wrote:
Jeff,
OK, that makes more sense. I was under the mistaken impression that it was
creating and destroying a mapper for each input record; I don't know why I had
that in my head. My design suddenly became a lot clearer, and this provides a
much cleaner abstraction. Thanks for your help!
Josh Patterson
TVA
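To make the task lifecycle concrete: the framework creates one mapper task per
input split (typically one HDFS block) and calls map() once for every record in
that split, reusing the same mapper instance throughout. A skeleton looks
roughly like the sketch below; the class name and key/value types are just
placeholders, not your actual job, and details differ a little between the old
and new Hadoop APIs.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordMapper extends Mapper<LongWritable, Text, Text, Text> {

  @Override
  protected void setup(Context context) {
    // Runs once per task, before any records from the split are processed.
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // Called once per record in the split; the same mapper instance is
    // reused for every record, not created and destroyed each time.
    context.write(new Text("record"), value);
  }

  @Override
  protected void cleanup(Context context) {
    // Runs once per task, after the last record of the split.
  }
}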