Dear Hadoop Users,

I am a newcomer to the Map-Reduce world. Please excuse my ignorance.

I have two Map-Reduce phases. The first phase is the WordCount
example. In the second phase, besides the regular input data, the Map
function also needs the word-frequency table produced by the first
phase.

The word-frequency table is small enough to fit into memory.
Moreover, the first phase uses only one reducer, so all the data
ends up in a single file in HDFS.

My question is, what options do I have to efficiently get the
word-frequency table to the map function of the second phase?

One option is to access HDFS from the map function and read the
file produced by the first Map-Reduce phase. More precisely, I would
read the file in the "setup" function. With this option, the machine
that stores the file would become a bottleneck: when the second
phase starts, all the map instances will contact that machine to get
the file. Is there any way to overcome this bottleneck?
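
To make this option concrete, here is a rough sketch of the loading
step I have in mind. It is plain Java with no Hadoop dependency, and
it assumes the first phase writes one "word<TAB>count" pair per line.
The class and method names are hypothetical. In the real "setup"
function, I assume the reader would wrap the stream returned by
FileSystem.open() on the output file of the first phase.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Hadoop: loads a word-frequency
// table into memory from lines of the form "word<TAB>count".
class WordFrequencyTable {

    // Reads every line from the reader and returns a word -> count map.
    // Lines without a tab (for example blank lines) are skipped.
    // A non-numeric count raises NumberFormatException.
    static Map<String, Long> parse(BufferedReader reader) throws IOException {
        Map<String, Long> table = new HashMap<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int tab = line.lastIndexOf('\t');
            if (tab < 0) {
                continue;
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            table.put(word, count);
        }
        return table;
    }

    // Convenience wrapper for trying the parser on an in-memory string.
    static Map<String, Long> parseText(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return parse(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The map function would then look up each word in the returned map.
Each map task would build its own copy of the table this way, which
is exactly why every task ends up reading the same file.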

Are there any other options?

Thank you,
Rares
