Dear Hadoop Users, I am a newcomer to the Map-Reduce world. Please excuse my ignorance.
I have two Map-Reduce phases. The first phase is the WordCount example. In the second phase, besides the regular input data, the map function also needs the word-frequency table produced by the first phase. The word-frequency table is small enough to fit into memory. Moreover, the first phase uses only one reducer, so all of its output is in a single file in HDFS.

My question is: what options do I have to efficiently get the word-frequency table to the map function of the second phase?

One option is to access HDFS from the map function and read the file produced by the first Map-Reduce phase. More precisely, I would read the file in the "setup" function. With this option, the machine that stores this file would become a bottleneck, because when the second phase starts, all the map instances will access that machine to get the file.

Is there any way to overcome this bottleneck? Are there any other options?

Thank you,
Rares
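
P.S. To make the first option concrete, here is a rough sketch of the parsing I have in mind for "setup". It assumes the first phase writes the default text output, one word<TAB>count pair per line. The class and method names are placeholders of my own, not anything from Hadoop. In the real mapper the reader would wrap the stream returned by FileSystem.open() on the first phase's output file; I kept the sketch free of Hadoop classes so it can be tried on its own.

```java
import java.io.BufferedReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: loads a WordCount-style table into memory.
class WordFrequencyTable {

    // Parses lines of the form "word<TAB>count" into a map.
    // Assumption: counts are integers and the separator is a single tab.
    static Map<String, Long> parse(BufferedReader reader) {
        Map<String, Long> table = new HashMap<>();
        reader.lines().forEach(line -> {
            int tab = line.lastIndexOf('\t');
            if (tab <= 0) {
                return; // skip blank or malformed lines
            }
            String word = line.substring(0, tab);
            long count = Long.parseLong(line.substring(tab + 1).trim());
            table.put(word, count);
        });
        return table;
    }
}
```

The idea would be to call parse() once per map task in "setup", keep the resulting map in a field of the mapper, and look words up in it from the map function.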