Hi Rares,

Check out the Distributed Cache: http://wiki.apache.org/hadoop/FAQ#8
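
In case it helps, here is a rough sketch of the map-side half, not a definitive implementation. It assumes the first phase wrote its output in the default text format (one "word<TAB>count" pair per line) and that the framework has already copied that file to the task's local disk. The class name WordFrequencyTable and the method load are made up for illustration. In a real job the driver would register the HDFS file with DistributedCache.addCacheFile, and the mapper's setup code would get the local copy from DistributedCache.getLocalCacheFiles and pass it to something like this:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: loads the first phase's output into memory.
// Assumption: each line has the form "word<TAB>count".
public class WordFrequencyTable {

    // Reads a node-local copy of the word-frequency file into a map.
    // In a Hadoop job, localFile would be the path handed back by the
    // distributed cache rather than a path opened directly on HDFS.
    public static Map<String, Long> load(Path localFile) throws IOException {
        Map<String, Long> table = new HashMap<>();
        try (BufferedReader reader =
                 Files.newBufferedReader(localFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Split on the last tab so that the count is always the final field.
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue; // skip blank or malformed lines
                }
                String word = line.substring(0, tab);
                long count = Long.parseLong(line.substring(tab + 1).trim());
                table.put(word, count);
            }
        }
        return table;
    }
}
```

Since every task on a node reads the same local copy, the map instances no longer all go to the one machine that holds the file.
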
Thanks
-Todd

On Wed, May 27, 2009 at 9:24 PM, Rares Vernica <[email protected]> wrote:
> Dear Hadoop Users,
>
> I am a newcomer into the Map-Reduce world. Please excuse my ignorance.
>
> I have two Map-Reduce phases. The first phase is the WordCount
> example. In the second phase, besides the regular input data, the Map
> function also needs the word-frequency table produced by the first
> phase.
>
> Obviously, the word-frequency table is small enough to fit into
> memory. Moreover, the first phase uses only one reduce, so that all
> the data is in one file in HDFS.
>
> My question is, what options do I have to efficiently get the
> word-frequency table to the map function of the second phase?
>
> One option is to access the HDFS from the map function and read the
> file produced by the first Map-Reduce phase. More exactly, I would
> read the file in the "setup" function. For this option, the machine
> that stores this file would become a bottleneck as when the second
> phase starts all the map instances will access that machine to get the
> file. Is there any way to overcome this bottleneck?
>
> Are there any other options?
>
> Thank you,
> Rares
>
