(Addendum to my own post -- an identity mapper is probably not what you want. You'd actually want an inverting mapper that transforms (k, v) --> (v, k), to take advantage of the key-based sorting.)
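To see why the inversion helps, here is a minimal sketch in plain Python (illustrative only; `invert_mapper` and `top_n_reducer` are hypothetical names simulating the dataflow, not Hadoop classes). Once counts become the keys, the framework's sort-by-key phase orders records by frequency, and the reducer just takes the first N it sees.

```python
# Simulate the second MapReduce pass locally (not the Hadoop API).

def invert_mapper(word, count):
    # (word, count) -> (count, word), so sorting by key sorts by frequency
    yield (count, word)

def top_n_reducer(pairs, n=3):
    # The framework delivers pairs sorted by key; a descending sort here
    # stands in for that, putting the most frequent words first.
    for count, word in sorted(pairs, reverse=True)[:n]:
        yield (word, count)

# Pretend this is the output of the first (word count) job:
first_pass_output = {"the": 120, "cat": 7, "sat": 7, "mat": 3, "on": 42}

inverted = [kv for w, c in first_pass_output.items()
            for kv in invert_mapper(w, c)]
print(list(top_n_reducer(inverted)))
# -> [('the', 120), ('on', 42), ('sat', 7)]
```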
- Aaron

On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <[email protected]> wrote:

> Hi Jim,
>
> The ability to perform locking of shared mutable state is a distinct anti-goal of the MapReduce paradigm. One of the major benefits of writing MapReduce programs is knowing that you don't have to worry about deadlock in your code. If mappers could lock objects, the failure and restart semantics of individual tasks would be vastly more complicated. (What happens if a map task crashes after it obtains a lock? Does it automatically release the lock? Does some rollback mechanism undo everything that happened after the lock was acquired? How would that work if--by definition--the mapper node is no longer available?)
>
> A word frequency histogram function can certainly be written in MapReduce without such state. You've got the right intuition, but a serial program is not necessarily the best answer. Take the existing word count program, which converts bags of words into (word, count) pairs. Then run this through a second pass: an identity mapper feeds a set of combiners that each emit their 100 most frequent words, and a single reducer emits the 100 most frequent words among those obtained from the combiners.
>
> Many other, more complicated problems that seem to require shared state in truth only require a second (or (n+1)th) MapReduce pass. Adding multiple passes is a perfectly valid technique for building more complex dataflows.
>
> Cheers,
> - Aaron
>
> On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected]> wrote:
>
>> Hello,
>>
>> I was wondering if Hadoop provides thread-safe shared variables that can be accessed from individual mappers/reducers, along with a proper locking mechanism. To clarify, let's say that in the word count example I want to know which word has the highest frequency and how many times it occurred. I believe the latter can be done using the counters that come with the Hadoop framework, but I don't know how to get the word itself as a String. Of course, the problem can be more complicated, like finding the top 100 words.
>>
>> I thought of writing a serial program that goes over the final output of the word count, but this wouldn't be a good idea if the output file gets too large. However, if there were a way to define and use shared variables, this would be really easy to do on the fly during the word count's reduce phase.
>>
>> Thanks,
>> Jim
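The two-stage trimming described above (each combiner emits its local top N, and a single reducer merges those lists) can be sketched in plain Python as follows. The partition contents and N=3 are made-up assumptions; note the local trim is safe here only because the first job already produced exactly one record per word, so no word is split across partitions.

```python
import heapq

def local_top_n(counts, n):
    # Combiner: emit only the n most frequent (count, word) pairs
    # from one partition of the first job's output.
    return heapq.nlargest(n, ((c, w) for w, c in counts.items()))

def global_top_n(partials, n):
    # Single reducer: merge the per-combiner lists and keep the overall top n.
    merged = [pair for partial in partials for pair in partial]
    return [(w, c) for c, w in heapq.nlargest(n, merged)]

# Two partitions of the word-count output; words are disjoint across them.
partitions = [{"the": 120, "cat": 7},
              {"on": 42, "mat": 3, "sat": 7}]
partials = [local_top_n(p, 3) for p in partitions]
print(global_top_n(partials, 3))
# -> [('the', 120), ('on', 42), ('sat', 7)]
```

Because each combiner forwards at most N pairs, the lone reducer handles at most N times the number of combiners records, no matter how large the full output is.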
