Hmm. Check your math on the data set size. Your input corpus may be a few (dozen, hundred) TB, but how many distinct words are there? The output data set should be at least a thousand times smaller. If you've got the hardware to do that initial word count step on a few TB of data, the second pass will not be a major performance concern.
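[Editor's note: a quick back-of-envelope check of that ratio. The vocabulary size and record width below are illustrative guesses, not measurements from any real corpus.]

```python
# Back-of-envelope: size of the word-count output vs. the input corpus.
# All numbers here are illustrative assumptions, not measurements.
input_bytes = 10 * 10**12            # ~10 TB of raw text
distinct_words = 10 * 10**6          # generous vocabulary, incl. typos and junk tokens
bytes_per_record = 30                # one "word\tcount\n" line, roughly

output_bytes = distinct_words * bytes_per_record
print(output_bytes)                  # 300,000,000 -> ~300 MB
print(input_bytes / output_bytes)    # ~33,000x smaller than the input
```

Even with a very generous vocabulary estimate, the second pass runs over megabytes, not terabytes.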
MapReduce is, to borrow from a tired analogy, a lot like driving a freight train. The raw speed of any given algorithm on it might not sound impressive, but even if it's got a much higher constant factor of time associated with it, the ability to provide nearly-flat parallelism as your data set grows really large more than makes up for it in the long run.

- Aaron

On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky <[email protected]> wrote:

> Hello again,
>
> I think I found an answer to my question. If I write a new
> WritableComparable object that extends IntWritable and then override the
> compareTo method, I can change the sorting order from ascending to
> descending. That will solve my problem of getting the top 100 most
> frequent words at each combiner/reducer.
>
> Jim
>
> On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky <[email protected]> wrote:
>
> > Hi Aaron,
> >
> > Thanks for the advice. I actually thought of using multiple combiners
> > and a single reducer, but I was worried that the key sorting phase
> > would be a waste for my purpose. If the input is just a bunch of
> > (word, count) pairs on the order of terabytes, wouldn't sorting be
> > overkill? That's why I thought a single serial program might perform
> > better, but I'm not sure how long it would take to sort the keys in
> > such a case, so it is probably nothing beyond speculation and I should
> > go and give it a try to see how well it performs.
> >
> > Secondly, I didn't quite understand how I can take advantage of the
> > sorted keys if I use an inverting mapper that transforms (k,v) --> (v,k)
> > pairs. In both cases, the combiners and the single reducer will still
> > have to iterate over all the (v,k) pairs to find the top 100, right? Or
> > is there a way to say something like "Give me the last 100 keys" at
> > each reducer/combiner?
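[Editor's note: Jim's compareTo trick can be simulated in plain Python. This is only a sketch of the idea, not Hadoop code: once the (word, count) keys sort in descending count order, each combiner/reducer sees the biggest counts first and can simply stop after the first 100.]

```python
from itertools import islice

def top_n(pairs, n=100):
    """Simulate a combiner whose keys sort largest-count-first:
    the top n records are then just the first n it receives."""
    # sorted(..., reverse=True) plays the role of the custom
    # WritableComparable whose compareTo inverts the usual ascending order.
    ordered = sorted(pairs, key=lambda kv: kv[1], reverse=True)
    return list(islice(ordered, n))

counts = [("the", 900), ("of", 700), ("cat", 3), ("and", 650)]
print(top_n(counts, 2))  # [('the', 900), ('of', 700)]
```

This answers the "Give me the last 100 keys" question: with an inverted sort order there is no need to iterate over everything, only over the head of the stream.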
> >
> > Thanks in advance,
> > Jim
> >
> >
> > On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball <[email protected]> wrote:
> >
> >> (Addendum to my own post -- an identity mapper is probably not what you
> >> want. You'd actually want an inverting mapper that transforms (k, v) -->
> >> (v, k), to take advantage of the key-based sorting.)
> >>
> >> - Aaron
> >>
> >> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <[email protected]> wrote:
> >>
> >> > Hi Jim,
> >> >
> >> > The ability to perform locking of shared mutable state is a distinct
> >> > anti-goal of the MapReduce paradigm. One of the major benefits of
> >> > writing MapReduce programs is knowing that you don't have to worry
> >> > about deadlock in your code. If mappers could lock objects, then the
> >> > failure and restart semantics of individual tasks would be vastly
> >> > more complicated. (What happens if a map task crashes after it
> >> > obtains a lock? Does it automatically release the lock? Does some
> >> > rollback mechanism undo everything that happened after the lock was
> >> > acquired? How would that work if--by definition--the mapper node is
> >> > no longer available?)
> >> >
> >> > A word frequency histogram function can certainly be written in
> >> > MapReduce without such state. You've got the right intuition, but a
> >> > serial program is not necessarily the best answer. Take the existing
> >> > word count program. This converts bags of words into (word, count)
> >> > pairs. Then pass this through a second pass, via an identity mapper
> >> > to a set of combiners that each emit the 100 most frequent words, to
> >> > a single reducer that emits the 100 most frequent words obtained by
> >> > the combiners.
> >> >
> >> > Many other more complicated problems which seem to require shared
> >> > state in truth only require a second (or (n+1)th) MapReduce pass.
> >> > Adding multiple passes is a very valid technique for building more
> >> > complex dataflows.
> >> >
> >> > Cheers,
> >> > - Aaron
> >> >
> >> >
> >> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> I was wondering if Hadoop provides thread-safe shared variables that
> >> >> can be accessed from individual mappers/reducers along with a proper
> >> >> locking mechanism. To clarify things, let's say that in the word
> >> >> count example, I want to know the word that has the highest
> >> >> frequency and how many times it occurred. I believe that the latter
> >> >> can be done using the counters that come with the Hadoop framework,
> >> >> but I don't know how to get the word itself as a String. Of course,
> >> >> the problem can be more complicated, like the top 100 words or so.
> >> >>
> >> >> I thought of writing a serial program which can go over the final
> >> >> output of the word count, but this wouldn't be a good idea if the
> >> >> output file gets too large. However, if there is a way to define and
> >> >> use shared variables, this would be really easy to do on the fly
> >> >> during the word count's reduce phase.
> >> >>
> >> >> Thanks,
> >> >> Jim
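[Editor's note: the whole two-pass dataflow Aaron describes can be sketched in plain Python. This is a simulation of the MapReduce steps, not actual Hadoop code; the cutoff of 2 and the sample documents are placeholders for 100 and a real corpus.]

```python
from collections import Counter
from heapq import nlargest

def word_count(docs):
    """Pass 1: the classic word count, producing (word, count) pairs."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.split())
    return list(counts.items())

def invert(pairs):
    """Pass 2 mapper: (word, count) -> (count, word), so the framework's
    key-based sort orders records by frequency rather than alphabetically."""
    return [(c, w) for w, c in pairs]

def top_k(pairs, k):
    """Combiner and reducer do the same thing: keep only the k most
    frequent entries, so the single reducer sees at most k per combiner."""
    return nlargest(k, pairs)

docs = ["a b a", "a c b", "a b"]
pairs = invert(word_count(docs))
# Pretend two combiners each saw half the pairs; one reducer merges them.
half = len(pairs) // 2
combined = top_k(pairs[:half], 2) + top_k(pairs[half:], 2)
print(top_k(combined, 2))  # [(4, 'a'), (3, 'b')]
```

The merge step is correct because any globally top-k word is necessarily in the local top k of whichever combiner saw it, so no candidate is lost between the passes.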
