Aaron, I actually do something different than word count. I count all possible phrases for every sentence in my corpus. So for instance, if I have a sentence like "Hello world", my mappers emit:
Hello 1
World 1
Hello World 1

As you can see, for longer sentences the number of intermediate records grows much faster than the original input size. Anyway, I did what I said last week based on your previous replies and it worked well. Thank you for the advice.

Jim

On Wed, Dec 31, 2008 at 4:06 AM, Aaron Kimball <[email protected]> wrote:

> Hmm. Check your math on the data set size. Your input corpus may be a few
> (dozen, hundred) TB, but how many distinct words are there? The output
> data set should be at least a thousand times smaller. If you've got the
> hardware to do that initial word count step on a few TB of data, the
> second pass will not be a major performance concern.
>
> MapReduce is, to borrow from a tired analogy, a lot like driving a
> freight train. The raw speed of any given algorithm on it might not sound
> impressive, but even if it's got a much higher constant factor of time
> associated with it, the ability to provide nearly-flat parallelism as
> your data set grows really large more than makes up for it in the long
> run.
>
> - Aaron
>
> On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky <[email protected]> wrote:
>
> > Hello again,
> >
> > I think I found an answer to my question. If I write a new
> > WritableComparable object that extends IntWritable and then override
> > the compareTo method, I can change the sorting order from ascending to
> > descending. That will solve my problem of getting the top 100 most
> > frequent words at each combiner/reducer.
> >
> > Jim
> >
> > On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky <[email protected]> wrote:
> >
> > > Hi Aaron,
> > >
> > > Thanks for the advice. I actually thought of using multiple combiners
> > > and a single reducer, but I was worried that the key sorting phase
> > > would be a waste for my purpose. If the input is just a bunch of
> > > (word, count) pairs on the order of terabytes, wouldn't sorting be
> > > overkill?
> > > That's why I thought a single serial program might perform better,
> > > but I'm not sure how long it would take to sort the keys in such a
> > > case, so it is probably nothing beyond speculation and I should go
> > > and give it a try to see how well it performs.
> > >
> > > Secondly, I didn't quite understand how I can take advantage of the
> > > sorted keys if I use an inverting mapper that transforms (k, v) -->
> > > (v, k) pairs. In both cases, the combiners and the single reducer
> > > will still have to iterate over all the (v, k) pairs to find the top
> > > 100, right? Or is there a way to say something like "Give me the last
> > > 100 keys" at each reducer/combiner?
> > >
> > > Thanks in advance,
> > > Jim
> > >
> > > On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball <[email protected]>
> > > wrote:
> > >
> > >> (Addendum to my own post -- an identity mapper is probably not what
> > >> you want. You'd actually want an inverting mapper that transforms
> > >> (k, v) --> (v, k), to take advantage of the key-based sorting.)
> > >>
> > >> - Aaron
> > >>
> > >> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <[email protected]>
> > >> wrote:
> > >>
> > >> > Hi Jim,
> > >> >
> > >> > The ability to perform locking of shared mutable state is a
> > >> > distinct anti-goal of the MapReduce paradigm. One of the major
> > >> > benefits of writing MapReduce programs is knowing that you don't
> > >> > have to worry about deadlock in your code. If mappers could lock
> > >> > objects, then the failure and restart semantics of individual
> > >> > tasks would be vastly more complicated. (What happens if a map
> > >> > task crashes after it obtains a lock? Does it automatically
> > >> > release the lock? Does some rollback mechanism undo everything
> > >> > that happened after the lock was acquired? How would that work
> > >> > if--by definition--the mapper node is no longer available?)
> > >> >
> > >> > A word frequency histogram function can certainly be written in
> > >> > MapReduce without such state. You've got the right intuition, but
> > >> > a serial program is not necessarily the best answer. Take the
> > >> > existing word count program. This converts bags of words into
> > >> > (word, count) pairs. Then pass this through a second pass, via an
> > >> > identity mapper to a set of combiners that each emit the 100 most
> > >> > frequent words, to a single reducer that emits the 100 most
> > >> > frequent words obtained by the combiners.
> > >> >
> > >> > Many other more complicated problems which seem to require shared
> > >> > state, in truth, only require a second (or n+1'th) MapReduce pass.
> > >> > Adding multiple passes is a very valid technique for building more
> > >> > complex dataflows.
> > >> >
> > >> > Cheers,
> > >> > - Aaron
> > >> >
> > >> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected]>
> > >> > wrote:
> > >> >
> > >> >> Hello,
> > >> >>
> > >> >> I was wondering if Hadoop provides thread-safe shared variables
> > >> >> that can be accessed from individual mappers/reducers along with
> > >> >> a proper locking mechanism. To clarify things, let's say that in
> > >> >> the word count example, I want to know the word that has the
> > >> >> highest frequency and how many times it occurred. I believe that
> > >> >> the latter can be done using the counters that come with the
> > >> >> Hadoop framework, but I don't know how to get the word itself as
> > >> >> a String. Of course, the problem can be more complicated, like
> > >> >> the top 100 words or so.
> > >> >>
> > >> >> I thought of writing a serial program which can go over the final
> > >> >> output of the word count, but this wouldn't be a good idea if the
> > >> >> output file gets too large.
> > >> >> However, if there is a way to define and use shared variables,
> > >> >> this would be really easy to do on the fly during the word
> > >> >> count's reduce phase.
> > >> >>
> > >> >> Thanks,
> > >> >> Jim
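For readers following the thread: Jim's phrase-counting map logic at the top (every contiguous run of words emitted with a count of 1) can be sketched as a plain-Java helper, leaving out the Hadoop Mapper plumbing. The class and method names here are illustrative, not from the thread:

```java
import java.util.ArrayList;
import java.util.List;

public class PhraseEmitter {
    // Returns every contiguous run of words ("phrase") in the sentence --
    // the keys Jim's mapper emits, each paired with a count of 1. A
    // sentence of n words yields n*(n+1)/2 phrases, which is why the
    // intermediate data grows much faster than the input.
    static List<String> phrases(String sentence) {
        String[] words = sentence.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            StringBuilder sb = new StringBuilder();
            for (int end = start; end < words.length; end++) {
                if (end > start) sb.append(' ');
                sb.append(words[end]);
                out.add(sb.toString());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(phrases("Hello World"));
        // prints [Hello, Hello World, World]
    }
}
```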
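The descending-sort trick from Jim's Dec 25 message boils down to negating the natural compareTo. A plain-Java analogue of that trick (the class name DescendingInt is mine; in Hadoop the same override would go on an IntWritable subclass, and you would typically also install it as the job's output key comparator so the shuffle sort actually uses it):

```java
import java.util.Arrays;

// Subclass a comparable type and invert its natural ordering so that a
// standard ascending sort yields descending order.
public class DescendingInt implements Comparable<DescendingInt> {
    final int value;

    DescendingInt(int value) { this.value = value; }

    @Override
    public int compareTo(DescendingInt other) {
        // Reverse of the natural ascending comparison.
        return Integer.compare(other.value, this.value);
    }

    @Override
    public String toString() { return Integer.toString(value); }

    public static void main(String[] args) {
        DescendingInt[] counts = {
            new DescendingInt(3), new DescendingInt(42), new DescendingInt(7)
        };
        Arrays.sort(counts);
        System.out.println(Arrays.toString(counts)); // prints [42, 7, 3]
    }
}
```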
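Aaron's second-pass design (each combiner emits its local top 100, a single reducer merges them) reduces to a bounded top-N selection, which also answers Jim's worry about iterating over all the pairs: a min-heap capped at N entries never needs a full sort. A sketch of that selection step, with hypothetical names:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class TopWords {
    // Keep only the n highest-count (word, count) pairs. The heap is
    // ordered by ascending count, so its head is always the least
    // frequent entry retained so far; evicting it whenever the heap
    // grows past n is exactly the pruning each combiner (and finally
    // the single reducer) performs in Aaron's two-pass design.
    static List<Map.Entry<String, Integer>> topN(Map<String, Integer> counts, int n) {
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(Map.Entry.comparingByValue());
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            heap.offer(e);
            if (heap.size() > n) {
                heap.poll(); // discard the least frequent entry seen so far
            }
        }
        List<Map.Entry<String, Integer>> out = new ArrayList<>(heap);
        out.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return out;
    }
}
```

Because the heap never holds more than n entries, memory stays constant no matter how many (word, count) pairs stream through, which is what makes the "emit the local top 100" combiner step cheap.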
