Hmm. Check your math on the data set size. Your input corpus may be a few
(dozen, hundred) TB, but how many distinct words are there? The output data
set should be at least a thousand times smaller. If you've got the hardware
to do that initial word count step on a few TB of data, the second pass will
not be a major performance concern.
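To make that concrete, here's a rough back-of-the-envelope sketch in Java. The corpus size, vocabulary size, and per-record size are all assumed round numbers for illustration, not measurements:

```java
// Back-of-the-envelope estimate: how big is the (word, count) output of the
// first word-count pass compared to the raw corpus? All figures are assumed
// round numbers, not measurements from a real job.
public class OutputSizeEstimate {
    // Ratio of raw corpus size to the size of the (word, count) output.
    public static long ratio(long corpusBytes, long distinctWords, long bytesPerRecord) {
        return corpusBytes / (distinctWords * bytesPerRecord);
    }

    public static void main(String[] args) {
        long corpusBytes = 5L * 1024 * 1024 * 1024 * 1024; // assume a 5 TB corpus
        long distinctWords = 10_000_000L;                  // assume ~10M distinct tokens
        long bytesPerRecord = 24L;                         // word text + count, rough average

        long outputBytes = distinctWords * bytesPerRecord; // ~240 MB
        System.out.println("output bytes: " + outputBytes);
        System.out.println("corpus/output ratio: "
            + ratio(corpusBytes, distinctWords, bytesPerRecord));
    }
}
```

Even with a generous vocabulary estimate, the ratio comes out well above a thousand to one.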

MapReduce is, to borrow from a tired analogy, a lot like driving a freight
train. The raw speed of any given algorithm on it might not sound
impressive, but even if it's got a much higher constant factor of time
associated with it, the ability to provide nearly-flat parallelism as your
data set grows really large more than makes up for it in the long run.
- Aaron

On Thu, Dec 25, 2008 at 2:22 AM, Jim Twensky <[email protected]> wrote:

> Hello again,
>
> I think I found an answer to my question. If I write a new
> WritableComparable object that extends IntWritable and then override the
> compareTo method, I can change the sorting order from ascending to
> descending. That will solve my problem of getting the top 100 most
> frequent words at each combiner/reducer.
>
> Jim
>
> On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky <[email protected]>
> wrote:
>
> > Hi Aaron,
> >
> > Thanks for the advice. I actually thought of using multiple combiners
> > and a single reducer, but I was worried that the key sorting phase would
> > be a waste for my purpose. If the input is just a bunch of (word, count)
> > pairs on the order of terabytes, wouldn't sorting be overkill? That's
> > why I thought a single serial program might perform better, but I'm not
> > sure how long it would take to sort the keys in such a case, so it's
> > probably nothing beyond speculation and I should go and give it a try to
> > see how well it performs.
> >
> > Secondly, I didn't quite understand how I can take advantage of the
> > sorted keys if I use an inverting mapper that transforms (k,v) --> (v,k)
> > pairs. In both cases, the combiners and the single reducer will still
> > have to iterate over all the (v,k) pairs to find the top 100, right? Or
> > is there a way to say something like "Give me the last 100 keys" at each
> > reducer/combiner?
> >
> > Thanks in advance,
> > Jim
> >
> >
> > On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball <[email protected]> wrote:
> >
> >> (Addendum to my own post -- an identity mapper is probably not what you
> >> want. You'd actually want an inverting mapper that transforms (k, v) -->
> >> (v, k), to take advantage of the key-based sorting.)
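A tiny standalone sketch of that inversion, in plain Java rather than the Hadoop Mapper API -- the helper name here is made up for illustration:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Sketch of an inverting "mapper": (word, count) -> (count, word), so that
// the framework's key-based sort orders records by count rather than by
// word. Plain Java, not the Hadoop API; the helper name is illustrative.
public class InvertPair {
    public static Map.Entry<Integer, String> invert(String word, int count) {
        return new SimpleEntry<>(count, word);
    }

    public static void main(String[] args) {
        Map.Entry<Integer, String> e = invert("hadoop", 1500);
        // The count is now the key, so the shuffle sorts on it.
        System.out.println(e.getKey() + "\t" + e.getValue());
    }
}
```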
> >>
> >> - Aaron
> >>
> >> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <[email protected]>
> >> wrote:
> >>
> >> > Hi Jim,
> >> >
> >> > The ability to perform locking of shared mutable state is a distinct
> >> > anti-goal of the MapReduce paradigm. One of the major benefits of
> >> > writing MapReduce programs is knowing that you don't have to worry
> >> > about deadlock in your code. If mappers could lock objects, then the
> >> > failure and restart semantics of individual tasks would be vastly
> >> > more complicated. (What happens if a map task crashes after it
> >> > obtains a lock? Does it automatically release the lock? Does some
> >> > rollback mechanism undo everything that happened after the lock was
> >> > acquired? How would that work if--by definition--the mapper node is
> >> > no longer available?)
> >> >
> >> > A word frequency histogram function can certainly be written in
> >> > MapReduce without such state. You've got the right intuition, but a
> >> > serial program is not necessarily the best answer. Take the existing
> >> > word count program. This converts bags of words into (word, count)
> >> > pairs. Then pass this through a second pass, via an identity mapper
> >> > to a set of combiners that each emit the 100 most frequent words, to
> >> > a single reducer that emits the 100 most frequent words obtained by
> >> > the combiners.
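A standalone sketch of what each combiner (and the final reducer) would do in that second pass: a bounded min-heap keeps memory at O(N) no matter how many pairs stream through. Plain Java with made-up names, not the Hadoop API:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the top-N step each combiner/reducer performs in the second
// pass: keep a min-heap of at most N (word, count) pairs and evict the
// smallest, so only the N largest counts survive the stream.
public class TopN {
    static class WordCount {
        final String word;
        final long count;
        WordCount(String word, long count) { this.word = word; this.count = count; }
    }

    public static List<WordCount> topN(Iterable<WordCount> pairs, int n) {
        PriorityQueue<WordCount> heap =
            new PriorityQueue<>(Comparator.comparingLong((WordCount wc) -> wc.count));
        for (WordCount wc : pairs) {
            heap.offer(wc);
            if (heap.size() > n) {
                heap.poll(); // evict the current smallest of the candidates
            }
        }
        List<WordCount> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingLong((WordCount wc) -> wc.count).reversed());
        return result;
    }

    public static void main(String[] args) {
        List<WordCount> pairs = List.of(
            new WordCount("the", 9000), new WordCount("hadoop", 1500),
            new WordCount("of", 7000), new WordCount("mapreduce", 800));
        for (WordCount wc : topN(pairs, 2)) {
            System.out.println(wc.word + "\t" + wc.count);
        }
    }
}
```

The single reducer then runs the same selection over the (at most) 100 pairs emitted by each combiner.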
> >> >
> >> > Many other more complicated problems which seem to require shared
> >> > state, in truth, only require a second (or n+1'th) MapReduce pass.
> >> > Adding multiple passes is a very valid technique for building more
> >> > complex dataflows.
> >> >
> >> > Cheers,
> >> > - Aaron
> >> >
> >> >
> >> >
> >> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> I was wondering if Hadoop provides thread-safe shared variables that
> >> >> can be accessed from individual mappers/reducers along with a proper
> >> >> locking mechanism. To clarify things, let's say that in the word
> >> >> count example, I want to know the word that has the highest frequency
> >> >> and how many times it occurred. I believe that the latter can be done
> >> >> using the counters that come with the Hadoop framework, but I don't
> >> >> know how to get the word itself as a String. Of course, the problem
> >> >> can be more complicated, like the top 100 words or so.
> >> >>
> >> >> I thought of writing a serial program which can go over the final
> >> >> output of the word count, but this wouldn't be a good idea if the
> >> >> output file gets too large. However, if there is a way to define and
> >> >> use shared variables, this would be really easy to do on the fly
> >> >> during the word count's reduce phase.
> >> >>
> >> >> Thanks,
> >> >> Jim
> >> >>
> >> >
> >> >
> >>
> >
> >
>
