Hello again,

I think I found an answer to my question. If I write a new
WritableComparable object that extends IntWritable and then overwrite the
compareTo method, I can change the sorting order from ascending to
descending. That will solve my problem for getting the top 100 most frequent
words at each combiner/reducer.

Jim

On Wed, Dec 24, 2008 at 12:19 PM, Jim Twensky <[email protected]> wrote:

> Hi Aaron,
>
> Thanks for the advice. I actually thought of using multiple combiners and a
> single reducer but I was worried about the key sorting phase to be a vaste
> for my purpose. If the input is just a bunch of (word,count) pairs which is
> in the order of TeraBytes, wouldn't sorting be an overkill? That's why I
> thought a single serial program might perform better but I'm not sure how
> long it would take to sort the keys in such a case so probably it is nothing
> beyond speculation and I should go and give it a try to see how well it
> performs.
>
> Secondly, I didn't quite understand how I can take advantage of the sorted
> keys if I use an inverting mapper that transforms (k,v) --> (v,k) pairs. In
> both cases, the combiners and the single reducer will still have to iterate
> over all the (v,k) pairs to find the top 100 right? Or is there a way to say
> something like "Give me the last 100 keys" at each reducer/combiner?
>
> Thanks in advance,
> Jim
>
>
> On Wed, Dec 24, 2008 at 3:44 AM, Aaron Kimball <[email protected]> wrote:
>
>> (Addendum to my own post -- an identity mapper is probably not what you
>> want. You'd actually want an inverting mapper that transforms (k, v) -->
>> (v,
>> k), to take advantage of the key-based sorting.)
>>
>> - Aaron
>>
>> On Wed, Dec 24, 2008 at 4:43 AM, Aaron Kimball <[email protected]>
>> wrote:
>>
>> > Hi Jim,
>> >
>> > The ability to perform locking of shared mutable state is a distinct
>> > anti-goal of the MapReduce paradigm. One of the major benefits of
>> writing
>> > MapReduce programs is knowing that you don't have to worry about
>> deadlock in
>> > your code. If mappers could lock objects, then the failure and restart
>> > semantics of individual tasks would be vastly more complicated. (What
>> > happens if a map task crashes after it obtains a lock? Does it
>> automatically
>> > release the lock? Does some rollback mechanism undo everything that
>> happened
>> > after the lock was acquired? How would that work if--by definition--the
>> > mapper node is no longer available?)
>> >
>> > A word frequency histogram function can certainly be written in
>> MapReduce
>> > without such state. You've got the right intuition, but a serial program
>> is
>> > not necessarily the best answer. Take the existing word count program.
>> This
>> > converts bags of words into (word, count) pairs. Then pass this through
>> a
>> > second pass, via an identity mapper to a set of combiners that each emit
>> the
>> > 100 most frequent words, to a single reducer that emits the 100 most
>> > frequent words obtained by the combiners.
>> >
>> > Many other more complicated problems which seem to require shared state,
>> in
>> > truth, only require a second (or n+1'th) MapReduce pass. Adding multiple
>> > passes is a very valid technique for building more complex dataflows.
>> >
>> > Cheers,
>> > - Aaron
>> >
>> >
>> >
>> > On Wed, Dec 24, 2008 at 3:28 AM, Jim Twensky <[email protected]
>> >wrote:
>> >
>> >> Hello,
>> >>
>> >> I was wondering if Hadoop provides thread safe shared variables that
>> can
>> >> be
>> >> accessed from individual mappers/reducers along with a proper locking
>> >> mechanism. To clarify things, let's say that in the word count example,
>> I
>> >> want to know the word that has the highest frequency and how many times
>> it
>> >> occured. I believe that the latter can be done using the counters that
>> >> come
>> >> with the Hadoop framework but I don't know how to get the word itself
>> as a
>> >> String. Of course, the problem can be more complicated like the top 100
>> >> words or so.
>> >>
>> >> I thought of writing a serial program which can go over the final
>> output
>> >> of
>> >> the word count but this wouldn't be a good idea if the output file gets
>> >> too
>> >> large. However, if there is a way to define and use shared variables,
>> this
>> >> would be really easy to do on the fly during the word count's reduce
>> >> phase.
>> >>
>> >> Thanks,
>> >> Jim
>> >>
>> >
>> >
>>
>
>

Reply via email to