Thanks, Aaron.

I have a question about my situation. My first Map/Reduce pass finishes all the 
work I need. The output results are divided into 2 groups, and each result has 
a priority within its group.

What I was thinking was to apply some algorithm that picks the final output from 
each of these groups according to its priority.

A second map pass would definitely help separate them into the 2 groups, but I 
don't know where the best place is to put my algorithm, which sorts each group 
and then picks the results (my algorithm has to consider both groups at the 
same time).

It looks like there is no place in Hadoop that can deal with all the output in 
one function at the same time; a map/reduce function only sees a certain 'key' 
at one time.

Does anyone have any ideas?

Thanks,

-Kun

--- On Mon, 6/15/09, Aaron Kimball <[email protected]> wrote:

> From: Aaron Kimball <[email protected]>
> Subject: Re: Could I collect results from Map-Reduce then output myself ?
> To: [email protected]
> Date: Monday, June 15, 2009, 4:08 PM
> If you can make the decision locally,
> then it should just be performed in
> the reducer itself:
> 
> if (guard) {
>   output.collect(k, v);
> }
> 
> 
> If you need to know what results will be generated by other
> calls to
> reduce() on that same machine, then you'll need to be a bit
> more clever. If
> you know that for all jobs you'll run, your results will
> always fit in a
> buffer in RAM, then you can put your values in an ArrayList
> or something and
> then override Reducer.close() to dump your values into the
> output collector.
> Then call super.close().
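
[A minimal plain-Java sketch of the buffer-then-flush pattern Aaron describes. Hadoop's Reducer/OutputCollector types are replaced with ordinary Java so the logic stands alone; the class and method names (BufferingReducerSketch, PriorityRecord) are illustrative, not part of the Hadoop API.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class BufferingReducerSketch {
    public static class PriorityRecord {
        public final String key;
        public final int priority;
        public PriorityRecord(String key, int priority) {
            this.key = key;
            this.priority = priority;
        }
    }

    private final List<PriorityRecord> buffer = new ArrayList<>();

    // Stands in for reduce(): instead of calling output.collect()
    // immediately, remember the record in RAM.
    public void reduce(String key, int priority) {
        buffer.add(new PriorityRecord(key, priority));
    }

    // Stands in for Reducer.close(): every record this reducer saw is now
    // visible at once, so a global decision (e.g. keep the top n by
    // priority) is possible before emitting anything.
    public List<PriorityRecord> close(int topN) {
        buffer.sort(Comparator
                .comparingInt((PriorityRecord r) -> r.priority)
                .reversed());
        return buffer.subList(0, Math.min(topN, buffer.size()));
    }
}
```

[In a real job, close() would loop over the survivors calling output.collect() and then call super.close(); this only works while one reducer's records fit in memory, as Aaron notes.]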
> 
> If you may need to generate more data than will fit in RAM,
> or you need the
> results of multiple nodes to conference together, then this
> means you almost
> certainly want a second MapReduce pass. Your first pass
> should collect() all
> the results it generates. Then in a second pass, use an
> identity mapper that
> causes the shuffler to sort the data along some axis so
> that the most
> desirable data comes first. Then output.collect() this data
> a second time in
> the second reducer, discarding the data that doesn't meet
> your criterion.
> The input path to your second MR is the output path from
> the first one.
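
[The "sort the data along some axis" step in the second pass can be done with a composite key of (group, priority). A plain-Java sketch of that ordering and the pick-the-best reducer logic follows; Hadoop's shuffle and comparator machinery is simulated with a Comparator, and all names (GroupedRecord, pickBestPerGroup) are illustrative.]

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondPassSortSketch {
    public static class GroupedRecord {
        public final int group;     // which of the two groups this record belongs to
        public final int priority;  // higher means more desirable
        public final String value;
        public GroupedRecord(int group, int priority, String value) {
            this.group = group;
            this.priority = priority;
            this.value = value;
        }
    }

    // Order records by group, then by descending priority, so within each
    // group the most desirable record arrives at the reducer first.
    public static final Comparator<GroupedRecord> SHUFFLE_ORDER =
        Comparator.comparingInt((GroupedRecord r) -> r.group)
                  .thenComparing(Comparator
                      .comparingInt((GroupedRecord r) -> r.priority)
                      .reversed());

    // Stands in for the second reducer: keep only the first (i.e. best)
    // record of each group and discard the rest.
    public static List<GroupedRecord> pickBestPerGroup(List<GroupedRecord> records) {
        List<GroupedRecord> sorted = new ArrayList<>(records);
        sorted.sort(SHUFFLE_ORDER);
        List<GroupedRecord> best = new ArrayList<>();
        int lastGroup = Integer.MIN_VALUE;
        for (GroupedRecord r : sorted) {
            if (r.group != lastGroup) {
                best.add(r);
                lastGroup = r.group;
            }
        }
        return best;
    }
}
```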
> 
> - Aaron
> 
> On Sun, Jun 14, 2009 at 4:02 PM, Kunsheng Chen <[email protected]>
> wrote:
> 
> >
> > Hi everyone,
> >
> > I am doing a map-reduce program, and it is working well.
> >
> > Now I am thinking of inserting my own algorithm to
> pick the output results
> > after 'Reduce', rather than simply using
> 'output.collect()' in Reduce to output
> > all results.
> >
> > The only thing I could think of is to read the output file
> after the JobClient
> > finishes and process it with a separate Java program, but I
> am not sure whether
> > there is a more efficient method provided by Hadoop to
> handle that.
> >
> >
> > Any idea is well appreciated,
> >
> > -Kun
> >
> >
> >
> >
> 
