To add to Todd/Ted's wise words, the Hadoop (and MapReduce) architects didn't impose this limitation just for fun; it is core to making Hadoop as reliable as it is. If the reducer started processing mapper output immediately and a specific mapper then failed, the reducer would have to know how to undo the specific pieces of work related to that failed mapper, which is not trivial at all. That said, combiners do achieve a bit of this for you: they start working immediately on the map output, but on a per-mapper basis (not globally), so failure is easy to handle in that case (you just redo that mapper and the combining for it).
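As a concrete (and purely illustrative) example of that per-mapper behavior, here is the classic word-count wiring, a minimal sketch assuming the org.apache.hadoop.mapreduce API; the class names are mine, and the only point is the setCombinerClass() line, which applies the summing logic to each mapper's output independently, so a failed map task only invalidates its own combined output:

// Minimal word-count sketch showing how a combiner is wired up.
// The reducer class doubles as the combiner because summing is
// associative and commutative, so it is safe to apply per mapper
// before the global reduce.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountWithCombiner {

  public static class TokenMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Emit (token, 1) for every whitespace-separated token in the line.
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum the counts for this key; used both as combiner and reducer.
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCountWithCombiner.class);
    job.setMapperClass(TokenMapper.class);
    // Per-mapper partial aggregation: if a map task fails, only its own
    // combined output is discarded and that one task is rerun.
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}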

-- amr

Ted Dunning wrote:
I would consider this to be a very delicate optimization with little utility
in the real world.  It is very, very rare to reliably know how many records
the reducer will see.  Getting this wrong would be a disaster.  Getting it
right would be very difficult in almost all cases.

Moreover, this assumption is baked all through the map-reduce design, and
thus making a change that allows the reduce to go ahead early is likely to be
really tricky (not that I know this for a fact).


On Mon, Jul 6, 2009 at 11:14 AM, Naresh Rapolu <nareshreddy.rap...@gmail.com> wrote:

My aim is to let the reduce move ahead with reduction as soon as it gets the
data it requires, instead of waiting for all the maps to complete.  If it
knows how many records it needs and compares that with the number of records
it has received so far, it can move on once the two become equal, without
waiting for all the maps to finish.
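To make the idea concrete, here is a toy, self-contained sketch (plain Java; the names are made up and this is not how the real ReduceTask works) of the bookkeeping I have in mind:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy illustration: the reduce side is told up front how many records it
// should receive, counts what has arrived so far, and kicks off the
// reduction as soon as the two match, rather than waiting for the
// "all maps finished" signal.
public class EarlyReduceSketch {

  private final long expectedRecords;   // assumed to be known ahead of time
  private long receivedRecords = 0;
  private final Map<String, List<Integer>> buffered =
      new TreeMap<String, List<Integer>>();

  EarlyReduceSketch(long expectedRecords) {
    this.expectedRecords = expectedRecords;
  }

  // Called once for every intermediate record fetched from some map task.
  synchronized void onRecord(String key, int value) {
    List<Integer> values = buffered.get(key);
    if (values == null) {
      values = new ArrayList<Integer>();
      buffered.put(key, values);
    }
    values.add(value);
    receivedRecords++;
    if (receivedRecords == expectedRecords) {
      reduceAll();   // start reducing without waiting on the remaining maps
    }
  }

  // Plain summing "reduce" over everything buffered so far.
  private void reduceAll() {
    for (Map.Entry<String, List<Integer>> entry : buffered.entrySet()) {
      int sum = 0;
      for (Integer v : entry.getValue()) {
        sum += v.intValue();
      }
      System.out.println(entry.getKey() + "\t" + sum);
    }
  }

  public static void main(String[] args) {
    EarlyReduceSketch sketch = new EarlyReduceSketch(3);
    sketch.onRecord("hadoop", 1);
    sketch.onRecord("reduce", 1);
    sketch.onRecord("hadoop", 1);       // the third record triggers the reduce
  }
}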

