Re: Why mergeParts() is not parallel with collect() on map?

Owen O'Malley Tue, 03 May 2011 08:43:41 -0700

On Tue, May 3, 2011 at 1:48 AM, elton sky <eltonsky9...@gmail.com> wrote:


> Pls correct me if I am wrong. One of the important assumptions of hadoop
> map reduce is: map's output should be smaller than input.


No, that isn't a valid assumption. MapReduce workloads can roughly be
divided into three categories:
1. scans (map input > shuffle data)
2. sorts (map input = shuffle data = output data)
3. index builds (map input < shuffle data)

Scans are the most common, but far from the only case.
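
As a rough illustration of the three categories above, here is a minimal
sketch that labels a job by comparing its map input size with its shuffle
size. The function name classify_job, its parameters, and the 10% tolerance
used to treat the two sizes as "equal" are all hypothetical, chosen only for
this example. It assumes you already have the two byte counts from somewhere
(for instance, from the job's counters), and it ignores the output-size part
of the sort case.

```python
def classify_job(map_input_bytes: int, shuffle_bytes: int,
                 tolerance: float = 0.1) -> str:
    """Label a job as 'scan', 'sort', or 'index build'.

    Hypothetical helper: the tolerance is an assumed fudge factor so that
    sizes within 10% of each other count as "equal" (the sort case).
    """
    if map_input_bytes <= 0:
        raise ValueError("map_input_bytes must be positive")

    # Ratio of data sent through the shuffle to data read by the maps.
    ratio = shuffle_bytes / map_input_bytes

    if ratio < 1 - tolerance:
        return "scan"          # map input > shuffle data
    if ratio > 1 + tolerance:
        return "index build"   # map input < shuffle data
    return "sort"              # map input = shuffle data (roughly)


if __name__ == "__main__":
    # Made-up sizes, purely for demonstration.
    for name, inp, shuf in [("filter", 100, 5),
                            ("sort", 100, 100),
                            ("inverted index", 100, 300)]:
        print(name, "->", classify_job(inp, shuf))
```

The point of the sketch is only that the ratio can fall on either side of 1,
so code on the map side should not be designed around shrinking output.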

-- Owen
