On Tue, May 3, 2011 at 1:48 AM, elton sky <eltonsky9...@gmail.com> wrote:
> Pls correct me if I am wrong. One of the important assumptions of hadoop
> map reduce is: map's output should be smaller than input.

No, that isn't a valid assumption. MapReduce workloads can roughly be divided into three categories:

1. scans (map input > shuffle data)
2. sorts (map input = shuffle data = output data)
3. index builds (map input < shuffle data)

Scans are the most common, but far from the only case.

-- Owen
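
To make the three categories concrete, here is a rough sketch that labels a job by comparing its map input size to its shuffle size. The function name, the byte-count arguments, and the 10% tolerance for treating the two sizes as "equal" are all assumptions made for illustration; none of this is a Hadoop API, and it ignores the output-size condition for sorts.

```python
def classify_workload(map_input_bytes, shuffle_bytes, tolerance=0.1):
    """Label a job as a scan, sort, or index build.

    Hypothetical helper: compares shuffle size to map input size.
    `tolerance` is an assumed band around 1.0 within which the two
    sizes are treated as roughly equal.
    """
    if map_input_bytes <= 0:
        raise ValueError("map_input_bytes must be positive")

    ratio = shuffle_bytes / map_input_bytes
    if ratio < 1 - tolerance:
        return "scan"         # map input > shuffle data
    if ratio > 1 + tolerance:
        return "index build"  # map input < shuffle data
    return "sort"             # map input roughly equal to shuffle data


if __name__ == "__main__":
    # Made-up byte counts, purely for illustration.
    for name, inp, shuf in [("filter job", 1000, 10),
                            ("sort job", 1000, 1000),
                            ("indexing job", 1000, 5000)]:
        print(name, "->", classify_workload(inp, shuf))
```

In practice the two sizes would come from a job's own counters, and the right threshold depends on the workload.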