In principle, distribute by X ensures that each of the N reducers gets a non-overlapping range of X, but it doesn't sort the output of each reducer. You end up with N or more unsorted files with non-overlapping ranges. So this is more of a horizontal distribution of data.
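To make that concrete, here is a rough toy simulation in Python (not Hive itself). It assumes rows are routed by hashing the key modulo the number of reducers; the function name distribute_by, the crc32 stand-in hash, and the sample rows are all made up for illustration. The point is only that every value of the key lands on exactly one reducer and that nothing gets sorted inside a reducer.

```python
from collections import defaultdict
import zlib


def distribute_by(rows, key, num_reducers):
    """Toy model of DISTRIBUTE BY: route each row to a reducer by hashing its key.

    Rows keep their arrival order inside each reducer; nothing is sorted.
    """
    reducers = defaultdict(list)
    for row in rows:
        # crc32 is only a stand-in hash (an assumption); it is stable across runs.
        reducer_id = zlib.crc32(str(row[key]).encode("utf-8")) % num_reducers
        reducers[reducer_id].append(row)
    return dict(reducers)


# Made-up sample rows using the column names from the question.
rows = [
    {"p_mfgr": "Manufacturer#1", "p_name": "plum"},
    {"p_mfgr": "Manufacturer#2", "p_name": "almond"},
    {"p_mfgr": "Manufacturer#1", "p_name": "azure"},
    {"p_mfgr": "Manufacturer#3", "p_name": "khaki"},
    {"p_mfgr": "Manufacturer#2", "p_name": "snow"},
]

buckets = distribute_by(rows, "p_mfgr", num_reducers=2)
for reducer_id, bucket in sorted(buckets.items()):
    print(reducer_id, [(r["p_mfgr"], r["p_name"]) for r in bucket])
```

Within each reducer the p_name values stay in arrival order, which is why a separate sort by is needed if ordered output matters.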
In my view, partition by is based more on values, so it is a vertical distribution of data. I may be wrong in my understanding of this.

On Fri, Jul 11, 2014 at 1:38 PM, Eric Chu <e...@rocketfuel.com> wrote:

> Does anyone know what
>
> *rank() over(distribute by p_mfgr sort by p_name)*
>
> does exactly and how it's different from
>
> *rank() over(partition by p_mfgr order by p_name)*?
>
> Thanks,
>
> Eric
>

--
Nitin Pawar