Re: difference between partition by and distribute by in rank()

Nitin Pawar Fri, 11 Jul 2014 01:33:07 -0700

As a general principle,
distribute by ensures that each of the N reducers gets a non-overlapping set
of X values, but it doesn't sort the output of each reducer. You end up with N
unsorted files whose X values don't overlap. So this is more of a horizontal
distribution of data.
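
To make that concrete, here is a rough Python sketch of the routing idea, not
Hive's actual implementation. The distribute_by helper, the sample rows, and
the sum-of-character-codes hash are all made up for illustration; Hive uses its
own hash function. The point is only that rows sharing a key land on the same
reducer, and nothing sorts them once they are there.

```python
from collections import defaultdict

def distribute_by(rows, key, num_reducers):
    """Route each row to a reducer by hashing its key value (no sorting)."""
    reducers = defaultdict(list)
    for row in rows:
        # Stand-in hash (an assumption): sum of character codes, stable across runs.
        bucket = sum(ord(c) for c in row[key]) % num_reducers
        reducers[bucket].append(row)
    return dict(reducers)

# Hypothetical sample data reusing the column names from the question.
rows = [
    {"p_mfgr": "m1", "p_name": "bolt"},
    {"p_mfgr": "m2", "p_name": "nut"},
    {"p_mfgr": "m1", "p_name": "anvil"},
    {"p_mfgr": "m3", "p_name": "cog"},
    {"p_mfgr": "m2", "p_name": "axle"},
]

out = distribute_by(rows, "p_mfgr", 2)
for bucket, bucket_rows in sorted(out.items()):
    # Each reducer holds a disjoint set of p_mfgr values, in arrival order.
    print(bucket, [(r["p_mfgr"], r["p_name"]) for r in bucket_rows])
```

Every p_mfgr value shows up on exactly one reducer, and p_name stays in
arrival order within each reducer until something like sort by is applied.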

In my view,
partition by is based more on values, so it's a vertical distribution of data.

I may be wrong in my understanding of this.

On Fri, Jul 11, 2014 at 1:38 PM, Eric Chu <e...@rocketfuel.com> wrote:

> Does anyone know what
>
> *rank() over(distribute by p_mfgr sort by p_name) *
>
> does exactly and how it's different from
>
> *rank() over(partition by p_mfgr order by p_name)*?
>
> Thanks,
>
> Eric
>
>

-- 
Nitin Pawar
