Re: difference between partition by and distribute by in rank()

Eric Chu Fri, 11 Jul 2014 11:21:06 -0700

Thanks for the responses. I understand DISTRIBUTE BY and SORT BY in the
normal case (as described in the Hive doc); I just don't understand their
behavior in the OVER clause with RANK, which apparently you can do. See
ql/src/test/queries/clientpositive/windowing.q for example.

Yes, I saw Edward's blog. His solution is a UDF, while Hive's rank() is a
UDAF. Also, if you use his function, let's say you do

DISTRIBUTE BY user, SORT BY score DESC

then RANK(user) on that

The UDF would just give a different rank to each row within the same user
group; it can't give the same rank to different rows in the same user
group that have the same score.
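
To make the limitation concrete, here is a minimal Python sketch of a
per-row counter of this kind. It is not Edward's actual code (his UDF is
written in Java); the name counter_rank and the sample rows are hypothetical,
and the rows are assumed to arrive grouped by user and sorted by score DESC,
as they would on a reducer after DISTRIBUTE BY / SORT BY.

```python
def counter_rank(rows):
    """Number rows 1, 2, 3, ... within each user group.

    rows: iterable of (user, score), assumed to be grouped by user and
    sorted by score DESC within each user. Only the user column is
    compared, so tied scores still get different numbers.
    """
    prev_user = object()  # sentinel that never equals a real key
    n = 0
    for user, score in rows:
        if user != prev_user:
            # New user group: restart the counter.
            n = 0
            prev_user = user
        n += 1
        yield user, score, n


# Hypothetical sample data: alice has two rows tied at 90.
rows = [("alice", 90), ("alice", 90), ("alice", 70), ("bob", 85)]
numbered = list(counter_rank(rows))
```

The two alice rows with score 90 get 1 and 2 here, whereas rank() semantics
would give both of them 1 and give the next row 3.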

Hive's rank() OVER PARTITION BY seems to support this in the iterate()
method. Also, the function is applied to a single partition (in this case,
per user group), as opposed to a single reducer that may see different
partitions, and the prev/current row comparison is done on the PARTITION BY
columns.

The actual problem I'm hitting is that when I use Hive's rank(), I run into
an OOM issue when it adds a rank to an ArrayList in the RankBuffer class in
GenericUDAFRank.java. The same problem occurs with both RANK OVER /
DISTRIBUTE BY / SORT BY and RANK OVER / PARTITION BY / ORDER BY. So I want
to understand whether there's a mitigation other than increasing the heap.

If not, I'll have to go back to the UDF approach, which just outputs a rank
for each row, so it doesn't have the OOM issue. But since this goes
through all the rows on a reducer, I'd need to distinguish between the
DISTRIBUTE BY columns and the SORT BY columns in the UDF, so that it can
give the same rank to rows with the same SORT BY column values.
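
Roughly, the logic I have in mind looks like the Python sketch below. It is
only an illustration of the idea, not Hive code: a real version would be a
Java UDF, and the name streaming_rank is hypothetical. It assumes rows reach
the reducer grouped by the DISTRIBUTE BY key and sorted by the SORT BY key
within each group, and it keeps only a few scalars of state instead of a
per-partition list.

```python
def streaming_rank(rows):
    """Yield (part_key, sort_key, rank) with rank() semantics for ties.

    rows: iterable of (part_key, sort_key), assumed to be grouped by
    part_key (the DISTRIBUTE BY columns) and sorted by sort_key (the
    SORT BY columns) within each group.
    """
    no_key = object()  # sentinel that never equals a real key
    prev_part = no_key
    prev_sort = no_key
    seen = 0  # rows seen so far in the current partition
    rank = 0  # rank assigned to the current run of equal sort keys
    for part, sort_key in rows:
        if part != prev_part:
            # New partition: reset all per-partition state.
            seen = 0
            prev_sort = no_key
            prev_part = part
        seen += 1
        if sort_key != prev_sort:
            # Sort key changed: the rank jumps to the row position,
            # which leaves gaps after ties (1, 1, 3, ...).
            rank = seen
            prev_sort = sort_key
        yield part, sort_key, rank


# Hypothetical sample data with ties inside each user group.
rows = [("alice", 90), ("alice", 90), ("alice", 70), ("bob", 85), ("bob", 85)]
ranked = list(streaming_rank(rows))
```

Memory use is constant per reducer, but the result is only correct if the
input really arrives in the assumed order.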

On Fri, Jul 11, 2014 at 2:31 AM, Joshi, Rekha <rekha_jo...@intuit.com>
wrote:

>  Hi,
>
>  It is well known that the reducer nuances of order by and sort by
> relate to total ordering in the final output.
>
>  One could simulate rank() over() functionality by using distribute by /
> sort by on the dataset (or cluster by, if the key is the same), as in
> Edward's blog
> <http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive>.
>
>  From Hive 0.11 on, you can directly call rank() over (partition ..order..):
> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WindowingandAnalyticsFunctions>
>
>  AFAIK, in Hive the rank() over() syntax uses (partition ..order..) only.
>
>  Thanks
> Rekha
>
>   From: Eric Chu <e...@rocketfuel.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Friday, July 11, 2014 at 1:38 PM
> To: "hive-u...@hadoop.apache.org" <hive-u...@hadoop.apache.org>
> Subject: difference between partition by and distribute by in rank()
>
>   Does anyone know what
>
> rank() over(distribute by p_mfgr sort by p_name)
>
> does exactly and how it's different from
>
> rank() over(partition by p_mfgr order by p_name)?
>
> Thanks,
>
> Eric
>
>
