Thanks for the responses. I understand DISTRIBUTE BY and SORT BY in the normal case (as described in the Hive doc); what I don't understand is their behavior inside the OVER clause with RANK, which is apparently allowed. See ql/src/test/queries/clientpositive/windowing.q for an example.

Yes, I saw Edward's blog. His solution is a UDF, while Hive's rank() is a UDAF. Also, suppose you use his function with DISTRIBUTE BY user, SORT BY score DESC and then RANK(user) on that. The UDF would just give a different rank to each row within the same user group; it can't give the same rank to different rows in the same user group that have the same score. Hive's rank() OVER PARTITION BY seems to support this in the iterate() method. Also, the function is applied to a single partition (in this case, per user group), as opposed to a single reducer that may see different partitions, and the prev/current row comparison is done on the PARTITION BY columns.

The actual problem I'm hitting is that when I use Hive's rank(), I run into an OOM issue when it adds a rank to an ArrayList in the RankBuffer class in GenericUDAFRank.java. The same problem occurs with both RANK OVER / DISTRIBUTE BY / SORT BY and RANK OVER / PARTITION BY / ORDER BY. So I want to understand whether there's a mitigation other than increasing the heap.

If not, I'll have to go back to the UDF approach, which just outputs a rank for each row and so doesn't have the OOM issue. But since the UDF goes through all rows on a reducer, I'd need to distinguish between the DISTRIBUTE BY columns and the SORT BY columns in the UDF, so that it can give the same rank to rows with the same SORT BY column values.

On Fri, Jul 11, 2014 at 2:31 AM, Joshi, Rekha <rekha_jo...@intuit.com> wrote:

> Hi,
>
> Quite known, are order and sort reducer nuances related to total order
> in final output.
>
> One could *simulate* rank over() functionality by using *distribute by
> () / sort by() on datasets* {cluster by / if same key} as in Edward Blog
> <http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/doing_rank_with_hive>.
>
> From Hive0.11, you can have directly
> <https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics#LanguageManualWindowingAndAnalytics-WindowingandAnalyticsFunctions>
> call rank() over (partition ..order..).
>
> AFAIK, in hive rank over() syntax uses (partition ..order..) only.
>
> Thanks
> Rekha
>
> From: Eric Chu <e...@rocketfuel.com>
> Reply-To: "user@hive.apache.org" <user@hive.apache.org>
> Date: Friday, July 11, 2014 at 1:38 PM
> To: "hive-u...@hadoop.apache.org" <hive-u...@hadoop.apache.org>
> Subject: difference between partition by and distribute by in rank()
>
> Does anyone know what
>
> *rank() over(distribute by p_mfgr sort by p_name)*
>
> does exactly and how it's different from
>
> *rank() over(partition by p_mfgr order by p_name)*?
>
> Thanks,
>
> Eric
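
To make the UDF fallback described in the reply above more concrete, here is a minimal sketch of the ranking logic it would need. It is plain Java and uses no Hive API; the class name StreamingRank and the method next() are hypothetical, and it assumes rows reach the reducer already grouped by the DISTRIBUTE BY key and sorted by the SORT BY key. It keeps only the previous row's keys and two counters, so memory stays constant per reducer. It is not how GenericUDAFRank works.

```java
import java.util.Objects;

// Hypothetical helper, not part of Hive. Assumes rows arrive grouped by the
// DISTRIBUTE BY key and, within each group, ordered by the SORT BY key.
final class StreamingRank {
    private Object prevPartition;  // DISTRIBUTE BY value of the previous row
    private Object prevSortKey;    // SORT BY value of the previous row
    private long rowsInPartition;  // rows seen so far in the current partition
    private long currentRank;      // rank assigned to the previous row
    private boolean first = true;  // true until the first row has been seen

    // Returns the RANK()-style value for the next row in the stream.
    long next(Object partitionKey, Object sortKey) {
        if (first || !Objects.equals(partitionKey, prevPartition)) {
            // A new partition starts: reset both counters.
            rowsInPartition = 1;
            currentRank = 1;
            first = false;
        } else {
            rowsInPartition++;
            // Same partition: ties on the sort key keep the previous rank,
            // otherwise the rank jumps to the row's position (gaps after ties).
            if (!Objects.equals(sortKey, prevSortKey)) {
                currentRank = rowsInPartition;
            }
        }
        prevPartition = partitionKey;
        prevSortKey = sortKey;
        return currentRank;
    }

    public static void main(String[] args) {
        StreamingRank r = new StreamingRank();
        // Made-up sample rows: (user, score), sorted by score DESC per user.
        Object[][] rows = {{"alice", 90}, {"alice", 90}, {"alice", 70}, {"bob", 80}};
        for (Object[] row : rows) {
            System.out.println(row[0] + " " + row[1] + " -> " + r.next(row[0], row[1]));
        }
    }
}
```

The key point is that the partition comparison and the sort-key comparison are kept separate: a change in the DISTRIBUTE BY value resets the counters, while a change in the SORT BY value only advances the rank. Wrapping this in an actual Hive UDF would additionally depend on Hive evaluating the function once per row, in order, on the reducer, which this sketch does not verify.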