Hi, I more or less had the same problem and finally got it down by introducing a second subquery. This will guarantee that the rank function is invoked on the reduce phase and that the rank results are properly sorted.
I guess something like this: *SELECT bid, pid, rank FROM * *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) FROM * * ( SELECT bid, pid, time FROM table1* * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = '2012-07-09' * * DISTRIBUTE BY bid,pid* * SORT BY bid,pid, time desc) A * * ) B WHERE rank < 3;* The A. query is carried out at the map phase and makes sure the record arrive in the proper order at the reducer The B. query takes care of applying the rank function (aliased as rank) The outer most query (a second MR job) takes care of filtering the TOP-3 Maybe this will not run at once but hope you get the idea for the proper set up. I modelled it after a query that I had to run, which finally returned the proper TOP-10 for me. Jasper 2012/7/20 comptech geeky <comptechge...@gmail.com> > Can anyone help me with this? I have tried other options by tweaking the > query also. I am not able to achieve my expected output. > > > > On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky > <comptechge...@gmail.com>wrote: > >> Modified Query that I wrote and its not working as expected output is. >> >> * >> * >> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* >> *FROM (* >> * SELECT bid, pid, time* >> * FROM table1* >> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >> '2012-07-09'* >> * DISTRIBUTE BY bid,pid,time* >> * SORT BY bid, time desc* >> *) a* >> *WHERE rank(bid) < 3;* >> >> >> >> >> >> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky >> <comptechge...@gmail.com>wrote: >> >>> I wrote this query after modifying it- >>> >>> *SELECT buyer_id, item_id, rank(buyer_id), created_time, >>> UNIX_TIMESTAMP(created_time)* >>> *FROM (* >>> * SELECT buyer_id, item_id, created_time* >>> * FROM testingtable1* >>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as >>> int))) = '2012-07-09'* >>> * DISTRIBUTE BY buyer_id,item_id* >>> * SORT BY buyer_id, created_time desc* >>> *) a* >>> *WHERE rank(buyer_id) < 3;* >>> >>> And the output I got is which is sligtly wrong as it is missing two rows- >>> >>> *1345653 330760137950 2012-07-09 21:43:29* >>> *1345653 330760137950 2012-07-09 21:42:29* >>> *1345653 330760137950 2012-07-09 21:41:29* >>> >>> These two rows are missing- >>> >>> 1345653 110909316904 2012-07-09 21:29:06 >>> 1345653 221065796761 2012-07-09 19:31:48 >>> >>> So full output should be like this- >>> >>> *1345653 330760137950 2012-07-09 21:43:29* >>> *1345653 330760137950 2012-07-09 21:42:29* >>> *1345653 330760137950 2012-07-09 21:41:29* >>> 1345653 110909316904 2012-07-09 21:29:06 >>> 1345653 221065796761 2012-07-09 19:31:48 >>> >>> >>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <comptechge...@gmail.com >>> > wrote: >>> >>>> Can you show me the exact query that I need to do for this particular >>>> problem consideing my scenario? It will be of great help to me. As I am new >>>> to HiveQL. >>>> >>>> I need TOP 3 for those if BID and PID gets matched but with different >>>> timestamp. >>>> >>>> >>>> >>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans < >>>> philip.j.trom...@gmail.com> wrote: >>>> >>>>> Your rank() is being evaluated map side. Put your distribute by and >>>>> sort by in an inner query, and then evaluate your rank() in an outer >>>>> query. >>>>> >>>>> Phil. >>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <comptechge...@gmail.com> >>>>> wrote: >>>>> >>>>>> This is the below data in my Table1 >>>>>> >>>>>> >>>>>> BID PID TIME >>>>>> >>>>>> --------------+-------------------------+-------------------------------- >>>>>> 1345653 330760137950 2012-07-09 21:42:29 >>>>>> 1345653 330760137950 2012-07-09 21:43:29 >>>>>> 1345653 330760137950 2012-07-09 21:40:29 >>>>>> 1345653 330760137950 2012-07-09 21:41:29 >>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>> >>>>>> So If I need to clarify the above scenario- I have data in above >>>>>> table like this- >>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times but >>>>>> with different timestamps in red color. So I need the output something >>>>>> like >>>>>> this- >>>>>> >>>>>> Output that I need:- >>>>>> >>>>>> *1345653 330760137950 2012-07-09 21:43:29 * >>>>>> *1345653 330760137950 2012-07-09 21:42:29 * >>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>> >>>>>> So Basically If BID and PID are same but with different timestamps, >>>>>> then I need TOP 3 sorted with TIME in descending order >>>>>> >>>>>> And for this I created rank UDF (User Defined Function). And I wrote >>>>>> the below query but its not working for me. Can anyone help me on this? >>>>>> * >>>>>> * >>>>>> * >>>>>> * >>>>>> *SELECT buyer_id, item_id, created_time* >>>>>> * FROM table1* >>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as >>>>>> int))) = '2012-07-09'* >>>>>> * DISTRIBUTE BY buyer_id* >>>>>> * SORT BY buyer_id, created_time desc* >>>>>> >>>>>> >>>> >>> >> >