I wrote this query few minutes back- *select bid, pid, time from (* * select bid, pid, time, rank() over (partition by bid, pid order by time desc) as k * * from table1 ) as x * *where k <=3* *order by bid, pid, time desc*
Do you think this query will work with my Rank function that I provided below? On Thu, Jul 19, 2012 at 4:05 PM, comptech geeky <comptechge...@gmail.com>wrote: > Hi Igor, > > I am new to HiveQL world. Don't know that much basically. Currently I have > my Rank UDF function like this- > > *public final class Rank extends UDF{* > * private int counter;* > * private String last_key;* > * public int evaluate(final String key){* > * if ( !key.equalsIgnoreCase(this.last_key) ) {* > * this.counter = 0;* > * this.last_key = key;* > * }* > * return this.counter++;* > * }* > *}* > * > * > And I tried that query after removing pid from distribute by and sort by > clause, but I got the below output which is wrong again- > > *1345653 330760137950 0* > *1345653 330760137950 1* > *1345653 330760137950 2* > > But I need output something like this- > > *1345653 330760137950 2012-07-09 21:43:29* > *1345653 330760137950 2012-07-09 21:42:29* > *1345653 330760137950 2012-07-09 21:41:29* > 1345653 110909316904 2012-07-09 21:29:06 > 1345653 221065796761 2012-07-09 19:31:48 > > > Any help will be appreciated. > > > > > On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov <i...@decide.com> wrote: > >> Actually, never mind. Looks like you need to partition by both bid and >> pid. In that case, your problem is that rank() has to handle a combined >> bid+pid key. So first you need to create a combined key, partition by that >> key and pass it to your rank() function (assuming rank() knows to reset on >> a new key). You can cast bid and pid to string and concatenate them with a >> separator (bid_pid) to get a single partitioning key. Hope this makes sense. >> >> On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <i...@decide.com> wrote: >> >>> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT >>> clauses. >>> Your very first query was correct except for the nested subquery part. >>> (You don't need a double-nested subquery.) >>> >>> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <comptechge...@gmail.com >>> > wrote: >>> >>>> Hi Igor, >>>> >>>> I am not sure what I have to remove from Distribute By as in distribute >>>> by we have bid, pid and you said remove bid and time from distribute by and >>>> it doesn't have time >>>> >>>> *SELECT bid, pid, rank FROM * >>>> *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) >>>> FROM >>>> * >>>> * >>>> ( SELECT bid, pid, time FROM table1* >>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >>>> '2012-07-09' * >>>> * DISTRIBUTE BY bid,pid* >>>> * SORT BY bid,pid, time desc) A >>>> * >>>> * >>>> ) B >>>> * >>>> *WHERE rank < 3;* >>>> >>>> >>>> And also I tried running the above query as it is. I am not getting >>>> expected output instead of that I am getting output like this which is >>>> wrong If you compare my expected output with the below output- >>>> >>>> *1345653 110909316904 0* >>>> *1345653 221065796761 1* >>>> *1345653 330760137950 2* >>>> >>>> >>>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <i...@decide.com>wrote: >>>> >>>>> Remove pid,time from DISTRIBUTE BY. >>>>> >>>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky < >>>>> comptechge...@gmail.com> wrote: >>>>> >>>>>> Modified Query that I wrote and its not working as expected output is. >>>>>> >>>>>> * >>>>>> * >>>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* >>>>>> *FROM (* >>>>>> * SELECT bid, pid, time* >>>>>> * FROM table1* >>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >>>>>> '2012-07-09'* >>>>>> * DISTRIBUTE BY bid,pid,time* >>>>>> * SORT BY bid, time desc* >>>>>> *) a* >>>>>> *WHERE rank(bid) < 3;* >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky < >>>>>> comptechge...@gmail.com> wrote: >>>>>> >>>>>>> I wrote this query after modifying it- >>>>>>> >>>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time, >>>>>>> UNIX_TIMESTAMP(created_time)* >>>>>>> *FROM (* >>>>>>> * SELECT buyer_id, item_id, created_time* >>>>>>> * FROM testingtable1* >>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as >>>>>>> int))) = '2012-07-09'* >>>>>>> * DISTRIBUTE BY buyer_id,item_id* >>>>>>> * SORT BY buyer_id, created_time desc* >>>>>>> *) a* >>>>>>> *WHERE rank(buyer_id) < 3;* >>>>>>> >>>>>>> And the output I got is which is sligtly wrong as it is missing two >>>>>>> rows- >>>>>>> >>>>>>> *1345653 330760137950 2012-07-09 21:43:29* >>>>>>> *1345653 330760137950 2012-07-09 21:42:29* >>>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>>> >>>>>>> These two rows are missing- >>>>>>> >>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>> >>>>>>> So full output should be like this- >>>>>>> >>>>>>> *1345653 330760137950 2012-07-09 21:43:29* >>>>>>> *1345653 330760137950 2012-07-09 21:42:29* >>>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>> >>>>>>> >>>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky < >>>>>>> comptechge...@gmail.com> wrote: >>>>>>> >>>>>>>> Can you show me the exact query that I need to do for this >>>>>>>> particular problem consideing my scenario? It will be of great help to >>>>>>>> me. >>>>>>>> As I am new to HiveQL. >>>>>>>> >>>>>>>> I need TOP 3 for those if BID and PID gets matched but with >>>>>>>> different timestamp. >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans < >>>>>>>> philip.j.trom...@gmail.com> wrote: >>>>>>>> >>>>>>>>> Your rank() is being evaluated map side. Put your distribute by >>>>>>>>> and sort by in an inner query, and then evaluate your rank() in an >>>>>>>>> outer >>>>>>>>> query. >>>>>>>>> >>>>>>>>> Phil. >>>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <comptechge...@gmail.com> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> This is the below data in my Table1 >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> BID PID TIME >>>>>>>>>> >>>>>>>>>> --------------+-------------------------+-------------------------------- >>>>>>>>>> 1345653 330760137950 2012-07-09 21:42:29 >>>>>>>>>> 1345653 330760137950 2012-07-09 21:43:29 >>>>>>>>>> 1345653 330760137950 2012-07-09 21:40:29 >>>>>>>>>> 1345653 330760137950 2012-07-09 21:41:29 >>>>>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>>>>> >>>>>>>>>> So If I need to clarify the above scenario- I have data in above >>>>>>>>>> table like this- >>>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times >>>>>>>>>> but with different timestamps in red color. So I need the output >>>>>>>>>> something >>>>>>>>>> like this- >>>>>>>>>> >>>>>>>>>> Output that I need:- >>>>>>>>>> >>>>>>>>>> *1345653 330760137950 2012-07-09 21:43:29 * >>>>>>>>>> *1345653 330760137950 2012-07-09 21:42:29 * >>>>>>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>>>>> >>>>>>>>>> So Basically If BID and PID are same but with different >>>>>>>>>> timestamps, then I need TOP 3 sorted with TIME in descending order >>>>>>>>>> >>>>>>>>>> And for this I created rank UDF (User Defined Function). And I >>>>>>>>>> wrote the below query but its not working for me. Can anyone help me >>>>>>>>>> on >>>>>>>>>> this? >>>>>>>>>> * >>>>>>>>>> * >>>>>>>>>> * >>>>>>>>>> * >>>>>>>>>> *SELECT buyer_id, item_id, created_time* >>>>>>>>>> * FROM table1* >>>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) >>>>>>>>>> as int))) = '2012-07-09'* >>>>>>>>>> * DISTRIBUTE BY buyer_id* >>>>>>>>>> * SORT BY buyer_id, created_time desc* >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> >