Actually, never mind. Looks like you need to partition by both bid and pid. In that case, your problem is that rank() has to handle a combined bid+pid key. So first you need to create a combined key, partition by that key and pass it to your rank() function (assuming rank() knows to reset on a new key). You can cast bid and pid to string and concatenate them with a separator (bid_pid) to get a single partitioning key. Hope this makes sense.
On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <i...@decide.com> wrote: > Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT clauses. > Your very first query was correct except for the nested subquery part. > (You don't need a double-nested subquery.) > > On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky > <comptechge...@gmail.com>wrote: > >> Hi Igor, >> >> I am not sure what I have to remove from Distribute By as in distribute >> by we have bid, pid and you said remove bid and time from distribute by and >> it doesn't have time >> >> *SELECT bid, pid, rank FROM * >> *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time) >> FROM >> * >> * >> ( SELECT bid, pid, time FROM table1* >> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >> '2012-07-09' * >> * DISTRIBUTE BY bid,pid* >> * SORT BY bid,pid, time desc) A >> * >> * >> ) B >> * >> *WHERE rank < 3;* >> >> >> And also I tried running the above query as it is. I am not getting >> expected output instead of that I am getting output like this which is >> wrong If you compare my expected output with the below output- >> >> *1345653 110909316904 0* >> *1345653 221065796761 1* >> *1345653 330760137950 2* >> >> >> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <i...@decide.com> wrote: >> >>> Remove pid,time from DISTRIBUTE BY. >>> >>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <comptechge...@gmail.com >>> > wrote: >>> >>>> Modified Query that I wrote and its not working as expected output is. >>>> >>>> * >>>> * >>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)* >>>> *FROM (* >>>> * SELECT bid, pid, time* >>>> * FROM table1* >>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) = >>>> '2012-07-09'* >>>> * DISTRIBUTE BY bid,pid,time* >>>> * SORT BY bid, time desc* >>>> *) a* >>>> *WHERE rank(bid) < 3;* >>>> >>>> >>>> >>>> >>>> >>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky < >>>> comptechge...@gmail.com> wrote: >>>> >>>>> I wrote this query after modifying it- >>>>> >>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time, >>>>> UNIX_TIMESTAMP(created_time)* >>>>> *FROM (* >>>>> * SELECT buyer_id, item_id, created_time* >>>>> * FROM testingtable1* >>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as >>>>> int))) = '2012-07-09'* >>>>> * DISTRIBUTE BY buyer_id,item_id* >>>>> * SORT BY buyer_id, created_time desc* >>>>> *) a* >>>>> *WHERE rank(buyer_id) < 3;* >>>>> >>>>> And the output I got is which is sligtly wrong as it is missing two >>>>> rows- >>>>> >>>>> *1345653 330760137950 2012-07-09 21:43:29* >>>>> *1345653 330760137950 2012-07-09 21:42:29* >>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>> >>>>> These two rows are missing- >>>>> >>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>> >>>>> So full output should be like this- >>>>> >>>>> *1345653 330760137950 2012-07-09 21:43:29* >>>>> *1345653 330760137950 2012-07-09 21:42:29* >>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>> >>>>> >>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky < >>>>> comptechge...@gmail.com> wrote: >>>>> >>>>>> Can you show me the exact query that I need to do for this particular >>>>>> problem consideing my scenario? It will be of great help to me. As I am >>>>>> new >>>>>> to HiveQL. >>>>>> >>>>>> I need TOP 3 for those if BID and PID gets matched but with different >>>>>> timestamp. >>>>>> >>>>>> >>>>>> >>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans < >>>>>> philip.j.trom...@gmail.com> wrote: >>>>>> >>>>>>> Your rank() is being evaluated map side. Put your distribute by and >>>>>>> sort by in an inner query, and then evaluate your rank() in an outer >>>>>>> query. >>>>>>> >>>>>>> Phil. >>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <comptechge...@gmail.com> >>>>>>> wrote: >>>>>>> >>>>>>>> This is the below data in my Table1 >>>>>>>> >>>>>>>> >>>>>>>> BID PID TIME >>>>>>>> >>>>>>>> --------------+-------------------------+-------------------------------- >>>>>>>> 1345653 330760137950 2012-07-09 21:42:29 >>>>>>>> 1345653 330760137950 2012-07-09 21:43:29 >>>>>>>> 1345653 330760137950 2012-07-09 21:40:29 >>>>>>>> 1345653 330760137950 2012-07-09 21:41:29 >>>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>>> >>>>>>>> So If I need to clarify the above scenario- I have data in above >>>>>>>> table like this- >>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times >>>>>>>> but with different timestamps in red color. So I need the output >>>>>>>> something >>>>>>>> like this- >>>>>>>> >>>>>>>> Output that I need:- >>>>>>>> >>>>>>>> *1345653 330760137950 2012-07-09 21:43:29 * >>>>>>>> *1345653 330760137950 2012-07-09 21:42:29 * >>>>>>>> *1345653 330760137950 2012-07-09 21:41:29* >>>>>>>> 1345653 110909316904 2012-07-09 21:29:06 >>>>>>>> 1345653 221065796761 2012-07-09 19:31:48 >>>>>>>> >>>>>>>> So Basically If BID and PID are same but with different timestamps, >>>>>>>> then I need TOP 3 sorted with TIME in descending order >>>>>>>> >>>>>>>> And for this I created rank UDF (User Defined Function). And I >>>>>>>> wrote the below query but its not working for me. Can anyone help me on >>>>>>>> this? >>>>>>>> * >>>>>>>> * >>>>>>>> * >>>>>>>> * >>>>>>>> *SELECT buyer_id, item_id, created_time* >>>>>>>> * FROM table1* >>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as >>>>>>>> int))) = '2012-07-09'* >>>>>>>> * DISTRIBUTE BY buyer_id* >>>>>>>> * SORT BY buyer_id, created_time desc* >>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> >