I wrote this query few minutes back-

*select bid, pid, time from (*
*    select bid, pid, time, rank() over (partition by bid, pid order by
time desc) as k *
*    from table1 ) as x *
*where k  <=3*
*order by bid, pid, time desc*


Do you think this query will work with my Rank function that I provided
below?




On Thu, Jul 19, 2012 at 4:05 PM, comptech geeky <comptechge...@gmail.com>wrote:

> Hi Igor,
>
> I am new to HiveQL world. Don't know that much basically. Currently I have
> my Rank UDF function like this-
>
> *public final class Rank extends UDF{*
> *    private int  counter;*
> *    private String last_key;*
> *    public int evaluate(final String key){*
> *  if ( !key.equalsIgnoreCase(this.last_key) ) {*
> *     this.counter = 0;*
> *     this.last_key = key;*
> *  }*
> *  return this.counter++;*
> *    }*
> *}*
> *
> *
> And I tried that query after removing pid from distribute by and sort by
> clause, but I got the below output which is wrong again-
>
> *1345653 330760137950    0*
> *1345653 330760137950    1*
> *1345653 330760137950    2*
>
> But I need output something like this-
>
> *1345653    330760137950       2012-07-09 21:43:29*
> *1345653    330760137950       2012-07-09 21:42:29*
> *1345653    330760137950       2012-07-09 21:41:29*
> 1345653    110909316904       2012-07-09 21:29:06
> 1345653    221065796761       2012-07-09 19:31:48
>
>
> Any help will be appreciated.
>
>
>
>
> On Thu, Jul 19, 2012 at 4:00 PM, Igor Tatarinov <i...@decide.com> wrote:
>
>> Actually, never mind. Looks like you need to partition by both bid and
>> pid. In that case, your problem is that rank() has to handle a combined
>> bid+pid key. So first you need to create a combined key, partition by that
>> key and pass it to your rank() function (assuming rank() knows to reset on
>> a new key). You can cast bid and pid to string and concatenate them with a
>> separator (bid_pid) to get a single partitioning key. Hope this makes sense.
>>
>> On Thu, Jul 19, 2012 at 3:57 PM, Igor Tatarinov <i...@decide.com> wrote:
>>
>>> Sorry, just pid needs to be dropped from both DISTRIBUTE and SORT
>>> clauses.
>>> Your very first query was correct except for the nested subquery part.
>>> (You don't need a double-nested subquery.)
>>>
>>> On Thu, Jul 19, 2012 at 3:48 PM, comptech geeky <comptechge...@gmail.com
>>> > wrote:
>>>
>>>> Hi Igor,
>>>>
>>>> I am not sure what I have to remove from Distribute By as in distribute
>>>> by we have bid, pid and you said remove bid and time from distribute by and
>>>> it doesn't have time
>>>>
>>>> *SELECT bid, pid, rank FROM *
>>>>       *(SELECT bid, pid, rank(bid) rank, time, UNIX_TIMESTAMP(time)
>>>> FROM
>>>> *
>>>>   *
>>>>            ( SELECT bid, pid, time FROM table1*
>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>> '2012-07-09' *
>>>> *            DISTRIBUTE BY bid,pid*
>>>> *            SORT BY bid,pid, time desc) A
>>>> *
>>>> *
>>>>       ) B
>>>> *
>>>> *WHERE rank < 3;*
>>>>
>>>>
>>>> And also I tried running the above query as it is. I am not getting
>>>> expected output instead of that I am getting output like this which is
>>>> wrong If you compare my expected output with the below output-
>>>>
>>>>  *1345653 110909316904    0*
>>>> *1345653 221065796761    1*
>>>> *1345653 330760137950    2*
>>>>
>>>>
>>>> On Thu, Jul 19, 2012 at 3:43 PM, Igor Tatarinov <i...@decide.com>wrote:
>>>>
>>>>> Remove pid,time from DISTRIBUTE BY.
>>>>>
>>>>> On Thu, Jul 19, 2012 at 1:45 PM, comptech geeky <
>>>>> comptechge...@gmail.com> wrote:
>>>>>
>>>>>> Modified Query that I wrote and its not working as expected output is.
>>>>>>
>>>>>> *
>>>>>> *
>>>>>> *SELECT bid, pid, rank(bid), time, UNIX_TIMESTAMP(time)*
>>>>>> *FROM (*
>>>>>> *    SELECT bid, pid, time*
>>>>>> *    FROM table1*
>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(time) as int))) =
>>>>>> '2012-07-09'*
>>>>>> *    DISTRIBUTE BY bid,pid,time*
>>>>>> *    SORT BY bid, time desc*
>>>>>> *) a*
>>>>>> *WHERE rank(bid) < 3;*
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 19, 2012 at 1:41 PM, comptech geeky <
>>>>>> comptechge...@gmail.com> wrote:
>>>>>>
>>>>>>> I wrote this query after modifying it-
>>>>>>>
>>>>>>> *SELECT buyer_id, item_id, rank(buyer_id), created_time,
>>>>>>> UNIX_TIMESTAMP(created_time)*
>>>>>>> *FROM (*
>>>>>>> *    SELECT buyer_id, item_id, created_time*
>>>>>>> *    FROM testingtable1*
>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time) as
>>>>>>> int))) = '2012-07-09'*
>>>>>>> *    DISTRIBUTE BY buyer_id,item_id*
>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>> *) a*
>>>>>>> *WHERE rank(buyer_id) < 3;*
>>>>>>>
>>>>>>> And the output I got is which is sligtly wrong as it is missing two
>>>>>>> rows-
>>>>>>>
>>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>
>>>>>>> These two rows are missing-
>>>>>>>
>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>>  So full output should be like this-
>>>>>>>
>>>>>>> *1345653    330760137950       2012-07-09 21:43:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:42:29*
>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jul 19, 2012 at 1:29 PM, comptech geeky <
>>>>>>> comptechge...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Can you show me the exact query that I need to do for this
>>>>>>>> particular problem consideing my scenario? It will be of great help to 
>>>>>>>> me.
>>>>>>>> As I am new to HiveQL.
>>>>>>>>
>>>>>>>> I need TOP 3 for those if BID and PID gets matched but with
>>>>>>>> different timestamp.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jul 19, 2012 at 1:15 PM, Philip Tromans <
>>>>>>>> philip.j.trom...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Your rank() is being evaluated map side. Put your distribute by
>>>>>>>>> and sort by in an inner query, and then evaluate your rank() in an 
>>>>>>>>> outer
>>>>>>>>> query.
>>>>>>>>>
>>>>>>>>> Phil.
>>>>>>>>> On Jul 19, 2012 9:00 PM, "comptech geeky" <comptechge...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This is the below data in my Table1
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> BID       PID                       TIME
>>>>>>>>>>
>>>>>>>>>> --------------+-------------------------+--------------------------------
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:42:29
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:43:29
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:40:29
>>>>>>>>>> 1345653   330760137950       2012-07-09 21:41:29
>>>>>>>>>> 1345653   110909316904       2012-07-09 21:29:06
>>>>>>>>>> 1345653   221065796761       2012-07-09 19:31:48
>>>>>>>>>>
>>>>>>>>>> So If I need to clarify the above scenario- I have data in above
>>>>>>>>>> table like this-
>>>>>>>>>> For USER *`1345653` *I have this PID `*330760137950` *four times
>>>>>>>>>> but with different timestamps in red color. So I need the output 
>>>>>>>>>> something
>>>>>>>>>> like this-
>>>>>>>>>>
>>>>>>>>>> Output that I need:-
>>>>>>>>>>
>>>>>>>>>> *1345653    330760137950       2012-07-09 21:43:29 *
>>>>>>>>>> *1345653    330760137950       2012-07-09 21:42:29 *
>>>>>>>>>> *1345653    330760137950       2012-07-09 21:41:29*
>>>>>>>>>> 1345653    110909316904       2012-07-09 21:29:06
>>>>>>>>>> 1345653    221065796761       2012-07-09 19:31:48
>>>>>>>>>>
>>>>>>>>>> So Basically If BID and PID are same but with different
>>>>>>>>>> timestamps, then I need TOP 3 sorted with TIME in descending order
>>>>>>>>>>
>>>>>>>>>> And for this I created rank UDF (User Defined Function). And I
>>>>>>>>>> wrote the below query but its not working for me. Can anyone help me 
>>>>>>>>>> on
>>>>>>>>>> this?
>>>>>>>>>> *
>>>>>>>>>> *
>>>>>>>>>> *
>>>>>>>>>> *
>>>>>>>>>> *SELECT buyer_id, item_id, created_time*
>>>>>>>>>> *    FROM table1*
>>>>>>>>>> * where to_date(from_unixtime(cast(UNIX_TIMESTAMP(created_time)
>>>>>>>>>> as int))) = '2012-07-09'*
>>>>>>>>>> *    DISTRIBUTE BY buyer_id*
>>>>>>>>>> *    SORT BY buyer_id, created_time desc*
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to