RE: Lag function in Hive

karanveer.singh Tue, 10 Apr 2012 22:44:29 -0700

When I try using rownum in my Hive QL query, I get: "Invalid column reference 
rownum". Am I missing something here?


Regards,
Karan 


-----Original Message-----
From: David Kulp [mailto:dk...@fiksu.com] 
Sent: 10 April 2012 20:15
To: user@hive.apache.org
Subject: Re: Lag function in Hive

New here.  Hello all.

Could you try a self-join, possibly also restricted to partitions?

E.g. SELECT t2.value - t1.value FROM mytable t1, mytable t2 WHERE t1.rownum = 
t2.rownum+1 AND t1.partition=foo AND t2.partition=bar

If your data is clustered by rownum, then this join should, in theory, be 
relatively fast -- especially if it makes sense to exploit partitions.

-d 

On Apr 10, 2012, at 10:37 AM, <karanveer.si...@barclays.com> 
<karanveer.si...@barclays.com> wrote:

> Makes sense but is not the distribution across nodes for a chunk of records 
> in that order. 
> 
> If Hive cannot help me do this, is there another way I can do this? I tried 
> generating an identifier using the perl script invoked using Hive but it does 
> not seem to work fine. While the stand alone script works fine, when the 
> record is created in hive using std output from perl - I see 2 records for 
> some of the unique identifiers. I explored the possibility of default data 
> type changes but that does not solve the problem. 
> 
> Regards,
> Karan
> 
> 
> -----Original Message-----
> From: Philip Tromans [mailto:philip.j.trom...@gmail.com] 
> Sent: 10 April 2012 19:48
> To: user@hive.apache.org
> Subject: Re: Lag function in Hive
> 
> Hi Karan,
> 
> To the best of my knowledge, there isn't one. It's also unlikely to
> happen because it's hard to parallelise in a map-reduce way (it
> requires knowing where you are in a result set, and who your
> neighbours are and they in turn need to be present on the same node as
> you which is difficult to guarantee).
> 
> Cheers,
> 
> Phil.
> 
> On 10 April 2012 14:44,  <karanveer.si...@barclays.com> wrote:
>> Hi,
>> 
>> Is there something like a 'lag' function in HIVE? The requirement is to
>> calculate difference for the same column for every 2 subsequent records.
>> 
>> For example.
>> 
>> Row, Column A, Column B
>> 1, 10, 100
>> 2, 20, 200
>> 3, 30, 300
>> 
>> 
>> The result that I need should be like:
>> 
>> Row, Column A, Column B, Result
>> 1, 10, 100, NULL
>> 2, 20, 200, 100 (200-100)
>> 3, 30, 300, 100 (300-200)
>> 
>> Rgds,
>> Karan
>> 
>> 
>> 
>> 
>> 
>> This e-mail and any attachments are confidential and intended solely for the
>> addressee and may also be privileged or exempt from disclosure under
>> applicable law. If you are not the addressee, or have received this e-mail
>> in error, please notify the sender immediately, delete it from your system
>> and do not copy, disclose or otherwise act upon any part of this e-mail or
>> its attachments.
>> 
>> Internet communications are not guaranteed to be secure or virus-free.
>> The Barclays Group does not accept responsibility for any loss arising from
>> unauthorised access to, or interference with, any Internet communications by
>> any third party, or from the transmission of any viruses. Replies to this
>> e-mail may be monitored by the Barclays Group for operational or business
>> reasons.
>> 
>> Any opinion or other information in this e-mail or its attachments that does
>> not relate to the business of the Barclays Group is personal to the sender
>> and is not given or endorsed by the Barclays Group.
>> 
>> Barclays Bank PLC.Registered in England and Wales (registered no. 1026167).
>> Registered Office: 1 Churchill Place, London, E14 5HP, United Kingdom.
>> 
>> Barclays Bank PLC is authorised and regulated by the Financial Services
>> Authority.

RE: Lag function in Hive

Reply via email to