[jira] [Commented] (HIVE-896) Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.

Harish Butani (JIRA) Tue, 08 Jan 2013 20:32:23 -0800

    [ https://issues.apache.org/jira/browse/HIVE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547625#comment-13547625 ]


Harish Butani commented on HIVE-896:
------------------------------------

Hi Alan,
Thanks for taking the time. Here are my responses:

1. Could you point out the interfaces...
Yes, you are right. From a function writer's perspective, TableFunctionEvaluator 
and TableFunctionResolver are the important interfaces; PTFPartition (and 
PTFPartitionIterator) is the data container interface.

2. If I read this right you are using CLUSTER BY and SORT BY instead of 
PARTITION BY and ORDER BY for syntax in OVER. Why?
To highlight the similarity. The Partition/Order specs in a Window clause have 
the same meaning as Distribute By/Sort By (or Cluster By) in HQL. Note that you 
can use Distribute/Sort at the query level and not specify any Partition spec 
in the Window clause. So the following are different ways of saying the same thing:

a.
select p_mfgr, p_name,
sum(p_retailprice) over (distribute by p_mfgr sort by p_name
                         rows between unbounded preceding and current row)
from part;
b.
select p_mfgr, p_name, p_size,
sum(p_retailprice) over (rows between unbounded preceding and current row)
from part
distribute by p_mfgr
sort by p_name;
c.
select p_mfgr, p_name, p_size,
sum(p_retailprice) over (w1)
from part
window w1 as distribute by p_mfgr sort by p_name
             rows between 2 preceding and 2 following;

(I just realized that there are no examples of using Cluster/Distribute in 
Window clauses in the tests; we are adding them now.)
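
As a rough illustration of what queries a and b compute, here is a minimal 
Python sketch of a running sum per partition. It is not Hive code: the 
function name running_sum and the sample rows are made up for this example, 
and only the column names come from the queries above.

```python
from itertools import groupby
from operator import itemgetter


def running_sum(rows, partition_key, order_key, value_key):
    """Emulate sum(value) over (distribute by partition_key sort by order_key
    rows between unbounded preceding and current row)."""
    # Sort by (partition, order) so each partition is contiguous and ordered.
    ordered = sorted(rows, key=itemgetter(partition_key, order_key))
    out = []
    for _, partition in groupby(ordered, key=itemgetter(partition_key)):
        total = 0.0  # the frame restarts at the beginning of every partition
        for row in partition:
            total += row[value_key]
            out.append({**row, "running_sum": total})
    return out


# Illustrative data only; not taken from the Hive test suite.
part = [
    {"p_mfgr": "m1", "p_name": "b", "p_retailprice": 20.0},
    {"p_mfgr": "m2", "p_name": "a", "p_retailprice": 5.0},
    {"p_mfgr": "m1", "p_name": "a", "p_retailprice": 10.0},
]
result = running_sum(part, "p_mfgr", "p_name", "p_retailprice")
```

The sort step plays the role of distribute by/sort by, whether it is written 
inside the OVER clause or at the query level; the inner loop is the window 
frame.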

3. Can I put one of the existing aggregate functions in an OVER clause using 
this?
I am not exactly clear what your question is; I may have answered it above. To 
be clear, there is no special Window Function. Any existing Hive UDAF invocation 
can have a Windowing specification.
Tests 31, 40, and 41 cover most of the UDAFs.

4. Could you explain how the partition is handled in memory...
Partitions are backed by a persistent list (see 
ptf.ds.PartitionedByteBasedList). We need to do some work to refactor this 
package. Yes, you are right: we could delay bringing rows into a partition 
and discard rows once they fall outside the window. This applies to the 
Windowing Table Function, especially for Range-based windows.
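
To make the idea of delaying and discarding rows concrete, here is a minimal 
Python sketch of a streaming sum over a "rows between 2 preceding and 2 
following" frame. It is an assumption about how such an optimization could 
look, not what the patch does; the name sliding_sum is hypothetical.

```python
from collections import deque


def sliding_sum(values, preceding=2, following=2):
    """Yield, for each row, the sum over the frame
    [row - preceding, row + following], keeping at most
    preceding + following + 1 values in memory."""
    window = deque()  # values currently inside the frame
    total = 0
    pending = 0       # rows read but not yet emitted
    for v in values:
        window.append(v)
        total += v
        pending += 1
        if pending > following:
            # The oldest pending row now has its full lookahead: emit it.
            yield total
            pending -= 1
            # Evict the row that is outside the next row's frame.
            if len(window) > preceding + pending:
                total -= window.popleft()
    # Input exhausted: the remaining rows have a truncated lookahead.
    while pending > 0:
        yield total
        pending -= 1
        if len(window) > preceding + pending:
            total -= window.popleft()
```

With this approach the memory held per partition depends on the frame size 
rather than on the partition size, which is the saving being discussed.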

But for a general PTF the contract is Partition in, Partition out. For example, 
the CandidateFrequency function reads the rows in a partition multiple times.
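
The following Python sketch shows why a "Partition in, Partition out" 
function cannot simply stream rows. It is a made-up two-pass function 
(share_of_total), not CandidateFrequency itself, whose logic is not shown 
here.

```python
def share_of_total(partition):
    """Partition in, partition out: the whole partition is needed twice."""
    rows = list(partition)  # the full partition must be available
    # Pass 1: read every row to compute the partition total.
    total = sum(price for _, price in rows)
    if total == 0:
        return [(name, 0.0) for name, _ in rows]
    # Pass 2: read every row again to emit its share of the total.
    return [(name, price / total) for name, price in rows]
```

No output row can be produced until the first pass has seen the last input 
row, so rows cannot be dropped early the way a bounded window frame allows.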

The PartitionedByteBasedList is backed by a set of PersistentByteBasedLists, 
which use weak references and store their data on disk. I have done some 
testing with partitions of a million rows. But I agree with what you are 
getting at: more can be done to reduce the memory footprint. I haven't gotten 
around to it yet.
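
For readers unfamiliar with the pattern, here is a minimal Python sketch of a 
list that stores its rows on disk and keeps only weak references in memory. 
It is an analogy, not the actual PersistentByteBasedList; the class names 
DiskBackedList and _Row are hypothetical.

```python
import pickle
import tempfile
import weakref


class _Row(dict):
    """dict subclass, because plain dict objects cannot be weakly referenced."""


class DiskBackedList:
    """Append-only list: rows live in a temp file, with a weak in-memory cache."""

    def __init__(self):
        self._file = tempfile.TemporaryFile()
        self._offsets = []  # byte offset of each serialized row
        self._cache = weakref.WeakValueDictionary()  # index -> row, if alive

    def append(self, row):
        self._file.seek(0, 2)  # move to the end of the file
        self._offsets.append(self._file.tell())
        pickle.dump(dict(row), self._file)
        self._cache[len(self._offsets) - 1] = _Row(row)

    def __len__(self):
        return len(self._offsets)

    def __getitem__(self, index):
        row = self._cache.get(index)
        if row is None:
            # Not in memory any more: read the row back from disk.
            self._file.seek(self._offsets[index])
            row = _Row(pickle.load(self._file))
            self._cache[index] = row
        return row
```

A row stays in memory only while some caller still holds a reference to it; 
otherwise it is re-read from disk on the next access. Only the list of 
offsets grows with the partition size.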

                
> Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
> ---------------------------------------------------------------
>
>                 Key: HIVE-896
>                 URL: https://issues.apache.org/jira/browse/HIVE-896
>             Project: Hive
>          Issue Type: New Feature
>          Components: OLAP, UDF
>            Reporter: Amr Awadallah
>            Priority: Minor
>         Attachments: HIVE-896.1.patch.txt
>
>
> Windowing functions are very useful for click stream processing and similar 
> time-series/sliding-window analytics.
> More details at:
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1006709
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007059
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007032
> -- amr

