[jira] [Commented] (HIVE-896) Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.

Alan Gates (JIRA) Fri, 18 Jan 2013 14:42:15 -0800

    [ https://issues.apache.org/jira/browse/HIVE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13557700#comment-13557700 ]


Alan Gates commented on HIVE-896:
---------------------------------

bq. If I read this right you are using CLUSTER BY and SORT BY instead of 
PARTITION BY and ORDER BY for syntax in OVER. Why?
To highlight the similarity. The Partition/Order specs in a Window clause have 
the same meaning as Cluster/Distribute in HQL.

This is only true as long as you have only one OVER clause, right?  As soon as 
you add the ability to have separate OVER clauses partitioning by different 
keys (which users will want very soon), you lose this identity.

Even if you decide to retain this, I would argue that the standard PARTITION 
BY/ORDER BY syntax should be accepted as well.  HQL already has enough one-off 
syntax that makes life hard for people coming from more standard SQL.  That 
problem should not be made worse.
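
To make the concern about multiple OVER clauses concrete, here is a minimal 
Python sketch (not Hive code). The row layout and the helper names 
`partitions`, `lag_by` and `row_number_by` are made up for illustration. It 
computes two window functions over the same rows, each partitioned by a 
different key, so no single query-level clustering describes both.

```python
from collections import defaultdict

# Hypothetical click-stream rows: (user, page, ts). Rows are unique tuples.
rows = [
    ("alice", "home", 1),
    ("bob", "home", 2),
    ("alice", "cart", 3),
    ("bob", "cart", 5),
    ("alice", "home", 8),
]

def partitions(rows, key_idx, order_idx):
    """Group rows by one column and sort each group by another."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_idx]].append(row)
    return {k: sorted(v, key=lambda r: r[order_idx]) for k, v in groups.items()}

def lag_by(rows, key_idx, order_idx, value_idx):
    """Roughly LAG(value) OVER (PARTITION BY key ORDER BY order)."""
    out = {}
    for part in partitions(rows, key_idx, order_idx).values():
        prev = None
        for row in part:
            out[row] = prev
            prev = row[value_idx]
    return out

def row_number_by(rows, key_idx, order_idx):
    """Roughly ROW_NUMBER() OVER (PARTITION BY key ORDER BY order)."""
    out = {}
    for part in partitions(rows, key_idx, order_idx).values():
        for n, row in enumerate(part, start=1):
            out[row] = n
    return out

# Two window computations over the same input, partitioned by different keys.
prev_ts_per_user = lag_by(rows, key_idx=0, order_idx=2, value_idx=2)
visit_no_per_page = row_number_by(rows, key_idx=1, order_idx=2)
```

The first result needs the rows grouped by user, the second needs them 
grouped by page. That is the point at which the equivalence with a single 
CLUSTER BY/SORT BY stops holding.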

bq. Could you explain how the partition is handled in memory...
Partitions are backed by a persistent list (see 
ptf.ds.PartitionedByteBasedList). We need to do some work to refactor this 
package. Yes, you are right: rows could be brought into a partition lazily and 
dropped once they fall outside the window. This is true for the Windowing 
Table Function, especially for range-based windows.
But for a general PTF the contract is partition in, partition out. For example, 
the CandidateFrequency function will read the rows in a partition multiple times.
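
The idea of dropping rows once they leave the window can be sketched as 
follows. This is an illustrative Python example, not the Hive 
implementation: the function name `range_sum`, the `(ts, value)` row shape 
and the `width` parameter are all assumptions. It computes a running sum over 
a range-based window while buffering only the rows currently inside the range.

```python
from collections import deque

def range_sum(rows, width):
    """For rows sorted by ts, yield (ts, total, buffered), where total is the
    sum of values whose ts lies in [ts - width, ts] and buffered is the
    number of rows held in memory at that point."""
    window = deque()
    total = 0
    for ts, value in rows:
        window.append((ts, value))
        total += value
        # Evict rows that have fallen out of the range; they are never needed again.
        while window[0][0] < ts - width:
            _, old_value = window.popleft()
            total -= old_value
        yield ts, total, len(window)

# One partition, already sorted by ts (hypothetical data).
partition = [(1, 10), (2, 20), (4, 5), (9, 1)]
results = list(range_sum(partition, width=2))
```

A function that must read the partition several times cannot use this 
eviction, since any row may be needed again in a later pass.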

This is part of where I was going with my earlier question on why a windowing 
function would ever return a partition.  I am becoming less convinced that it 
makes sense to combine windowing and partition functions.  While they both take 
partitions as inputs they return different things.  Partition functions return 
partitions and windowing functions return a single value.  As you point out 
here the partition functions will also not be interested in the range limiting 
features of windowing functions.  But taking advantage of this in windowing 
functions will be very important for performance optimizations, I suspect.  At 
the very least, partition functions and windowing functions should be 
presented as separate entities to users and UDF writers, even if for now Hive 
shares some of the framework for handling them underneath.  That way, future 
optimizations and new features can be added in a way that is advantageous for 
each.
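
One way to picture the separation is as two distinct interfaces. The Python 
sketch below is purely hypothetical: `WindowingFunction`, 
`PartitionFunction`, `FirstValue` and `KeepFrequent` are illustrative names 
and do not correspond to Hive's UDF API or to the attached patch.

```python
from abc import ABC, abstractmethod
from collections import Counter
from typing import Any, List, Sequence, Tuple

Row = Tuple[Any, ...]

class WindowingFunction(ABC):
    """Returns a single value for a row, computed from that row's window frame."""
    @abstractmethod
    def evaluate(self, frame: Sequence[Row]) -> Any:
        ...

class PartitionFunction(ABC):
    """Partition in, partition out; free to read its input more than once."""
    @abstractmethod
    def process(self, partition: Sequence[Row]) -> List[Row]:
        ...

class FirstValue(WindowingFunction):
    """Value of column `col` in the first row of the frame, or None if empty."""
    def __init__(self, col: int) -> None:
        self.col = col

    def evaluate(self, frame: Sequence[Row]) -> Any:
        return frame[0][self.col] if frame else None

class KeepFrequent(PartitionFunction):
    """Made-up two-pass function: keep rows whose value in `col` occurs
    at least `min_count` times in the partition."""
    def __init__(self, col: int, min_count: int) -> None:
        self.col = col
        self.min_count = min_count

    def process(self, partition: Sequence[Row]) -> List[Row]:
        counts = Counter(row[self.col] for row in partition)  # first pass
        return [row for row in partition
                if counts[row[self.col]] >= self.min_count]   # second pass
```

With the two kept apart, a framework could hand a windowing function only 
its frame and evict older rows, while still materializing the whole partition 
for a partition function.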
                
> Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
> ---------------------------------------------------------------
>
>                 Key: HIVE-896
>                 URL: https://issues.apache.org/jira/browse/HIVE-896
>             Project: Hive
>          Issue Type: New Feature
>          Components: OLAP, UDF
>            Reporter: Amr Awadallah
>            Priority: Minor
>         Attachments: HIVE-896.1.patch.txt
>
>
> Windowing functions are very useful for click stream processing and similar 
> time-series/sliding-window analytics.
> More details at:
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1006709
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007059
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007032
> -- amr

