[
https://issues.apache.org/jira/browse/HIVE-896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13547625#comment-13547625
]
Harish Butani commented on HIVE-896:
------------------------------------
Hi Alan,
Thanks for taking the time. Here are my responses:
1. Could you point out the interfaces...
Yes, you are right. From a function writer's perspective, TableFunctionEvaluator
and TableFunctionResolver are the important interfaces; PTFPartition (and
PTFPartitionIterator) is the data container interface.
2. If I read this right you are using CLUSTER BY and SORT BY instead of
PARTITION BY and ORDER BY for syntax in OVER. Why?
To highlight the similarity. The Partition/Order specs in a Window clause have
the same meaning as Cluster/Distribute in HQL. Note that you can use a
Cluster/Distribute at the query level and not specify any Partition spec in a
Window clause. So the following are different ways of specifying the same
partitioning and ordering:
a.
select p_mfgr, p_name,
sum(p_retailprice) over (distribute by p_mfgr sort by p_name rows between
unbounded preceding and current row)
from part;
b.
select p_mfgr, p_name, p_size,
sum(p_retailprice) over (rows between unbounded preceding and current row)
from part
distribute by p_mfgr
sort by p_name;
c.
select p_mfgr, p_name, p_size,
sum(p_retailprice) over (w1)
from part
window w1 as distribute by p_mfgr sort by p_name rows between 2 preceding and
2 following;
(I just realized that there are no examples of using Cluster/Distribute in
Window clauses in the tests; we are adding them now.)
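For readers less familiar with the semantics, the running sum in queries a and
b can be sketched outside Hive as follows. This is only a minimal Python
illustration of what the window computes, not Hive code; the function name
running_sum_by_partition and the sample rows are made up for the example, and
only the column names come from the queries above.

```python
from itertools import groupby
from operator import itemgetter


def running_sum_by_partition(rows):
    """Emulate: sum(p_retailprice) over (distribute by p_mfgr sort by p_name
    rows between unbounded preceding and current row).

    `rows` is an iterable of (p_mfgr, p_name, p_retailprice) tuples.
    """
    # Bring rows of the same p_mfgr together, ordered by p_name within each.
    ordered = sorted(rows, key=itemgetter(0, 1))
    result = []
    for mfgr, group in groupby(ordered, key=itemgetter(0)):
        total = 0.0  # the running sum restarts for every partition
        for _, name, price in group:
            total += price
            result.append((mfgr, name, total))
    return result


# Hypothetical sample data, not taken from the `part` table.
sample = [
    ("m1", "b", 20.0),
    ("m2", "a", 5.0),
    ("m1", "a", 10.0),
]
cumulative = running_sum_by_partition(sample)
```

Whether the partitioning and ordering come from the Window clause or from the
query level, the per-partition computation sketched here is the same.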
3. Can I put one of the existing aggregate functions in an OVER clause using
this?
I am not exactly clear what your question is; I may have answered it above. To
be clear, there is no special Window Function. Any existing Hive UDAF invocation
can have a Windowing specification.
Tests 31, 40, and 41 cover most of the UDAFs.
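The idea that the aggregate and the window frame are independent can be
sketched like this. It is a minimal Python illustration under the assumption
that the values already belong to one partition and are already sorted; the
helper name apply_over_rows_frame is hypothetical and does not correspond to
anything in the patch.

```python
def apply_over_rows_frame(values, aggregate, preceding, following):
    """Apply any aggregate to a 'rows between N preceding and M following' frame.

    `values` holds one partition's column values, already in sort order.
    `aggregate` is any function that reduces a list to a single value.
    """
    out = []
    n = len(values)
    for i in range(n):
        start = max(0, i - preceding)      # frame is clipped at partition start
        end = min(n, i + following + 1)    # and at partition end
        out.append(aggregate(values[start:end]))
    return out


prices = [10, 20, 30, 40, 50]  # hypothetical values for one partition
windowed_sum = apply_over_rows_frame(prices, sum, 2, 2)
windowed_max = apply_over_rows_frame(prices, max, 2, 2)
windowed_min = apply_over_rows_frame(prices, min, 1, 0)
```

The same frame logic serves sum, max, and min here, which mirrors the point
that no special window-only function is needed.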
4. Could you explain how the partition is handled in memory...
Partitions are backed by a Persistent List (see
ptf.ds.PartitionedByteBasedList). We need to do some work to refactor this
package. Yes, you are right: we could delay bringing rows into a partition and
get rid of rows once they are outside the window. This is true for the
Windowing Table Function, especially for Range based Windows.
But for a general PTF the contract is Partition in, Partition out. For example,
the CandidateFrequency function will read the rows in a partition multiple
times. The PartitionedByteBasedList is backed by a set of
PersistentByteBasedLists, which use weak references and store their data on
disk. I have done some testing with partitions of a million rows. But I agree
with what you are getting at: there is more that can be done to reduce the
memory footprint. I haven't gotten around to it yet.
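The optimization discussed above, dropping rows once they fall outside the
window, could look roughly like the following for a simple Rows frame. This is
a hedged Python sketch of the general technique, not how
PartitionedByteBasedList works today; the function name streaming_rows_sum is
hypothetical, and it only covers frames of the form "rows between N preceding
and current row".

```python
from collections import deque


def streaming_rows_sum(values, preceding):
    """Yield sum over (rows between `preceding` preceding and current row).

    Only the rows inside the current frame are held in memory, so the
    footprint is bounded by the frame size rather than the partition size.
    """
    frame = deque()
    total = 0
    for value in values:
        frame.append(value)
        total += value
        if len(frame) > preceding + 1:
            # The oldest row has left the window: evict it and adjust the sum.
            total -= frame.popleft()
        yield total


# Hypothetical values for one already-sorted partition, fed as a stream.
sums = list(streaming_rows_sum(iter([10, 20, 30, 40, 50]), 2))
```

This works because the frame only looks backwards; a general PTF that re-reads
the partition several times cannot be streamed this way.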
> Add LEAD/LAG/FIRST/LAST analytical windowing functions to Hive.
> ---------------------------------------------------------------
>
> Key: HIVE-896
> URL: https://issues.apache.org/jira/browse/HIVE-896
> Project: Hive
> Issue Type: New Feature
> Components: OLAP, UDF
> Reporter: Amr Awadallah
> Priority: Minor
> Attachments: HIVE-896.1.patch.txt
>
>
> Windowing functions are very useful for click stream processing and similar
> time-series/sliding-window analytics.
> More details at:
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1006709
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007059
> http://download-west.oracle.com/docs/cd/B13789_01/server.101/b10736/analysis.htm#i1007032
> -- amr