[jira] [Commented] (HIVE-21217) Optimize range calculation for PTF

Adam Szita (JIRA) Wed, 06 Feb 2019 00:58:22 -0800


    [ 
https://issues.apache.org/jira/browse/HIVE-21217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16761593#comment-16761593
 ]


Adam Szita commented on HIVE-21217:
-----------------------------------

[~vgarg] this is valid for windowing queries, that specify an over() close with 
partition and orderby clauses, and the window is of RANGE type. You can take 
the following example that uses a table with a lot of rows containing values 
from a very limited set (1,2,3):
{code:java}
create table test(a int, b int, c int) stored as parquet;
insert into test values (1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 2), (1, 3, 1), 
(1, 3, 2), (1, 3, 3);

--repeat for 9 times:
insert overwrite table test select * from test
union all select * from test
union all select * from test
union all select * from test
union all select * from test
union all select * from test
union all select * from test;
----------------------

select SUM(a) over (partition by b order by c) as sums from test;{code}

> Optimize range calculation for PTF
> ----------------------------------
>
>                 Key: HIVE-21217
>                 URL: https://issues.apache.org/jira/browse/HIVE-21217
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Adam Szita
>            Assignee: Adam Szita
>            Priority: Major
>
> During window function execution Hive has to iterate on neighbouring rows of 
> the current row to find the beginning and end of the proper range (on which 
> the aggregation will be executed).
> When we're using range based windows and have many rows with a certain key 
> value this can take a lot of time. (e.g. partition size of 80M, in which we 
> have 2 ranges of 40M rows according to the orderby column: within these 40M 
> rowsets we're doing 40M x 40M/2 steps.. which is of n^2 time complexity)
> I propose to introduce a cache that keeps track of already calculated range 
> ends so it can be reused in future scans.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (HIVE-21217) Optimize range calculation for PTF

Reply via email to