[jira] [Updated] (HIVE-24710) Optimise PTF iteration for count(*) to reduce CPU and IO cost

Rajesh Balamohan (Jira) Wed, 03 Feb 2021 01:43:06 -0800


     [ 
https://issues.apache.org/jira/browse/HIVE-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Rajesh Balamohan updated HIVE-24710:
------------------------------------
    Description: 
Example query:

{noformat}
select x, y, count(*) over (partition by x order by y range between 86400 
PRECEDING and CURRENT ROW) r0 from foo
{noformat}

1. In such cases there is no need to iterate over the row containers 
repeatedly. Internally the evaluation performs O(n^2) operations, which takes 
extremely long when the window frame is very large. This can be optimised to 
reduce CPU usage and IO.
2. BasePartitionEvaluator::calcFunctionValue need not materialize the row when 
the function has no parameters, as with count(*). This code path can also be 
optimised.
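
To illustrate point 1, here is a minimal sketch (not Hive's implementation) of how count(*) over a "range between N PRECEDING and CURRENT ROW" frame can be computed in one pass. It assumes the y values of a single partition are already sorted ascending, contain no NULLs, and that N is non-negative. The class and method names are hypothetical. Because both frame boundaries only ever move forward, the count is the difference of two indexes and no row inside the frame has to be re-read or materialized.

```java
import java.util.Arrays;

public class RangeCountSketch {

    /**
     * Hypothetical helper: count(*) over
     * (order by y range between `preceding` PRECEDING and CURRENT ROW)
     * for one partition whose y values are sorted ascending.
     * Runs in O(n) because `start` and `end` never move backwards.
     */
    static long[] rangeCounts(long[] sortedY, long preceding) {
        int n = sortedY.length;
        long[] counts = new long[n];
        int start = 0; // first row with y >= current y - preceding
        int end = 0;   // one past the last peer (same y) of the current row
        for (int i = 0; i < n; i++) {
            long lowerBound = sortedY[i] - preceding;
            // Advance the frame start past rows that fell out of range.
            while (sortedY[start] < lowerBound) {
                start++;
            }
            // RANGE ... CURRENT ROW includes all rows with the same y value.
            if (end < i + 1) {
                end = i + 1;
            }
            while (end < n && sortedY[end] == sortedY[i]) {
                end++;
            }
            counts[i] = end - start;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] y = {0, 10, 10, 50, 100000, 100100};
        // prints [1, 3, 3, 4, 1, 2]
        System.out.println(Arrays.toString(rangeCounts(y, 86400)));
    }
}
```

The two rows with y = 10 are peers, so both report a count of 3. The row at y = 100000 reports 1 because every earlier row is more than 86400 below it.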

  was:
PTFRowContainer may read the first block repeatedly. The default block size is 
around 25000 rows. For the first 25000 values of rowIdx, it would re-read the 
block because of the ("rowIdx < currentReadBlockStartRow") condition.

{noformat}
  public Row getAt(int rowIdx) throws HiveException {
    int blockSize = getBlockSize();
    if (rowIdx < currentReadBlockStartRow
        || rowIdx >= currentReadBlockStartRow + blockSize) {
      readBlock(getBlockNum(rowIdx));
    }
    return getReadBlockRow(rowIdx - currentReadBlockStartRow);
  }
{noformat} 

https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java#L167

 


> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -------------------------------------------------------------
>
>                 Key: HIVE-24710
>                 URL: https://issues.apache.org/jira/browse/HIVE-24710
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance


