[ https://issues.apache.org/jira/browse/HIVE-24710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan updated HIVE-24710:
------------------------------------
    Description:
Example query:
{noformat}
select x, y, count(*) over (partition by x order by y range between 86400 PRECEDING and CURRENT ROW) r0 from foo
{noformat}
1. In such cases, there is no need to iterate over the row containers repeatedly (internally this performs O(n^2) operations, which takes forever when the window frame is really large). This can be optimised to reduce CPU burn and IO.
2. BasePartitionEvaluator::calcFunctionValue need not materialize ROW when parameters are empty. This codepath can also be optimised.

  was:
PTFRowContainer could be reading the same block repeatedly for the first block. Default block size is around 25000. For the first 25000 rowIdx, it would read the block repeatedly due to the ("rowIdx < currentReadBlockStartRow ") condition.
{noformat}
public Row getAt(int rowIdx) throws HiveException {
  int blockSize = getBlockSize();
  if ( rowIdx < currentReadBlockStartRow || rowIdx >= currentReadBlockStartRow + blockSize ) {
    readBlock(getBlockNum(rowIdx));
  }
  return getReadBlockRow(rowIdx - currentReadBlockStartRow);
}
{noformat}
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/PTFRowContainer.java#L167

> Optimise PTF iteration for count(*) to reduce CPU and IO cost
> -------------------------------------------------------------
>
>                 Key: HIVE-24710
>                 URL: https://issues.apache.org/jira/browse/HIVE-24710
>             Project: Hive
>          Issue Type: Improvement
>          Components: HiveServer2
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: performance

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
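
To illustrate point 1 of the description, here is a minimal sketch of how count(*) over a "range between N PRECEDING and CURRENT ROW" frame could be computed in a single pass, without re-reading the rows of each frame. It is not Hive code: the class RangeCountSketch and the method countOverRange are hypothetical, and the sketch assumes the order-by values of one partition are already available as an ascending long[] with no NULLs and that N is non-negative.

```java
import java.util.Arrays;

public class RangeCountSketch {

    /**
     * Hypothetical helper (not part of Hive). For each row i of one partition,
     * returns the number of rows whose order-by value lies in
     * [sortedY[i] - preceding, sortedY[i]], including peers of row i.
     * Assumes sortedY is ascending and preceding >= 0.
     */
    static long[] countOverRange(long[] sortedY, long preceding) {
        int n = sortedY.length;
        long[] counts = new long[n];
        int start = 0; // first row with value >= current value - preceding
        int end = 0;   // one past the last peer of the current row
        for (int i = 0; i < n; i++) {
            long lowerBound = sortedY[i] - preceding;
            // Both boundaries only ever move forward, so the whole pass is O(n).
            while (sortedY[start] < lowerBound) {
                start++;
            }
            if (end < i + 1) {
                end = i + 1;
            }
            // In RANGE mode, CURRENT ROW extends to the last row with the same value.
            while (end < n && sortedY[end] == sortedY[i]) {
                end++;
            }
            // Only the boundary positions are needed; no row is materialized.
            counts[i] = end - start;
        }
        return counts;
    }

    public static void main(String[] args) {
        long[] y = {0, 10, 10, 50, 100000};
        // prints [1, 3, 3, 4, 1]
        System.out.println(Arrays.toString(countOverRange(y, 86400)));
    }
}
```

Because count(*) takes no parameters, the result depends only on the frame boundaries, which is why the sketch never touches row contents. How the boundaries would be obtained inside the actual PTF code is left open here.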