[ https://issues.apache.org/jira/browse/HIVE-26671?focusedWorklogId=820955&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-820955 ]
ASF GitHub Bot logged work on HIVE-26671: ----------------------------------------- Author: ASF GitHub Bot Created on: 27/Oct/22 12:01 Start Date: 27/Oct/22 12:01 Worklog Time Spent: 10m Work Description: kasakrisz commented on code in PR #3706: URL: https://github.com/apache/hive/pull/3706#discussion_r1006475709 ########## ql/src/java/org/apache/hadoop/hive/ql/optimizer/topnkey/TopNKeyPushdownProcessor.java: ########## @@ -347,14 +347,12 @@ private void pushdownThroughTopNKey(TopNKeyOperator topNKey) throws SemanticExce if (topNKeyDesc.getKeyColumns().size() == commonKeyPrefix.size()) { // TNK keys are subset of the parent TNK keys pushdownThroughParent(topNKey); - if (topNKey.getChildOperators().get(0).getType() == OperatorType.TOPNKEY) { - LOG.debug("Removing {} since child {} supersedes it", parent.getName(), topNKey.getName()); - topNKey.getParentOperators().get(0).removeChildAndAdoptItsChildren(topNKey); - } + LOG.debug("Removing {} since child {} supersedes it", parent.getName(), topNKey.getName()); + parent.getParentOperators().get(0).removeChildAndAdoptItsChildren(parent); Review Comment: This does not seems to be a valid change because the previous statement is `pushdownThroughParent(topNKey);` Let assume we have the following operator subtree: ``` SEL TNK[P] (key = a, b) TNK[C] (key = a) ``` Two outcome exists depending on how far the `pushdownThroughParent(topNKey);` can push down `TNK[C]`: * `TNK[C]` pushed through `SEL` ``` TNK[C] (key = a) SEL TNK[P] (key = a, b) ``` We stop here since `SEL` should process rows filtered by `TNK[P] (key = a, b)` * `TNK[C]` pushed through only `TNK[P]` ``` SEL TNK[C] (key = a) TNK[P] (key = a, b) ``` In this case `TNK[C]` can be removed since `TNK[P]` filters out more rows and there are no operators between the TNKs which would process the rows filtered out by `TNK[P]`. This is what the condition `(topNKey.getChildOperators().get(0).getType() == OperatorType.TOPNKEY)` checks and if it is true we do the remove: ``` SEL TNK[P] (key = a, b) ``` ########## ql/src/test/results/clientpositive/llap/bucket_groupby.q.out: ########## @@ -1753,9 +1753,9 @@ STAGE PLANS: filterExpr: (ds = '103') (type: boolean) Statistics: Num rows: 500 Data size: 89000 Basic stats: COMPLETE Column stats: COMPLETE Top N Key Operator - sort order: ++ - keys: key (type: string), value (type: string) - null sort order: zz + sort order: + + keys: key (type: string) + null sort order: z Review Comment: I think removing one of the key columns from TNK is too aggressive in this case because both GBYs in the mapper and reducer has the same keys. Issue Time Tracking ------------------- Worklog Id: (was: 820955) Time Spent: 0.5h (was: 20m) > Incorrect results for group by/order by/limit query with 2 aggregates > --------------------------------------------------------------------- > > Key: HIVE-26671 > URL: https://issues.apache.org/jira/browse/HIVE-26671 > Project: Hive > Issue Type: Bug > Components: Operators > Reporter: Steve Carlin > Assignee: Steve Carlin > Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Grabbed this query from the Impala test suite. It is a query run off of > tpcds tables, but it's not really super special. You will need a lot of data > to reproduce this, though. > select > l_orderkey, > min(l_shipdate) as flt, > count(distinct l_partkey) as cnl > from lineitem > group by l_orderkey order by l_orderkey limit 2; > The issue is with the Top N Key operator optimizer. The Top N Key operator is > the first operator after the Table Scan. The sort key is on both the > l_orderkey and l_partkey columns, but this means that the second sort key > might not be forwarded. -- This message was sent by Atlassian Jira (v8.20.10#820010)