[ https://issues.apache.org/jira/browse/HIVE-26671?focusedWorklogId=820962&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-820962 ]
ASF GitHub Bot logged work on HIVE-26671:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 27/Oct/22 12:16
            Start Date: 27/Oct/22 12:16
    Worklog Time Spent: 10m
      Work Description: kasakrisz commented on PR #3706:
URL: https://github.com/apache/hive/pull/3706#issuecomment-1293439464

Thanks @scarlin-cloudera for investigating this issue. This patch is a possible solution.
I would like to share another approach:

IIUC the issue is caused by the extra key column that the distinct adds to the RS located in the mapper.
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/parse/SemanticAnalyzer.java#L5753

Without TNK the plan of the query mentioned in the jira looks like this:
```
Map
TS
SEL
GBY (l_orderkey, l_partkey)
RS (l_orderkey, l_partkey)
Reduce
GBY (KEY._col0)
RS (col0)
...
```
A TNK is created on top of each RS, with its keys coming from the corresponding RS. Both TNKs are then pushed down to the TS, and at TNK merging the one with 2 keys is accepted.

How about skipping TNK creation in `TopNKeyProcessor` if the RS has keys defined because of distinct
https://github.com/apache/hive/blob/16ce75578c265d0aaba7eedafb65658fc569f75e/ql/src/java/org/apache/hadoop/hive/ql/plan/ReduceSinkDesc.java#L424-L426
and keeping the existing behavior when no distinct aggregates are present?

I would expect that only TNK (l_orderkey) remains.

What do you think?


Issue Time Tracking
-------------------

    Worklog Id:     (was: 820962)
    Time Spent: 40m  (was: 0.5h)

> Incorrect results for group by/order by/limit query with 2 aggregates
> ---------------------------------------------------------------------
>
>                 Key: HIVE-26671
>                 URL: https://issues.apache.org/jira/browse/HIVE-26671
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>            Reporter: Steve Carlin
>            Assignee: Steve Carlin
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Grabbed this query from the Impala test suite.
> It is a query run off of tpcds tables, but it's not really super special.
> You will need a lot of data to reproduce this, though.
> select
>   l_orderkey,
>   min(l_shipdate) as flt,
>   count(distinct l_partkey) as cnl
> from lineitem
> group by l_orderkey order by l_orderkey limit 2;
> The issue is with the Top N Key operator optimizer. The Top N Key operator is
> the first operator after the Table Scan. The sort key is on both the
> l_orderkey and l_partkey columns, but this means that the second sort key
> might not be forwarded.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
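
To make the effect described in the issue easier to picture, here is a toy model in Python. It is only a sketch under stated assumptions, not Hive's operator code: `top_n_key_filter` and `distinct_parts_per_order` are hypothetical helpers, the five rows are invented, and the reading that a two-column key lets the filter drop rows still needed by the first N groups is one plausible interpretation of the report.

```python
import bisect


def top_n_key_filter(rows, key, n):
    """Toy stand-in for a Top N Key filter (hypothetical, not Hive's code).

    Forwards a row only if its key is among the n smallest distinct keys
    seen so far in the stream.
    """
    top = []        # sorted list of at most n distinct keys
    forwarded = []
    for row in rows:
        k = key(row)
        if k in top:
            forwarded.append(row)
        elif len(top) < n:
            bisect.insort(top, k)
            forwarded.append(row)
        elif k < top[-1]:
            bisect.insort(top, k)
            top.pop()               # evict the largest key to stay at n
            forwarded.append(row)
        # otherwise the row is dropped
    return forwarded


def distinct_parts_per_order(rows, limit):
    """group by l_orderkey, count(distinct l_partkey), order by l_orderkey, limit."""
    groups = {}
    for orderkey, partkey in rows:
        groups.setdefault(orderkey, set()).add(partkey)
    return [(o, len(groups[o])) for o in sorted(groups)[:limit]]


# Invented (l_orderkey, l_partkey) rows, for illustration only.
rows = [(1, 10), (1, 20), (1, 30), (2, 10), (3, 10)]

# Filter keyed on l_orderkey only: every row of the first two orders survives.
narrow = top_n_key_filter(rows, key=lambda r: (r[0],), n=2)

# Filter keyed on (l_orderkey, l_partkey): only two key pairs survive.
wide = top_n_key_filter(rows, key=lambda r: r, n=2)

print(distinct_parts_per_order(narrow, 2))
print(distinct_parts_per_order(wide, 2))
```

In this toy run the filter keyed on `l_orderkey` alone leaves both groups intact, while the filter keyed on both columns loses the second group and undercounts the first. That matches the expectation in the comment that only TNK (l_orderkey) should remain.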