Igor Dvorzhak created HIVE-22894: ------------------------------------ Summary: Filter on subquery with GROUP BY returns wrong column Key: HIVE-22894 URL: https://issues.apache.org/jira/browse/HIVE-22894 Project: Hive Issue Type: Bug Components: CBO Affects Versions: 2.3.6 Reporter: Igor Dvorzhak
Reproduction steps: {code:java} $ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - /user/hive/warehouse/test/data.csv $ hive > CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW > FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE; OK Time taken: 0.877 seconds > SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM > test HERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t > WHERE t.category='C_A'; OK C_A_B C_A_C Time taken: 9.108 seconds, Fetched: 2 row(s) > EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory > FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) > AS t WHERE t.category='C_A'; OK Plan optimized by CBO.Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 File Output Operator [FS_12] Group By Operator [GBY_10] (rows=1 width=38) Output:["_col0"],keys:_col0 Group By Operator [GBY_5] (rows=1 width=38) Output:["_col0"],keys:KEY._col0 <-Map 1 [SIMPLE_EDGE] SHUFFLE [RS_4] PartitionCols:_col0 Group By Operator [GBY_3] (rows=1 width=38) Output:["_col0"],keys:subcategory Select Operator [SEL_2] (rows=1 width=38) Output:["subcategory"] Filter Operator [FIL_13] (rows=1 width=38) predicate:((date_str = '02/11/20') and (category = 'C_A')) TableScan [TS_0] (rows=1 width=38) default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"] Time taken: 0.21 seconds, Fetched: 27 row(s) {code} It works as expected with disabled CBO: {code:java} > SET hive.cbo.enable=false; > SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM > test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t > WHERE t.category='C_A'; OK C_A Time taken: 13.948 seconds, Fetched: 1 row(s) > EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory > FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) > AS t WHERE t.category='C_A'; OK Vertex dependency in root stage Reducer 2 <- Map 1 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 3 File Output Operator [FS_13] Select Operator [SEL_12] (rows=1 width=38) Output:["_col0"] Group By Operator [GBY_11] (rows=1 width=38) Output:["_col0"],keys:'C_A' <-Reducer 2 [SIMPLE_EDGE] SHUFFLE [RS_10] PartitionCols:'C_A' Group By Operator [GBY_9] (rows=1 width=38) Output:["_col0"],keys:'C_A' Select Operator [SEL_6] (rows=1 width=38) Group By Operator [GBY_5] (rows=1 width=38) Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', KEY._col2 <-Map 1 [SIMPLE_EDGE] SHUFFLE [RS_4] PartitionCols:'02/11/20', 'C_A', _col2 Group By Operator [GBY_3] (rows=1 width=38) Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', subcategory Select Operator [SEL_2] (rows=1 width=38) Output:["subcategory"] Filter Operator [FIL_14] (rows=1 width=38) predicate:((date_str = '02/11/20') and (category = 'C_A')) TableScan [TS_0] (rows=1 width=38) default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"] Time taken: 0.065 seconds, Fetched: 34 row(s){code} -- This message was sent by Atlassian Jira (v8.3.4#803005)