Igor Dvorzhak created HIVE-22894:
------------------------------------

             Summary: Filter on subquery with GROUP BY returns wrong column
                 Key: HIVE-22894
                 URL: https://issues.apache.org/jira/browse/HIVE-22894
             Project: Hive
          Issue Type: Bug
          Components: CBO
    Affects Versions: 2.3.6
            Reporter: Igor Dvorzhak


Reproduction steps:
{code:java}
$ echo -e "02/11/20,C_A,C_A_B\n02/11/20,C_A,C_A_C" | hadoop fs -put - 
/user/hive/warehouse/test/data.csv

$ hive

> CREATE TABLE test(date_str STRING, category STRING, subcategory STRING) ROW 
> FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE;
OK
Time taken: 0.877 seconds

> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM 
> test HERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t 
> WHERE t.category='C_A';
OK
C_A_B
C_A_C
Time taken: 9.108 seconds, Fetched: 2 row(s)

> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory 
> FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) 
> AS t WHERE t.category='C_A';
OK
Plan optimized by CBO.Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 2
      File Output Operator [FS_12]
        Group By Operator [GBY_10] (rows=1 width=38)
          Output:["_col0"],keys:_col0
          Group By Operator [GBY_5] (rows=1 width=38)
            Output:["_col0"],keys:KEY._col0
          <-Map 1 [SIMPLE_EDGE]
            SHUFFLE [RS_4]
              PartitionCols:_col0
              Group By Operator [GBY_3] (rows=1 width=38)
                Output:["_col0"],keys:subcategory
                Select Operator [SEL_2] (rows=1 width=38)
                  Output:["subcategory"]
                  Filter Operator [FIL_13] (rows=1 width=38)
                    predicate:((date_str = '02/11/20') and (category = 'C_A'))
                    TableScan [TS_0] (rows=1 width=38)
                      
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.21 seconds, Fetched: 27 row(s)
{code}
 

It works as expected with disabled CBO:

{code:java}
> SET hive.cbo.enable=false;

> SELECT DISTINCT category FROM (SELECT date_str, category, subcategory FROM 
> test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) AS t 
> WHERE t.category='C_A';
OK
C_A
Time taken: 13.948 seconds, Fetched: 1 row(s)

> EXPLAIN SELECT DISTINCT category FROM (SELECT date_str, category, subcategory 
> FROM test WHERE date_str='02/11/20' GROUP BY date_str, category, subcategory) 
> AS t WHERE t.category='C_A';
OK
Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)Stage-0
  Fetch Operator
    limit:-1
    Stage-1
      Reducer 3
      File Output Operator [FS_13]
        Select Operator [SEL_12] (rows=1 width=38)
          Output:["_col0"]
          Group By Operator [GBY_11] (rows=1 width=38)
            Output:["_col0"],keys:'C_A'
          <-Reducer 2 [SIMPLE_EDGE]
            SHUFFLE [RS_10]
              PartitionCols:'C_A'
              Group By Operator [GBY_9] (rows=1 width=38)
                Output:["_col0"],keys:'C_A'
                Select Operator [SEL_6] (rows=1 width=38)
                  Group By Operator [GBY_5] (rows=1 width=38)
                    Output:["_col0","_col1","_col2"],keys:'02/11/20', 'C_A', 
KEY._col2
                  <-Map 1 [SIMPLE_EDGE]
                    SHUFFLE [RS_4]
                      PartitionCols:'02/11/20', 'C_A', _col2
                      Group By Operator [GBY_3] (rows=1 width=38)
                        Output:["_col0","_col1","_col2"],keys:'02/11/20', 
'C_A', subcategory
                        Select Operator [SEL_2] (rows=1 width=38)
                          Output:["subcategory"]
                          Filter Operator [FIL_14] (rows=1 width=38)
                            predicate:((date_str = '02/11/20') and (category = 
'C_A'))
                            TableScan [TS_0] (rows=1 width=38)
                              
default@test,test,Tbl:COMPLETE,Col:NONE,Output:["date_str","category","subcategory"]
Time taken: 0.065 seconds, Fetched: 34 row(s){code}
 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to