Invalid predicate pushdown from incorrect column expression map for select operator generated by GROUP BY operation
-------------------------------------------------------------------------------------------------------------------
                 Key: HIVE-2382
                 URL: https://issues.apache.org/jira/browse/HIVE-2382
             Project: Hive
          Issue Type: Bug
          Components: Query Processor
    Affects Versions: 0.6.0
            Reporter: Charles Chen
            Assignee: Charles Chen
            Priority: Critical

When a GROUP BY is specified, a select operator is added before the GROUP BY in SemanticAnalyzer.insertSelectAllPlanForGroupBy. Currently, the column expression map for this operator is set to the column expression map of the parent operator. This behavior is incorrect because, for example, the parent operator could rearrange the order of the columns (_col0 => _col0, _col1 => _col2, _col2 => _col1) and the new operator should not repeat this.

The predicate pushdown optimization uses the column expression map to track which columns a filter expression refers to at different operators. This results in a filter on incorrect columns.

Here is a simple case of this going wrong. Using

{noformat}
create table invites (id int, foo int, bar int)
{noformat}

executing the query

{noformat}
explain select * from
  (select foo, bar from
    (select bar, foo from invites c union all select bar, foo from invites d) b) a
group by bar, foo having bar=1;
{noformat}

results in

{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        a-subquery1:b-subquery1:c
          TableScan
            alias: c
            Filter Operator
              predicate:
                expr: (foo = 1)
                type: boolean
              Select Operator
                expressions:
                  expr: bar
                  type: int
                  expr: foo
                  type: int
                outputColumnNames: _col0, _col1
                Union
                  Select Operator
                    expressions:
                      expr: _col1
                      type: int
                      expr: _col0
                      type: int
                    outputColumnNames: _col0, _col1
                    Select Operator
                      expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: int
                      outputColumnNames: _col0, _col1
                      Group By Operator
                        bucketGroup: false
                        keys:
                          expr: _col1
                          type: int
                          expr: _col0
                          type: int
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                          sort order: ++
                          Map-reduce partition columns:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                          tag: -1
        a-subquery2:b-subquery2:d
          TableScan
            alias: d
            Filter Operator
              predicate:
                expr: (foo = 1)
                type: boolean
              Select Operator
                expressions:
                  expr: bar
                  type: int
                  expr: foo
                  type: int
                outputColumnNames: _col0, _col1
                Union
                  Select Operator
                    expressions:
                      expr: _col1
                      type: int
                      expr: _col0
                      type: int
                    outputColumnNames: _col0, _col1
                    Select Operator
                      expressions:
                        expr: _col0
                        type: int
                        expr: _col1
                        type: int
                      outputColumnNames: _col0, _col1
                      Group By Operator
                        bucketGroup: false
                        keys:
                          expr: _col1
                          type: int
                          expr: _col0
                          type: int
                        mode: hash
                        outputColumnNames: _col0, _col1
                        Reduce Output Operator
                          key expressions:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                          sort order: ++
                          Map-reduce partition columns:
                            expr: _col0
                            type: int
                            expr: _col1
                            type: int
                          tag: -1
      Reduce Operator Tree:
        Group By Operator
          bucketGroup: false
          keys:
            expr: KEY._col0
            type: int
            expr: KEY._col1
            type: int
          mode: mergepartial
          outputColumnNames: _col0, _col1
          Select Operator
            expressions:
              expr: _col0
              type: int
              expr: _col1
              type: int
            outputColumnNames: _col0, _col1
            File Output Operator
              compressed: false
              GlobalTableId: 0
              table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat

  Stage: Stage-0
    Fetch Operator
      limit: -1
{noformat}

Note that the filter is now "foo = 1", while the correct behavior is to have "bar = 1".
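
To make the column tracking concrete, here is a small toy model of the plan above. It is not Hive code: the dictionaries stand in for each operator's column expression map (output column => parent column), the names push_down and resolve are hypothetical, the Union is assumed to pass columns through unchanged, and the identity map is only one plausible reading of "the new operator should not repeat this", not the actual patch.

```python
# Toy model (not Hive code): each dict maps an operator's output column
# to the column of its parent that the output is computed from.

def push_down(column, col_expr_maps):
    """Translate a column name through a chain of operators,
    starting at the operator nearest the filter and ending at the table scan."""
    for col_expr_map in col_expr_maps:
        column = col_expr_map[column]
    return column

# Maps read off the explain plan above.
inner_select = {"_col0": "bar", "_col1": "foo"}      # select bar, foo from invites
union = {"_col0": "_col0", "_col1": "_col1"}         # assumed pass-through
outer_select = {"_col0": "_col1", "_col1": "_col0"}  # select foo, bar from (...) b
group_by = {"_col0": "_col1", "_col1": "_col0"}      # keys: _col1, _col0

# Behavior described in this issue: the inserted select reuses its parent's map.
buggy_inserted_select = dict(outer_select)
# Assumed fix: the inserted select maps every column to itself.
fixed_inserted_select = {col: col for col in outer_select}

def resolve(inserted_select):
    """Resolve the HAVING column (bar is _col0 in the GROUP BY output)
    down to a column of the invites table."""
    chain = [group_by, inserted_select, outer_select, union, inner_select]
    return push_down("_col0", chain)

print(resolve(buggy_inserted_select))  # foo
print(resolve(fixed_inserted_select))  # bar
```

In this model the copied map swaps the two columns a second time, which is why the pushed-down filter lands on foo instead of bar.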