Incorrect alias filtering for predicates pushdown -------------------------------------------------
Key: HIVE-2383 URL: https://issues.apache.org/jira/browse/HIVE-2383 Project: Hive Issue Type: Bug Components: Query Processor Affects Versions: 0.6.0 Reporter: Charles Chen Assignee: Charles Chen Priority: Critical The predicate pushdown optimizer starts at the topmost operators traverses the operator tree, at each stage collecting predicates to be pushed down. At each operator, ive.ql.ppd.OpProcFactory.DefaultPPD.mergeWithChildrenPred is called, which merges the predicates of the children nodes into the current node. The predicates are stored in hive.ql.ppd.ExprWalkerInfo.pushdownPreds as a map from the alias a predicate refers to (a predicate may only refer to one alias at a time as only such predicates can be pushed) to a list of such predicates. Since at each stage the alias the predicate refers to may change (subqueries may change aliases), this is updated for each operator (hive.ql.ppd.ExprWalkerProcFactory.extractPushdownPreds is called which walks the ExprNodeDesc for each predicate). When a JoinOperator is encountered, mergeWithChildrenPred is passed an optional parameter "aliases" which contains a set of aliases that can be pushed per ansi semantics (see hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases). The part that is incorrect is that aliases are filtered in mergeWithChildrenPred before extractPushdownPreds is called, which associates the predicates with the correct alias in the current operator's context while the filtering should happen after. In test case Q2 below, when the predicate "a.bar=3" comes into the JoinOperator, the alias is "a" coming in so it is accepted for pushdown. When brought into the JoinOperator's context, however, since the predicate refers to b.foo in the inner scope, we should not actually accept this for pushdown. With the test cases {noformat} -- Q1: predicate should not be pushed on the right side of a left outer join (this is correct in trunk) explain SELECT a.foo as foo1, b.foo as foo2, b.bar FROM pokes a LEFT OUTER JOIN pokes2 b ON a.foo=b.foo WHERE b.bar=3; -- Q2: predicate should not be pushed on the right side of a left outer join (this is broken in trunk) explain SELECT * FROM (SELECT a.foo as foo1, b.foo as foo2, b.bar FROM pokes a LEFT OUTER JOIN pokes2 b ON a.foo=b.foo) a WHERE a.bar=3; -- Q3: predicate should be pushed (this is correct in trunk) explain SELECT * FROM (SELECT a.foo as foo1, b.foo as foo2, a.bar FROM pokes a JOIN pokes2 b ON a.foo=b.foo) a WHERE a.bar=3; {noformat} The current output is {noformat} hive> > -- Q1: predicate should not be pushed on the right side of a left outer join > explain > SELECT a.foo as foo1, b.foo as foo2, b.bar > FROM pokes a LEFT OUTER JOIN pokes2 b > ON a.foo=b.foo > WHERE b.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int expr: bar type: int Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col4, _col5 Filter Operator predicate: expr: (_col5 = 3) type: boolean Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col5 type: int outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 Time taken: 0.113 seconds hive> > -- Q2: predicate should not be pushed on the right side of a left outer join > explain > SELECT * FROM > (SELECT a.foo as foo1, b.foo as foo2, b.bar > FROM pokes a LEFT OUTER JOIN pokes2 b > ON a.foo=b.foo) a > WHERE a.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int a:b TableScan alias: b Filter Operator predicate: expr: (bar = 3) type: boolean Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int expr: bar type: int Reduce Operator Tree: Join Operator condition map: Left Outer Join0 to 1 condition expressions: 0 {VALUE._col0} 1 {VALUE._col0} {VALUE._col1} handleSkewJoin: false outputColumnNames: _col0, _col4, _col5 Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col5 type: int outputColumnNames: _col0, _col1, _col2 Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: int outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 Time taken: 0.101 seconds hive> > -- Q3: predicate should be pushed > explain > SELECT * FROM > (SELECT a.foo as foo1, b.foo as foo2, a.bar > FROM pokes a JOIN pokes2 b > ON a.foo=b.foo) a > WHERE a.bar=3; OK ABSTRACT SYNTAX TREE: (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (. (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3)))) STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 is a root stage STAGE PLANS: Stage: Stage-1 Map Reduce Alias -> Map Operator Tree: a:a TableScan alias: a Filter Operator predicate: expr: (bar = 3) type: boolean Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 0 value expressions: expr: foo type: int expr: bar type: int a:b TableScan alias: b Reduce Output Operator key expressions: expr: foo type: int sort order: + Map-reduce partition columns: expr: foo type: int tag: 1 value expressions: expr: foo type: int Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 condition expressions: 0 {VALUE._col0} {VALUE._col1} 1 {VALUE._col0} handleSkewJoin: false outputColumnNames: _col0, _col1, _col4 Select Operator expressions: expr: _col0 type: int expr: _col4 type: int expr: _col1 type: int outputColumnNames: _col0, _col1, _col2 Select Operator expressions: expr: _col0 type: int expr: _col1 type: int expr: _col2 type: int outputColumnNames: _col0, _col1, _col2 File Output Operator compressed: false GlobalTableId: 0 table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat Stage: Stage-0 Fetch Operator limit: -1 {noformat} Note that Q2 is incorrect because the predicate "bar = 3" is incorrectly pushed to the right side of the left outer join (Q1 and Q3 are correct). -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira