[ 
https://issues.apache.org/jira/browse/HIVE-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Chen updated HIVE-2383:
-------------------------------

    Status: Patch Available  (was: Open)

> Incorrect alias filtering for predicate pushdown
> ------------------------------------------------
>
>                 Key: HIVE-2383
>                 URL: https://issues.apache.org/jira/browse/HIVE-2383
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>    Affects Versions: 0.6.0
>            Reporter: Charles Chen
>            Assignee: Charles Chen
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: HIVE-2383v1.patch, HIVE-2383v2.patch, HIVE-2383v5.patch
>
>
> The predicate pushdown optimizer starts at the topmost operators traverses 
> the operator tree, at each stage collecting predicates to be pushed down.  At 
> each operator, ive.ql.ppd.OpProcFactory.DefaultPPD.mergeWithChildrenPred is 
> called, which merges the predicates of the children nodes into the current 
> node.  The predicates are stored in hive.ql.ppd.ExprWalkerInfo.pushdownPreds 
> as a map from the alias a predicate refers to (a predicate may only refer to 
> one alias at a time as only such predicates can be pushed) to a list of such 
> predicates.  Since at each stage the alias the predicate refers to may change 
> (subqueries may change aliases), this is updated for each operator 
> (hive.ql.ppd.ExprWalkerProcFactory.extractPushdownPreds is called which walks 
> the ExprNodeDesc for each predicate). When a JoinOperator is encountered, 
> mergeWithChildrenPred is passed an optional parameter "aliases" which 
> contains a set of aliases that can be pushed per ansi semantics (see 
> hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases).  The part that is 
> incorrect is that aliases are filtered in mergeWithChildrenPred before 
> extractPushdownPreds is called, which associates the predicates with the 
> correct alias in the current operator's context while the filtering should 
> happen after.
> In test case Q2 below, when the predicate "a.bar=3" comes into the 
> JoinOperator, the alias is "a" coming in so it is accepted for pushdown.  
> When brought into the JoinOperator's context, however, since the predicate 
> refers to b.foo in the inner scope, we should not actually accept this for 
> pushdown.
> With the test cases
> {noformat}
> -- Q1: predicate should not be pushed on the right side of a left outer join 
> (this is correct in trunk)
> explain
> SELECT a.foo as foo1, b.foo as foo2, b.bar
> FROM pokes a LEFT OUTER JOIN pokes2 b
> ON a.foo=b.foo
> WHERE b.bar=3;
> -- Q2: predicate should not be pushed on the right side of a left outer join 
> (this is broken in trunk)
> explain
> SELECT * FROM
>     (SELECT a.foo as foo1, b.foo as foo2, b.bar
>     FROM pokes a LEFT OUTER JOIN pokes2 b
>     ON a.foo=b.foo) a
> WHERE a.bar=3;
> -- Q3: predicate should be pushed (this is correct in trunk)
> explain
> SELECT * FROM
>     (SELECT a.foo as foo1, b.foo as foo2, a.bar
>     FROM pokes a JOIN pokes2 b
>     ON a.foo=b.foo) a
> WHERE a.bar=3;
> {noformat}
> The current output is
> {noformat}
> hive> 
>     > -- Q1: predicate should not be pushed on the right side of a left outer 
> join
>     > explain
>     > SELECT a.foo as foo1, b.foo as foo2, b.bar
>     > FROM pokes a LEFT OUTER JOIN pokes2 b
>     > ON a.foo=b.foo
>     > WHERE b.bar=3;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN (TOK_TABREF (TOK_TABNAME pokes) a) 
> (TOK_TABREF (TOK_TABNAME pokes2) b) (= (. (TOK_TABLE_OR_COL a) foo) (. 
> (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR 
> TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) foo) foo1) 
> (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) foo2) (TOK_SELEXPR (. 
> (TOK_TABLE_OR_COL b) bar))) (TOK_WHERE (= (. (TOK_TABLE_OR_COL b) bar) 3))))
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a 
>           TableScan
>             alias: a
>             Reduce Output Operator
>               key expressions:
>                     expr: foo
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: foo
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: foo
>                     type: int
>         b 
>           TableScan
>             alias: b
>             Reduce Output Operator
>               key expressions:
>                     expr: foo
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: foo
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: foo
>                     type: int
>                     expr: bar
>                     type: int
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Left Outer Join0 to 1
>           condition expressions:
>             0 {VALUE._col0}
>             1 {VALUE._col0} {VALUE._col1}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col4, _col5
>           Filter Operator
>             predicate:
>                 expr: (_col5 = 3)
>                 type: boolean
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col4
>                     type: int
>                     expr: _col5
>                     type: int
>               outputColumnNames: _col0, _col1, _col2
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> Time taken: 0.113 seconds
> hive> 
>     > -- Q2: predicate should not be pushed on the right side of a left outer 
> join
>     > explain
>     > SELECT * FROM
>     >     (SELECT a.foo as foo1, b.foo as foo2, b.bar
>     >     FROM pokes a LEFT OUTER JOIN pokes2 b
>     >     ON a.foo=b.foo) a
>     > WHERE a.bar=3;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_LEFTOUTERJOIN 
> (TOK_TABREF (TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (. 
> (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT 
> (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. 
> (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) 
> foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) bar))))) a)) (TOK_INSERT 
> (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
> TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a:a 
>           TableScan
>             alias: a
>             Reduce Output Operator
>               key expressions:
>                     expr: foo
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: foo
>                     type: int
>               tag: 0
>               value expressions:
>                     expr: foo
>                     type: int
>         a:b 
>           TableScan
>             alias: b
>             Filter Operator
>               predicate:
>                   expr: (bar = 3)
>                   type: boolean
>               Reduce Output Operator
>                 key expressions:
>                       expr: foo
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: foo
>                       type: int
>                 tag: 1
>                 value expressions:
>                       expr: foo
>                       type: int
>                       expr: bar
>                       type: int
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Left Outer Join0 to 1
>           condition expressions:
>             0 {VALUE._col0}
>             1 {VALUE._col0} {VALUE._col1}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col4, _col5
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: int
>                   expr: _col4
>                   type: int
>                   expr: _col5
>                   type: int
>             outputColumnNames: _col0, _col1, _col2
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: int
>                     expr: _col2
>                     type: int
>               outputColumnNames: _col0, _col1, _col2
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> Time taken: 0.101 seconds
> hive> 
>     > -- Q3: predicate should be pushed
>     > explain
>     > SELECT * FROM
>     >     (SELECT a.foo as foo1, b.foo as foo2, a.bar
>     >     FROM pokes a JOIN pokes2 b
>     >     ON a.foo=b.foo) a
>     > WHERE a.bar=3;
> OK
> ABSTRACT SYNTAX TREE:
>   (TOK_QUERY (TOK_FROM (TOK_SUBQUERY (TOK_QUERY (TOK_FROM (TOK_JOIN 
> (TOK_TABREF (TOK_TABNAME pokes) a) (TOK_TABREF (TOK_TABNAME pokes2) b) (= (. 
> (TOK_TABLE_OR_COL a) foo) (. (TOK_TABLE_OR_COL b) foo)))) (TOK_INSERT 
> (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. 
> (TOK_TABLE_OR_COL a) foo) foo1) (TOK_SELEXPR (. (TOK_TABLE_OR_COL b) foo) 
> foo2) (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar))))) a)) (TOK_INSERT 
> (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR 
> TOK_ALLCOLREF)) (TOK_WHERE (= (. (TOK_TABLE_OR_COL a) bar) 3))))
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         a:a 
>           TableScan
>             alias: a
>             Filter Operator
>               predicate:
>                   expr: (bar = 3)
>                   type: boolean
>               Reduce Output Operator
>                 key expressions:
>                       expr: foo
>                       type: int
>                 sort order: +
>                 Map-reduce partition columns:
>                       expr: foo
>                       type: int
>                 tag: 0
>                 value expressions:
>                       expr: foo
>                       type: int
>                       expr: bar
>                       type: int
>         a:b 
>           TableScan
>             alias: b
>             Reduce Output Operator
>               key expressions:
>                     expr: foo
>                     type: int
>               sort order: +
>               Map-reduce partition columns:
>                     expr: foo
>                     type: int
>               tag: 1
>               value expressions:
>                     expr: foo
>                     type: int
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Inner Join 0 to 1
>           condition expressions:
>             0 {VALUE._col0} {VALUE._col1}
>             1 {VALUE._col0}
>           handleSkewJoin: false
>           outputColumnNames: _col0, _col1, _col4
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: int
>                   expr: _col4
>                   type: int
>                   expr: _col1
>                   type: int
>             outputColumnNames: _col0, _col1, _col2
>             Select Operator
>               expressions:
>                     expr: _col0
>                     type: int
>                     expr: _col1
>                     type: int
>                     expr: _col2
>                     type: int
>               outputColumnNames: _col0, _col1, _col2
>               File Output Operator
>                 compressed: false
>                 GlobalTableId: 0
>                 table:
>                     input format: org.apache.hadoop.mapred.TextInputFormat
>                     output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> {noformat}
> Note that Q2 is incorrect because the predicate "bar = 3" is incorrectly 
> pushed to the right side of the left outer join (Q1 and Q3 are correct).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira


Reply via email to