[
https://issues.apache.org/jira/browse/HIVE-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Charles Chen updated HIVE-2337:
-------------------------------
Attachment: (was: HIVE-2337v2.patch)
> Predicate pushdown erroneously conservative with outer joins
> ------------------------------------------------------------
>
> Key: HIVE-2337
> URL: https://issues.apache.org/jira/browse/HIVE-2337
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: Charles Chen
> Assignee: Charles Chen
> Attachments: HIVE-2337v1.patch, HIVE-2337v2.patch
>
>
> The predicate pushdown filter is not applying left associativity of joins
> correctly in determining possible aliases for pushing predicates.
> In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for
> pushing aliases is specified as:
> {noformat}
> /**
> * Figures out the aliases for whom it is safe to push predicates based on
> * ANSI SQL semantics For inner join, all predicates for all aliases can
> be
> * pushed For full outer join, none of the predicates can be pushed as
> that
> * would limit the number of rows for join For left outer join, all the
> * predicates on the left side aliases can be pushed up For right outer
> * join, all the predicates on the right side aliases can be pushed up
> Joins
> * chain containing both left and right outer joins are treated as full
> * outer join. [...]
> *
> * @param op
> * Join Operator
> * @param rr
> * Row resolver
> * @return set of qualified aliases
> */
> {noformat}
> Since hive joins are left associative, something like "a RIGHT OUTER JOIN b
> LEFT OUTER JOIN cĀ INNER JOIN d" should be interpreted as "((a RIGHT OUTER
> JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins
> with both left and right outer joins can have aliases that can be pushed.
> Here, aliases b and d are eligible to be pushed up while the current criteria
> provide that none are eligible.
> Using:
> {noformat}
> create table t1 (id int, key string, value string);
> create table t2 (id int, key string, value string);
> create table t3 (id int, key string, value string);
> create table t4 (id int, key string, value string);
> {noformat}
> For example, the query
> {noformat}
> explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on
> t2.id=t3.id where t3.id=20;
> {noformat}
> currently gives
> {noformat}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t1
> TableScan
> alias: t1
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 0
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t2
> TableScan
> alias: t2
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 1
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t3
> TableScan
> alias: t3
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 2
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> Reduce Operator Tree:
> Join Operator
> condition map:
> Outer Join 0 to 1
> Inner Join 1 to 2
> condition expressions:
> 0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> handleSkewJoin: false
> outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7,
> _col10, _col11, _col12
> Filter Operator
> predicate:
> expr: (_col10 = 20)
> type: boolean
> Select Operator
> expressions:
> expr: _col0
> type: int
> expr: _col1
> type: string
> expr: _col2
> type: string
> expr: _col5
> type: int
> expr: _col6
> type: string
> expr: _col7
> type: string
> expr: _col10
> type: int
> expr: _col11
> type: string
> expr: _col12
> type: string
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
> _col6, _col7, _col8
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Stage: Stage-0
> Fetch Operator
> limit: -1
> {noformat}
> while the correct behavior is to push the filter "t3.id=20" down:
> {noformat}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t1
> TableScan
> alias: t1
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 0
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t2
> TableScan
> alias: t2
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 1
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t3
> TableScan
> alias: t3
> Filter Operator
> predicate:
> expr: (id = 20)
> type: boolean
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 2
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> Reduce Operator Tree:
> Join Operator
> condition map:
> Outer Join 0 to 1
> Inner Join 1 to 2
> condition expressions:
> 0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> handleSkewJoin: false
> outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7,
> _col10, _col11, _col12
> Select Operator
> expressions:
> expr: _col0
> type: int
> expr: _col1
> type: string
> expr: _col2
> type: string
> expr: _col5
> type: int
> expr: _col6
> type: string
> expr: _col7
> type: string
> expr: _col10
> type: int
> expr: _col11
> type: string
> expr: _col12
> type: string
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
> _col6, _col7, _col8
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Stage: Stage-0
> Fetch Operator
> limit: -1
> {noformat}
> The current behavior is actually stranger than this: for a left outer join
> (similarly for a right outer join), hive finds the leftmost alias referred to
> in the *predicates* of left outer joins and rejects any alias to the right of
> it for pushdown. So in this query the filter "t2.id=20" pushed down:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on
> (t2.id=t3.id) where t2.id=20;
> {noformat}
> while it isn't here:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on
> (t1.id=t3.id) where t2.id=20;
> {noformat}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira