[
https://issues.apache.org/jira/browse/HIVE-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078906#comment-13078906
]
Charles Chen commented on HIVE-2337:
------------------------------------
Using:
{noformat}
create table t1 (id int, key string, value string);
create table t2 (id int, key string, value string);
create table t3 (id int, key string, value string);
create table t4 (id int, key string, value string);
{noformat}
For example, the query
{noformat}
explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on
t2.id=t3.id where t3.id=20;
{noformat}
currently gives
{noformat}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t1
TableScan
alias: t1
Reduce Output Operator
key expressions:
expr: id
type: int
sort order: +
Map-reduce partition columns:
expr: id
type: int
tag: 0
value expressions:
expr: id
type: int
expr: key
type: string
expr: value
type: string
t2
TableScan
alias: t2
Reduce Output Operator
key expressions:
expr: id
type: int
sort order: +
Map-reduce partition columns:
expr: id
type: int
tag: 1
value expressions:
expr: id
type: int
expr: key
type: string
expr: value
type: string
t3
TableScan
alias: t3
Reduce Output Operator
key expressions:
expr: id
type: int
sort order: +
Map-reduce partition columns:
expr: id
type: int
tag: 2
value expressions:
expr: id
type: int
expr: key
type: string
expr: value
type: string
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
Inner Join 1 to 2
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10,
_col11, _col12
Filter Operator
predicate:
expr: (_col10 = 20)
type: boolean
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: string
expr: _col2
type: string
expr: _col5
type: int
expr: _col6
type: string
expr: _col7
type: string
expr: _col10
type: int
expr: _col11
type: string
expr: _col12
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
_col6, _col7, _col8
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
while the correct behavior is to push the filter "t3.id=20" down:
{noformat}
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t1
TableScan
alias: t1
Reduce Output Operator
key expressions:
expr: id
type: int
sort order: +
Map-reduce partition columns:
expr: id
type: int
tag: 0
value expressions:
expr: id
type: int
expr: key
type: string
expr: value
type: string
t2
TableScan
alias: t2
Reduce Output Operator
key expressions:
expr: id
type: int
sort order: +
Map-reduce partition columns:
expr: id
type: int
tag: 1
value expressions:
expr: id
type: int
expr: key
type: string
expr: value
type: string
t3
TableScan
alias: t3
Filter Operator
predicate:
expr: (id = 20)
type: boolean
Reduce Output Operator
key expressions:
expr: id
type: int
sort order: +
Map-reduce partition columns:
expr: id
type: int
tag: 2
value expressions:
expr: id
type: int
expr: key
type: string
expr: value
type: string
Reduce Operator Tree:
Join Operator
condition map:
Outer Join 0 to 1
Inner Join 1 to 2
condition expressions:
0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7, _col10,
_col11, _col12
Select Operator
expressions:
expr: _col0
type: int
expr: _col1
type: string
expr: _col2
type: string
expr: _col5
type: int
expr: _col6
type: string
expr: _col7
type: string
expr: _col10
type: int
expr: _col11
type: string
expr: _col12
type: string
outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6,
_col7, _col8
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Stage: Stage-0
Fetch Operator
limit: -1
{noformat}
The current behavior is actually stranger than this: for a left outer join
(similarly for a right outer join), hive finds the leftmost alias referred to
in the *predicates* of left outer joins and rejects any alias to the right of
it for pushdown. So in this query the filter "t2.id=20" pushed down:
{noformat}
explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on
(t2.id=t3.id) where t2.id=20;
{noformat}
while it isn't here:
{noformat}
explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on
(t1.id=t3.id) where t2.id=20;
{noformat}
> Predicate pushdown erroneously conservative with outer joins
> ------------------------------------------------------------
>
> Key: HIVE-2337
> URL: https://issues.apache.org/jira/browse/HIVE-2337
> Project: Hive
> Issue Type: Bug
> Components: Query Processor
> Reporter: Charles Chen
> Assignee: Charles Chen
> Attachments: HIVE-2337v1.patch
>
>
> The predicate pushdown filter is not applying left associativity of joins
> correctly in determining possible aliases for pushing predicates.
> In hive.ql.ppd.OpProcFactory.JoinPPD.getQualifiedAliases, the criteria for
> pushing aliases is specified as:
> {noformat}
> /**
> * Figures out the aliases for whom it is safe to push predicates based on
> * ANSI SQL semantics For inner join, all predicates for all aliases can
> be
> * pushed For full outer join, none of the predicates can be pushed as
> that
> * would limit the number of rows for join For left outer join, all the
> * predicates on the left side aliases can be pushed up For right outer
> * join, all the predicates on the right side aliases can be pushed up
> Joins
> * chain containing both left and right outer joins are treated as full
> * outer join. [...]
> *
> * @param op
> * Join Operator
> * @param rr
> * Row resolver
> * @return set of qualified aliases
> */
> {noformat}
> Since hive joins are left associative, something like "a RIGHT OUTER JOIN b
> LEFT OUTER JOIN cĀ INNER JOIN d" should be interpreted as "((a RIGHT OUTER
> JOIN b) LEFT OUTER JOIN c) INNER JOIN d", so there would be cases where joins
> with both left and right outer joins can have aliases that can be pushed.
> Here, aliases b and d are eligible to be pushed up while the current criteria
> provide that none are eligible.
> Using:
> {noformat}
> create table t1 (id int, key string, value string);
> create table t2 (id int, key string, value string);
> create table t3 (id int, key string, value string);
> create table t4 (id int, key string, value string);
> {noformat}
> For example, the query
> {noformat}
> explain select * from t1 full outer join t2 on t1.id=t2.id join t3 on
> t2.id=t3.id where t3.id=20;
> {noformat}
> currently gives
> {noformat}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t1
> TableScan
> alias: t1
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 0
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t2
> TableScan
> alias: t2
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 1
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t3
> TableScan
> alias: t3
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 2
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> Reduce Operator Tree:
> Join Operator
> condition map:
> Outer Join 0 to 1
> Inner Join 1 to 2
> condition expressions:
> 0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> handleSkewJoin: false
> outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7,
> _col10, _col11, _col12
> Filter Operator
> predicate:
> expr: (_col10 = 20)
> type: boolean
> Select Operator
> expressions:
> expr: _col0
> type: int
> expr: _col1
> type: string
> expr: _col2
> type: string
> expr: _col5
> type: int
> expr: _col6
> type: string
> expr: _col7
> type: string
> expr: _col10
> type: int
> expr: _col11
> type: string
> expr: _col12
> type: string
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
> _col6, _col7, _col8
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Stage: Stage-0
> Fetch Operator
> limit: -1
> {noformat}
> while the correct behavior is to push the filter "t3.id=20" down:
> {noformat}
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 is a root stage
> STAGE PLANS:
> Stage: Stage-1
> Map Reduce
> Alias -> Map Operator Tree:
> t1
> TableScan
> alias: t1
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 0
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t2
> TableScan
> alias: t2
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 1
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> t3
> TableScan
> alias: t3
> Filter Operator
> predicate:
> expr: (id = 20)
> type: boolean
> Reduce Output Operator
> key expressions:
> expr: id
> type: int
> sort order: +
> Map-reduce partition columns:
> expr: id
> type: int
> tag: 2
> value expressions:
> expr: id
> type: int
> expr: key
> type: string
> expr: value
> type: string
> Reduce Operator Tree:
> Join Operator
> condition map:
> Outer Join 0 to 1
> Inner Join 1 to 2
> condition expressions:
> 0 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 1 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> 2 {VALUE._col0} {VALUE._col1} {VALUE._col2}
> handleSkewJoin: false
> outputColumnNames: _col0, _col1, _col2, _col5, _col6, _col7,
> _col10, _col11, _col12
> Select Operator
> expressions:
> expr: _col0
> type: int
> expr: _col1
> type: string
> expr: _col2
> type: string
> expr: _col5
> type: int
> expr: _col6
> type: string
> expr: _col7
> type: string
> expr: _col10
> type: int
> expr: _col11
> type: string
> expr: _col12
> type: string
> outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5,
> _col6, _col7, _col8
> File Output Operator
> compressed: false
> GlobalTableId: 0
> table:
> input format: org.apache.hadoop.mapred.TextInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Stage: Stage-0
> Fetch Operator
> limit: -1
> {noformat}
> The current behavior is actually stranger than this: for a left outer join
> (similarly for a right outer join), hive finds the leftmost alias referred to
> in the *predicates* of left outer joins and rejects any alias to the right of
> it for pushdown. So in this query the filter "t2.id=20" pushed down:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on
> (t2.id=t3.id) where t2.id=20;
> {noformat}
> while it isn't here:
> {noformat}
> explain select * from t1 join t2 on (t1.id=t2.id) left outer join t3 on
> (t1.id=t3.id) where t2.id=20;
> {noformat}
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira