Jesus Camacho Rodriguez created HIVE-12465: ----------------------------------------------
Summary: Hive might produce wrong results when (outer) joins are merged Key: HIVE-12465 URL: https://issues.apache.org/jira/browse/HIVE-12465 Project: Hive Issue Type: Bug Affects Versions: 1.3.0, 2.0.0 Reporter: Jesus Camacho Rodriguez Assignee: Jesus Camacho Rodriguez Consider the following query: {noformat} select * from (select * from tab where tab.key = 0)a full outer join (select * from tab_part where tab_part.key = 98)b join tab_part c on a.key = b.key and b.key = c.key; {noformat} Hive should execute the full outer join operation (without ON clause) and then the join operation (ON a.key = b.key and b.key = c.key). Instead, it merges both joins, generating the following plan: {noformat} STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Map Reduce Map Operator Tree: TableScan alias: tab filterExpr: (key = 0) (type: boolean) Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (key = 0) (type: boolean) Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: 0 (type: int), value (type: string), ds (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string), _col2 (type: string) TableScan alias: tab_part filterExpr: (key = 98) (type: boolean) Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (key = 98) (type: boolean) Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: 98 (type: int), value (type: string), ds (type: string) outputColumnNames: _col0, _col1, _col2 Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: int) sort order: + Map-reduce partition columns: _col0 (type: int) Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column stats: NONE value expressions: _col1 (type: string), _col2 (type: string) TableScan alias: c Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: key (type: int) sort order: + Map-reduce partition columns: key (type: int) Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats: NONE value expressions: value (type: string), ds (type: string) Reduce Operator Tree: Join Operator condition map: Outer Join 0 to 1 Inner Join 1 to 2 keys: 0 _col0 (type: int) 1 _col0 (type: int) 2 key (type: int) outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8 Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.TextInputFormat output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {noformat} That plan is similar to the following query, which is different than the original one: {noformat} select * from (select * from tab where tab.key = 0)a full outer join (select * from tab_part where tab_part.key = 98)b on a.key = b.key join tab_part c on b.key = c.key; {noformat} It seems to be a problem in the recognition of join operations that can be merged into a single multijoin operator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)