Jesus Camacho Rodriguez created HIVE-12465:
----------------------------------------------

             Summary: Hive might produce wrong results when (outer) joins are 
merged
                 Key: HIVE-12465
                 URL: https://issues.apache.org/jira/browse/HIVE-12465
             Project: Hive
          Issue Type: Bug
    Affects Versions: 1.3.0, 2.0.0
            Reporter: Jesus Camacho Rodriguez
            Assignee: Jesus Camacho Rodriguez


Consider the following query:

{noformat}
select * from
  (select * from tab where tab.key = 0)a
full outer join
  (select * from tab_part where tab_part.key = 98)b
join
  tab_part c
on a.key = b.key and b.key = c.key;
{noformat}

Hive should execute the full outer join operation (without ON clause) and then 
the join operation (ON a.key = b.key and b.key = c.key). Instead, it merges 
both joins, generating the following plan:

{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tab
            filterExpr: (key = 0) (type: boolean)
            Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE 
Column stats: NONE
            Filter Operator
              predicate: (key = 0) (type: boolean)
              Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE 
Column stats: NONE
              Select Operator
                expressions: 0 (type: int), value (type: string), ds (type: 
string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 121 Data size: 11374 Basic stats: 
COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 121 Data size: 11374 Basic stats: 
COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string)
          TableScan
            alias: tab_part
            filterExpr: (key = 98) (type: boolean)
            Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE 
Column stats: NONE
            Filter Operator
              predicate: (key = 98) (type: boolean)
              Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE 
Column stats: NONE
              Select Operator
                expressions: 98 (type: int), value (type: string), ds (type: 
string)
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 250 Data size: 23500 Basic stats: 
COMPLETE Column stats: NONE
                Reduce Output Operator
                  key expressions: _col0 (type: int)
                  sort order: +
                  Map-reduce partition columns: _col0 (type: int)
                  Statistics: Num rows: 250 Data size: 23500 Basic stats: 
COMPLETE Column stats: NONE
                  value expressions: _col1 (type: string), _col2 (type: string)
          TableScan
            alias: c
            Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE 
Column stats: NONE
            Reduce Output Operator
              key expressions: key (type: int)
              sort order: +
              Map-reduce partition columns: key (type: int)
              Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE 
Column stats: NONE
              value expressions: value (type: string), ds (type: string)
      Reduce Operator Tree:
        Join Operator
          condition map:
               Outer Join 0 to 1
               Inner Join 1 to 2
          keys:
            0 _col0 (type: int)
            1 _col0 (type: int)
            2 key (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, 
_col7, _col8
          Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE 
Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE 
Column stats: NONE
            table:
                input format: org.apache.hadoop.mapred.TextInputFormat
                output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{noformat}

That plan is similar to the following query, which is different than the 
original one:
{noformat}
select * from
  (select * from tab where tab.key = 0)a
full outer join
  (select * from tab_part where tab_part.key = 98)b
on a.key = b.key
join
  tab_part c
on b.key = c.key;
{noformat}

It seems to be a problem in the recognition of join operations that can be 
merged into a single multijoin operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to