[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

Xuefu Zhang (JIRA) Sun, 17 Jan 2016 20:04:05 -0800

     [ 
https://issues.apache.org/jira/browse/HIVE-12736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Xuefu Zhang updated HIVE-12736:
-------------------------------
    Description: 
{code}
select  * from staff;
1       jone    22      1
2       lucy    21      1
3       hmm     22      2
4       james   24      3
5       xiaoliu 23      3

select id,date_ from trade union all select id,"test" from trade ;
1       201510210908
2       201509080234
2       201509080235
1       test
2       test
2       test

set hive.execution.engine=spark;
set spark.master=local;
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
1       jone    22      1       1       201510210908
2       lucy    21      1       2       201509080234
2       lucy    21      1       2       201509080235

set hive.execution.engine=mr;
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
FAILED: SemanticException [Error 10227]: Not all clauses are supported with 
mapjoin hint. Please remove mapjoin hint.
{code}
I have two questions
1.Why result of hive on spark not include the following record?
{code}
1       jone    22      1       1       test
2       lucy    21      1       2       test
2       lucy    21      1       2       test
{code}
2.Why there are two different ways of dealing same query?

explain 1:
{code}
set hive.execution.engine=spark;
set spark.master=local;
explain 
select id,date_ from trade union all select id,"test" from trade;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: id (type: int), date_ (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 12 Data size: 96 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: id (type: int), 'test' (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 12 Data size: 96 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}
explain 2:
{code}
set hive.execution.engine=spark;
set spark.master=local;
explain 
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
OK
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: id (type: int), date_ (type: string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 3 Data size: 24 Basic stats: 
COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 id (type: int)
                          1 _col0 (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:2
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: s
                  Statistics: Num rows: 1 Data size: 66 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 66 Basic stats: COMPLETE 
Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 id (type: int)
                        1 _col0 (type: int)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col7, 
_col8
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 6 Data size: 52 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: int), _col1 (type: string), 
_col2 (type: int), _col3 (type: int), _col7 (type: int), _col8 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
_col5
                        Statistics: Num rows: 6 Data size: 52 Basic stats: 
COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 6 Data size: 52 Basic stats: 
COMPLETE Column stats: NONE
                          table:
                              input format: 
org.apache.hadoop.mapred.TextInputFormat
                              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

I can't find any information about union "test" in explain 2.

Some properties on hive-site.xml is 
{code}
<property>
<name>hive.ignore.mapjoin.hint</name>
<value>false</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask</name>
<value>true</value>
{code}

  was:
select  * from staff;
1       jone    22      1
2       lucy    21      1
3       hmm     22      2
4       james   24      3
5       xiaoliu 23      3

select id,date_ from trade union all select id,"test" from trade ;
1       201510210908
2       201509080234
2       201509080235
1       test
2       test
2       test

set hive.execution.engine=spark;
set spark.master=local;
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
1       jone    22      1       1       201510210908
2       lucy    21      1       2       201509080234
2       lucy    21      1       2       201509080235

set hive.execution.engine=mr;
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
FAILED: SemanticException [Error 10227]: Not all clauses are supported with 
mapjoin hint. Please remove mapjoin hint.

I have two questions
1.Why result of hive on spark not include the following record?
1       jone    22      1       1       test
2       lucy    21      1       2       test
2       lucy    21      1       2       test

2.Why there are two different ways of dealing same query?


explain 1:
set hive.execution.engine=spark;
set spark.master=local;
explain 
select id,date_ from trade union all select id,"test" from trade;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Spark
      DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: id (type: int), date_ (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 12 Data size: 96 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: id (type: int), 'test' (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                    File Output Operator
                      compressed: false
                      Statistics: Num rows: 12 Data size: 96 Basic stats: 
COMPLETE Column stats: NONE
                      table:
                          input format: org.apache.hadoop.mapred.TextInputFormat
                          output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                          serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink


explain 2:
set hive.execution.engine=spark;
set spark.master=local;
explain 
select /*+mapjoin(t)*/ * from staff s join 
(select id,date_ from trade union all select id,"test" from trade ) t on 
s.id=t.id;
OK
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
      DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: trade
                  Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 3 Data size: 24 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: id (type: int), date_ (type: string)
                      outputColumnNames: _col0, _col1
                      Statistics: Num rows: 3 Data size: 24 Basic stats: 
COMPLETE Column stats: NONE
                      Spark HashTable Sink Operator
                        keys:
                          0 id (type: int)
                          1 _col0 (type: int)
            Local Work:
              Map Reduce Local Work

  Stage: Stage-1
    Spark
      DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:2
      Vertices:
        Map 2 
            Map Operator Tree:
                TableScan
                  alias: s
                  Statistics: Num rows: 1 Data size: 66 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: id is not null (type: boolean)
                    Statistics: Num rows: 1 Data size: 66 Basic stats: COMPLETE 
Column stats: NONE
                    Map Join Operator
                      condition map:
                           Inner Join 0 to 1
                      keys:
                        0 id (type: int)
                        1 _col0 (type: int)
                      outputColumnNames: _col0, _col1, _col2, _col3, _col7, 
_col8
                      input vertices:
                        1 Map 1
                      Statistics: Num rows: 6 Data size: 52 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: int), _col1 (type: string), 
_col2 (type: int), _col3 (type: int), _col7 (type: int), _col8 (type: string)
                        outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
_col5
                        Statistics: Num rows: 6 Data size: 52 Basic stats: 
COMPLETE Column stats: NONE
                        File Output Operator
                          compressed: false
                          Statistics: Num rows: 6 Data size: 52 Basic stats: 
COMPLETE Column stats: NONE
                          table:
                              input format: 
org.apache.hadoop.mapred.TextInputFormat
                              output format: 
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                              serde: 
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
            Local Work:
              Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink



I can't find any information about union "test" in explain 2.

Some properties on hive-site.xml is 
<property>
<name>hive.ignore.mapjoin.hint</name>
<value>false</value>
</property>
<property>
<name>hive.auto.convert.join</name>
<value>true</value>
</property>
<property>
<name>hive.auto.convert.join.noconditionaltask</name>
<value>true</value>


> It seems that result of Hive on Spark be mistaken and result of Hive and Hive 
> on Spark are not the same
> -------------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-12736
>                 URL: https://issues.apache.org/jira/browse/HIVE-12736
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.1, 1.2.1
>            Reporter: JoneZhang
>            Assignee: Chengxiang Li
>         Attachments: HIVE-12736.1-spark.patch, HIVE-12736.2-spark.patch
>
>
> {code}
> select  * from staff;
> 1     jone    22      1
> 2     lucy    21      1
> 3     hmm     22      2
> 4     james   24      3
> 5     xiaoliu 23      3
> select id,date_ from trade union all select id,"test" from trade ;
> 1     201510210908
> 2     201509080234
> 2     201509080235
> 1     test
> 2     test
> 2     test
> set hive.execution.engine=spark;
> set spark.master=local;
> select /*+mapjoin(t)*/ * from staff s join 
> (select id,date_ from trade union all select id,"test" from trade ) t on 
> s.id=t.id;
> 1     jone    22      1       1       201510210908
> 2     lucy    21      1       2       201509080234
> 2     lucy    21      1       2       201509080235
> set hive.execution.engine=mr;
> select /*+mapjoin(t)*/ * from staff s join 
> (select id,date_ from trade union all select id,"test" from trade ) t on 
> s.id=t.id;
> FAILED: SemanticException [Error 10227]: Not all clauses are supported with 
> mapjoin hint. Please remove mapjoin hint.
> {code}
> I have two questions
> 1.Why result of hive on spark not include the following record?
> {code}
> 1     jone    22      1       1       test
> 2     lucy    21      1       2       test
> 2     lucy    21      1       2       test
> {code}
> 2.Why there are two different ways of dealing same query?
> explain 1:
> {code}
> set hive.execution.engine=spark;
> set spark.master=local;
> explain 
> select id,date_ from trade union all select id,"test" from trade;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Spark
>       DagName: jonezhang_20151222191643_5301d90a-caf0-4934-8092-d165c87a4190:1
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
> Column stats: NONE
>                   Select Operator
>                     expressions: id (type: int), date_ (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 6 Data size: 48 Basic stats: 
> COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 12 Data size: 96 Basic stats: 
> COMPLETE Column stats: NONE
>                       table:
>                           input format: 
> org.apache.hadoop.mapred.TextInputFormat
>                           output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Map 2 
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
> Column stats: NONE
>                   Select Operator
>                     expressions: id (type: int), 'test' (type: string)
>                     outputColumnNames: _col0, _col1
>                     Statistics: Num rows: 6 Data size: 48 Basic stats: 
> COMPLETE Column stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 12 Data size: 96 Basic stats: 
> COMPLETE Column stats: NONE
>                       table:
>                           input format: 
> org.apache.hadoop.mapred.TextInputFormat
>                           output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}
> explain 2:
> {code}
> set hive.execution.engine=spark;
> set spark.master=local;
> explain 
> select /*+mapjoin(t)*/ * from staff s join 
> (select id,date_ from trade union all select id,"test" from trade ) t on 
> s.id=t.id;
> OK
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-1 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
>     Spark
>       DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:3
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: trade
>                   Statistics: Num rows: 6 Data size: 48 Basic stats: COMPLETE 
> Column stats: NONE
>                   Filter Operator
>                     predicate: id is not null (type: boolean)
>                     Statistics: Num rows: 3 Data size: 24 Basic stats: 
> COMPLETE Column stats: NONE
>                     Select Operator
>                       expressions: id (type: int), date_ (type: string)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 3 Data size: 24 Basic stats: 
> COMPLETE Column stats: NONE
>                       Spark HashTable Sink Operator
>                         keys:
>                           0 id (type: int)
>                           1 _col0 (type: int)
>             Local Work:
>               Map Reduce Local Work
>   Stage: Stage-1
>     Spark
>       DagName: jonezhang_20151222191716_be7eac84-b5b6-4478-b88f-9f59e2b1b1a8:2
>       Vertices:
>         Map 2 
>             Map Operator Tree:
>                 TableScan
>                   alias: s
>                   Statistics: Num rows: 1 Data size: 66 Basic stats: COMPLETE 
> Column stats: NONE
>                   Filter Operator
>                     predicate: id is not null (type: boolean)
>                     Statistics: Num rows: 1 Data size: 66 Basic stats: 
> COMPLETE Column stats: NONE
>                     Map Join Operator
>                       condition map:
>                            Inner Join 0 to 1
>                       keys:
>                         0 id (type: int)
>                         1 _col0 (type: int)
>                       outputColumnNames: _col0, _col1, _col2, _col3, _col7, 
> _col8
>                       input vertices:
>                         1 Map 1
>                       Statistics: Num rows: 6 Data size: 52 Basic stats: 
> COMPLETE Column stats: NONE
>                       Select Operator
>                         expressions: _col0 (type: int), _col1 (type: string), 
> _col2 (type: int), _col3 (type: int), _col7 (type: int), _col8 (type: string)
>                         outputColumnNames: _col0, _col1, _col2, _col3, _col4, 
> _col5
>                         Statistics: Num rows: 6 Data size: 52 Basic stats: 
> COMPLETE Column stats: NONE
>                         File Output Operator
>                           compressed: false
>                           Statistics: Num rows: 6 Data size: 52 Basic stats: 
> COMPLETE Column stats: NONE
>                           table:
>                               input format: 
> org.apache.hadoop.mapred.TextInputFormat
>                               output format: 
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                               serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>             Local Work:
>               Map Reduce Local Work
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {code}
> I can't find any information about union "test" in explain 2.
> Some properties on hive-site.xml is 
> {code}
> <property>
> <name>hive.ignore.mapjoin.hint</name>
> <value>false</value>
> </property>
> <property>
> <name>hive.auto.convert.join</name>
> <value>true</value>
> </property>
> <property>
> <name>hive.auto.convert.join.noconditionaltask</name>
> <value>true</value>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-12736) It seems that result of Hive on Spark be mistaken and result of Hive and Hive on Spark are not the same

Reply via email to