[jira] [Updated] (HIVE-15146) Too many Stats-Aggr Operator in multi-insert

Eugene Koifman (JIRA) Mon, 07 Nov 2016 17:10:16 -0800

     [ https://issues.apache.org/jira/browse/HIVE-15146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Eugene Koifman updated HIVE-15146:
----------------------------------
    Description: 
Consider:
{noformat}
create table if not exists  srcpart (a int, b int, c int)
partitioned by (z int)
clustered by (a) into 2 buckets
stored as orc
tblproperties("transactional"="true");


create temporary table if not exists data1 (x int);

insert into data1 values (1),(2),(3);

explain from data1
insert into srcpart partition(z) select 0,0,1,x
insert into srcpart partition(z=1) select 0,0,1;
{noformat}

Then the plan looks like:
{noformat}
2016-11-07T16:56:19,045  INFO [main] ql.TestTxnCommands2: STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-4 depends on stages: Stage-2
  Stage-1 depends on stages: Stage-4
  Stage-5 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: data1
            Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            Select Operator
              expressions: x (type: int)
              outputColumnNames: _col3
              Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
              Reduce Output Operator
                sort order:
                Map-reduce partition columns: 0 (type: int)
                Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
                value expressions: _col3 (type: int)
            Select Operator
              Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
              File Output Operator
                compressed: false
                table:
                    input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                    output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                    serde: org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
      Reduce Operator Tree:
        Select Operator
          expressions: 0 (type: int), 0 (type: int), 1 (type: int), VALUE._col2 (type: int)
          outputColumnNames: _col0, _col1, _col2, _col3
          Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: default.srcpart

  Stage: Stage-0
    Move Operator
      tables:
          partition:
            z
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.srcpart

  Stage: Stage-3
    Stats-Aggr Operator

  Stage: Stage-4
    Map Reduce
      Map Operator Tree:
          TableScan
            Reduce Output Operator
              sort order:
              Map-reduce partition columns: 0 (type: int)
              Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
      Reduce Operator Tree:
        Select Operator
          expressions: 0 (type: int), 0 (type: int), 1 (type: int)
          outputColumnNames: _col0, _col1, _col2
          Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
          File Output Operator
            compressed: false
            Statistics: Num rows: 1 Data size: 6 Basic stats: COMPLETE Column stats: NONE
            table:
                input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
                output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
                serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
                name: default.srcpart

  Stage: Stage-1
    Move Operator
      tables:
          partition:
            z 1
          replace: false
          table:
              input format: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
              output format: org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
              serde: org.apache.hadoop.hive.ql.io.orc.OrcSerde
              name: default.srcpart

  Stage: Stage-5
    Stats-Aggr Operator
{noformat}

Note that there are 2 stats aggregation tasks (Stage-3 and Stage-5), but both 
branches of the multi-insert update the same partition.
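
As a rough illustration (not part of Hive), the duplicate tasks can be spotted 
mechanically by scanning the EXPLAIN output for stages whose operator is 
Stats-Aggr Operator. The sketch below is plain Python over an abbreviated copy 
of the plan above; PLAN, stage_operators and stats_aggr_stages are hypothetical 
names introduced only for this example.

```python
import re

# Abbreviated copy of the plan above: stage headers and their operator lines only.
PLAN = """\
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-0 depends on stages: Stage-2
  Stage-3 depends on stages: Stage-0
  Stage-4 depends on stages: Stage-2
  Stage-1 depends on stages: Stage-4
  Stage-5 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Map Reduce
  Stage: Stage-0
    Move Operator
  Stage: Stage-3
    Stats-Aggr Operator
  Stage: Stage-4
    Map Reduce
  Stage: Stage-1
    Move Operator
  Stage: Stage-5
    Stats-Aggr Operator
"""


def stage_operators(plan):
    """Map each stage name to the first non-blank line listed under it."""
    ops = {}
    current = None
    for line in plan.splitlines():
        header = re.match(r"\s*Stage: (Stage-\d+)\s*$", line)
        if header:
            current = header.group(1)
        elif current is not None and line.strip():
            # Keep only the first line after the header (the operator name).
            ops.setdefault(current, line.strip())
    return ops


def stats_aggr_stages(plan):
    """Return the sorted names of stages that run a Stats-Aggr Operator."""
    ops = stage_operators(plan)
    return sorted(s for s, op in ops.items() if op == "Stats-Aggr Operator")


if __name__ == "__main__":
    print(stats_aggr_stages(PLAN))
```

For the plan above this reports Stage-3 and Stage-5, one per branch of the 
multi-insert.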

Once HIVE-14943 is in, there will be other ways to generate the same situation.

In particular, it will be possible to have 2 or 3 branches of the multi-insert, 
any or all of which use dynamic partition insert, which means the set of 
partitions actually updated is not known until run-time.

If at all possible, the solution should address this.
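
To make the run-time aspect concrete, here is a toy Python model of the 
multi-insert from the example (it is not Hive code, and the function names are 
made up for illustration). It assumes, as in the statements above, that the 
dynamic branch takes z from the last select column (x) and the static branch 
always writes z=1.

```python
# Toy model of the two insert branches; an illustration only, not Hive internals.

def dynamic_branch_partitions(xs):
    """partition(z) select 0,0,1,x: z is taken from each row, so the
    partitions written depend on the data seen at run time."""
    return {x for x in xs}


def static_branch_partitions():
    """partition(z=1) select 0,0,1: z is fixed when the query is compiled."""
    return {1}


rows = [1, 2, 3]  # the values inserted into data1 in the example

branch_a = dynamic_branch_partitions(rows)
branch_b = static_branch_partitions()

# Partitions written by both branches, and by at least one branch.
overlap = branch_a & branch_b
touched = branch_a | branch_b

if __name__ == "__main__":
    print("written by both branches:", sorted(overlap))
    print("written by any branch:", sorted(touched))
```

In this model the overlap cannot be computed until the rows of data1 have been 
read, which is the case a fix would ideally cover.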



> Too many Stats-Aggr Operator in multi-insert
> --------------------------------------------
>
>                 Key: HIVE-15146
>                 URL: https://issues.apache.org/jira/browse/HIVE-15146
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Eugene Koifman
>            Assignee: Pengcheng Xiong



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
