[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

liyunzhang_intel (JIRA) Mon, 08 May 2017 23:08:27 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002113#comment-16002113
 ]


liyunzhang_intel commented on HIVE-11297:
-----------------------------------------

the explain plan of the multiple columns single source case in 
spark_dynamic_partition_pruning.q is 
{code}
-- multiple columns single source
EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

the explain plan is 
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL 
SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
COMPLETE Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: 
string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 
(type: string)
                      Statistics: Num rows: 2000 Data size: 21248 Basic stats: 
COMPLETE Column stats: NONE
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col2 (type: 
string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), 
_col2 (type: string)
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
        Reducer 2 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col2 (type: string)
                Statistics: Num rows: 2200 Data size: 23372 Basic stats: 
COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

Comparing Map5,Map6，actually they are similar except Spark Partition Pruning 
Sink Operator).

the query in tez, there is only 1 Map(Map4) contains {{Dynamic Partitioning 
Event Operator}}
{code}

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 757248 Basic stats: 
COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2000 Data size: 736000 Basic stats: 
COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: 
string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 
(type: string)
                      Statistics: Num rows: 2000 Data size: 736000 Basic stats: 
COMPLETE Column stats: COMPLETE
            Execution mode: llap
            LLAP IO: no inputs
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col2 (type: 
string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), 
_col2 (type: string)
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          Dynamic Partitioning Event Operator
                            Target column: ds (string)
                            Target Input: srcpart
                            Partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                            Target Vertex: Map 1
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          Dynamic Partitioning Event Operator
                            Target column: hr (string)
                            Target Input: srcpart
                            Partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                            Target Vertex: Map 1
            Execution mode: llap
            LLAP IO: no inputs
        Reducer 2 
            Execution mode: llap
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col2 (type: string)
                Statistics: Num rows: 2200 Data size: 809600 Basic stats: 
COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3 
            Execution mode: llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
Column stats: NONE
                  table:
                      input format: 
org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: 
org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{code}




> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>
> Currently, for dynamic partition pruning in Spark, if a small table generates 
> partition info for more than one partition columns, multiple operator trees 
> are created, which all start from the same table scan op, but have different 
> spark partition pruning sinks.
> As an optimization, we can combine these op trees and so don't have to do 
> table scan multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

Reply via email to