[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

liyunzhang_intel (JIRA) Tue, 13 Jun 2017 00:01:34 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


liyunzhang_intel updated HIVE-11297:
------------------------------------
    Attachment: HIVE-11297.3.patch

[~csun]:  update SplitOpTreeForDPP and to split the trees like what you 
mentioned last time.
because the explain plan is changed after this jira
{code}
set hive.execution.engine=spark; 
set hive.auto.convert.join.noconditionaltask.size=20; 
set hive.spark.dynamic.partition.pruning=true;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = 
srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where 
srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

before
{code}
STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL 
SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
{code}

now
{code}
Stage: Stage-2  Spark
#### A masked pattern was here ####
    Vertices:
      Map 5 
          Map Operator Tree:
              TableScan
                alias: srcpart_date_hour
                filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE 
Column stats: NONE
                Filter Operator
                  predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 
11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col2
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE 
Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          partition key expr: ds
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          target column name: ds
                          target work: Map 1
                    Select Operator
                      expressions: _col2 (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          partition key expr: hr
                          Statistics: Num rows: 1 Data size: 27 Basic stats: 
COMPLETE Column stats: NONE
                          target column name: hr
                          target work: Map 1

Stage: Stage-1
  Spark
    Edges:
      Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL 
SORT, 2)
      Reducer 3 <- Reducer 2 (GROUP, 1)
{code}

but when i use following command to generate new 
spark_dynamic_partition_pruning.q.out
{code}
mvn clean test -Dtest=TestSparkCliDriver -Dtest.output.overwrite=true -Dqfile= 
spark_dynamic_partition_pruning.q
{code}

I found it not only changed above explain, but also change others. the changes 
like
1. comment like "SORT_QUERY_RESULTS" deleted, how to keep original format?
{code}
-PREHOOK: query: -- SORT_QUERY_RESULTS
-
-select distinct ds from srcpart
+PREHOOK: query: select distinct ds from srcpart
+POSTHOOK: query: EXPLAIN select count(*) from srcpart join srcpart_date on 
(srcpart.ds = srcpart_date.ds) join srcpart_hour on (srcpart.hr = 
srcpart_hour.hr) 
{code}

2.  some changes is not caused by HIVE-11297.3.patch. like "filter Operator is 
added in the explain plan"
{code}
POSTHOOK: type: QUERY
 STAGE DEPENDENCIES:
   Stage-2 is a root stage
@@ -3168,16 +3141,19 @@ STAGE PLANS:
                 mode: mergepartial
                 outputColumnNames: _col0
                 Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
Column stats: NONE
-                Group By Operator
-                  keys: _col0 (type: string)
-                  mode: hash
-                  outputColumnNames: _col0
-                  Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE 
Column stats: NONE
-                  Reduce Output Operator
-                    key expressions: _col0 (type: string)
-                    sort order: +
-                    Map-reduce partition columns: _col0 (type: string)
+                Filter Operator
+                  predicate: _col0 is not null (type: boolean)
+                  Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE 
Column stats: NONE
+                  Group By Operator
+                    keys: _col0 (type: string)
+                    mode: hash
+                    outputColumnNames: _col0
                     Statistics: Num rows: 2 Data size: 368 Basic stats: 
COMPLETE Column stats: NONE
+                    Reduce Output Operator
+                      key expressions: _col0 (type: string)
+                      sort order: +
+                      Map-reduce partition columns: _col0 (type: string)
+                      Statistics: Num rows: 2 Data size: 368 Basic stats: 
COMPLETE Column stats: NONE
{code}

How to solve the changes in spark.dynamic.partition.pruning.q.out?
1. just copy the change caused by HIVE-11297.3
2. use "-Dtest.output.overwrite=true" to generate a new *q.out
which do you prefer?


> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch, 
> HIVE-11297.3.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates 
> partition info for more than one partition columns, multiple operator trees 
> are created, which all start from the same table scan op, but have different 
> spark partition pruning sinks.
> As an optimization, we can combine these op trees and so don't have to do 
> table scan multiple times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]

Reply via email to