[ https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002113#comment-16002113 ]
liyunzhang_intel commented on HIVE-11297: ----------------------------------------- the explain plan of the multiple columns single source case in spark_dynamic_partition_pruning.q is {code} -- multiple columns single source EXPLAIN select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr) where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11; {code} the explain plan is {code} STAGE DEPENDENCIES: Stage-2 is a root stage Stage-1 depends on stages: Stage-2 Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-2 Spark #### A masked pattern was here #### Vertices: Map 5 Map Operator Tree: TableScan alias: srcpart_date_hour filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ds (type: string), hr (type: string) outputColumnNames: _col0, _col2 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: _col0 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Spark Partition Pruning Sink Operator partition key expr: ds Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE target column name: ds target work: Map 1 Map 6 Map Operator Tree: TableScan alias: srcpart_date_hour filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ds (type: string), hr (type: string) outputColumnNames: _col0, _col2 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col2 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: _col0 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Spark Partition Pruning Sink Operator partition key expr: hr Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE target column name: hr target work: Map 1 Stage: Stage-1 Spark Edges: Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2) Reducer 3 <- Reducer 2 (GROUP, 1) #### A masked pattern was here #### Vertices: Map 1 Map Operator Tree: TableScan alias: srcpart Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ds (type: string), hr (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string) Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE Map 4 Map Operator Tree: TableScan alias: srcpart_date_hour filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ds (type: string), hr (type: string) outputColumnNames: _col0, _col2 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col2 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string), _col2 (type: string) Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Reducer 2 Reduce Operator Tree: Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: string), _col1 (type: string) 1 _col0 (type: string), _col2 (type: string) Statistics: Num rows: 2200 Data size: 23372 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reducer 3 Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} Comparing Map5,Map6,actually they are similar except Spark Partition Pruning Sink Operator). the query in tez, there is only 1 Map(Map4) contains {{Dynamic Partitioning Event Operator}} {code} STAGE DEPENDENCIES: Stage-1 is a root stage Stage-0 depends on stages: Stage-1 STAGE PLANS: Stage: Stage-1 Tez #### A masked pattern was here #### Edges: Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) Reducer 3 <- Reducer 2 (SIMPLE_EDGE) #### A masked pattern was here #### Vertices: Map 1 Map Operator Tree: TableScan alias: srcpart Statistics: Num rows: 2000 Data size: 757248 Basic stats: COMPLETE Column stats: COMPLETE Select Operator expressions: ds (type: string), hr (type: string) outputColumnNames: _col0, _col1 Statistics: Num rows: 2000 Data size: 736000 Basic stats: COMPLETE Column stats: COMPLETE Reduce Output Operator key expressions: _col0 (type: string), _col1 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string), _col1 (type: string) Statistics: Num rows: 2000 Data size: 736000 Basic stats: COMPLETE Column stats: COMPLETE Execution mode: llap LLAP IO: no inputs Map 4 Map Operator Tree: TableScan alias: srcpart_date_hour filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean) Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ds (type: string), hr (type: string) outputColumnNames: _col0, _col2 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator key expressions: _col0 (type: string), _col2 (type: string) sort order: ++ Map-reduce partition columns: _col0 (type: string), _col2 (type: string) Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: _col0 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: _col0 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Dynamic Partitioning Event Operator Target column: ds (string) Target Input: srcpart Partition key expr: ds Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Target Vertex: Map 1 Select Operator expressions: _col2 (type: string) outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Group By Operator keys: _col0 (type: string) mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Dynamic Partitioning Event Operator Target column: hr (string) Target Input: srcpart Partition key expr: hr Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE Target Vertex: Map 1 Execution mode: llap LLAP IO: no inputs Reducer 2 Execution mode: llap Reduce Operator Tree: Merge Join Operator condition map: Inner Join 0 to 1 keys: 0 _col0 (type: string), _col1 (type: string) 1 _col0 (type: string), _col2 (type: string) Statistics: Num rows: 2200 Data size: 809600 Basic stats: COMPLETE Column stats: NONE Group By Operator aggregations: count() mode: hash outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE Reduce Output Operator sort order: Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE value expressions: _col0 (type: bigint) Reducer 3 Execution mode: llap Reduce Operator Tree: Group By Operator aggregations: count(VALUE._col0) mode: mergepartial outputColumnNames: _col0 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE File Output Operator compressed: false Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE table: input format: org.apache.hadoop.mapred.SequenceFileInputFormat output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe Stage: Stage-0 Fetch Operator limit: -1 Processor Tree: ListSink {code} > Combine op trees for partition info generating tasks [Spark branch] > ------------------------------------------------------------------- > > Key: HIVE-11297 > URL: https://issues.apache.org/jira/browse/HIVE-11297 > Project: Hive > Issue Type: Bug > Affects Versions: spark-branch > Reporter: Chao Sun > Assignee: liyunzhang_intel > > Currently, for dynamic partition pruning in Spark, if a small table generates > partition info for more than one partition columns, multiple operator trees > are created, which all start from the same table scan op, but have different > spark partition pruning sinks. > As an optimization, we can combine these op trees and so don't have to do > table scan multiple times. -- This message was sent by Atlassian JIRA (v6.3.15#6346)