[ https://issues.apache.org/jira/browse/HIVE-17082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gopal V resolved HIVE-17082.
----------------------------
Resolution: Not A Problem

[~rajesh.balamohan]: the semi-join can be removed in this case, because there is no shuffle between the map-join and the semi-join operators.

> Dynamic semi join gets turned off at compile time
> -------------------------------------------------
>
> Key: HIVE-17082
> URL: https://issues.apache.org/jira/browse/HIVE-17082
> Project: Hive
> Issue Type: Bug
> Reporter: Rajesh Balamohan
>
> With Hive-master:
> =================
> {noformat}
> 2017-07-13T08:35:55,042 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> optimizer.DynamicPartitionPruningOptimization: Initiate semijoin reduction
> for sr_ticket_number ((sr_ticket_number is not null and (sr_ticket_number) IN
> (RS[6]))
> 2017-07-13T08:35:55,043 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> optimizer.DynamicPartitionPruningOptimization: DynamicSemiJoinPushdown:
> Saving RS to TS mapping: RS[28]: TS[3]
> 2017-07-13T08:35:55,398 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> optimizer.ConvertJoinMapJoin: Found semijoin optimization from the big table
> side of a map join, which will cause a task cycle. Removing semijoin RS[28] -
> TS[3] (store_returns)
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: Computing key domain cardinality,
> keyDomainCardinality=95121413, semiJoinKeyIsPK=false, selColStat= colName:
> _col0 colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0
> numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey:
> false, selColSourceStat= colName: sr_ticket_number colType: bigint
> countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0
> Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, tsColStat= colName:
> ss_ticket_number colType: bigint countDistincts: 86758883 numNulls: 0
> avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ]
> isPrimaryKey: false
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: SemiJoin key selectivity=0.08791427436007496,
> benefit=2.6267959439021907E9
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: BloomFilter benefit=2.6267959439021907E9,
> cost=2.87999764E8, tsDataSize=2879987999, netBenefit=2.3387961799021907E9
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: netBenefit=0.8120853908815856
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main]
> parse.TezCompiler: Semijoin optimization with parallel edge to map join.
> Removing semijoin RS[23] - TS[0] (store_sales)
> > explain select count(1) from store_sales, store_returns where
> > sr_ticket_number = ss_ticket_number;
> OK
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: rbalamohan_20170713083602_0ed509c0-0311-480e-a01c-bafcb259a5fe:3
> Edges:
> Map 1 <- Map 3 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> DagName:
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: ss_ticket_number is not null (type: boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: ss_ticket_number is not null (type: boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: ss_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col0 (type: bigint)
> 1 _col0 (type: bigint)
> input vertices:
> 1 Map 3
> Statistics: Num rows: 9560241388 Data size:
> 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> Map 3
> Map Operator Tree:
> TableScan
> alias: store_returns
> filterExpr: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112 Basic
> stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: sr_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: bigint)
> sort order: +
> Map-reduce partition columns: _col0 (type: bigint)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Execution mode: vectorized, llap
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {noformat}
> Without TezCompiler::removeSemijoinsParallelToMapJoin:
> ======================================================
> Semi join gets invoked
> {noformat}
> > explain select count(1) from store_sales, store_returns where
> sr_ticket_number = ss_ticket_number;
> OK
> STAGE DEPENDENCIES:
> Stage-1 is a root stage
> Stage-0 depends on stages: Stage-1
> STAGE PLANS:
> Stage: Stage-1
> Tez
> DagId: rbalamohan_20170713082329_4c868b9a-6113-4da8-8c9a-66d9018e45c0:6
> Edges:
> Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
> Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
> Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
> DagName:
> Vertices:
> Map 1
> Map Operator Tree:
> TableScan
> alias: store_sales
> filterExpr: (ss_ticket_number is not null and
> (ss_ticket_number BETWEEN
> DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND
> DynamicValue(RS_7_store_returns_sr_ticket_number_max) and
> in_bloom_filter(ss_ticket_number,
> DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type:
> boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: (ss_ticket_number is not null and
> (ss_ticket_number BETWEEN
> DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND
> DynamicValue(RS_7_store_returns_sr_ticket_number_max) and
> in_bloom_filter(ss_ticket_number,
> DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type:
> boolean)
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: ss_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 2879987999 Data size: 23039903992
> Basic stats: COMPLETE Column stats: COMPLETE
> Map Join Operator
> condition map:
> Inner Join 0 to 1
> keys:
> 0 _col0 (type: bigint)
> 1 _col0 (type: bigint)
> input vertices:
> 1 Map 3
> Statistics: Num rows: 9560241388 Data size:
> 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: count()
> mode: hash
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 8 Basic stats:
> COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint)
> Execution mode: vectorized, llap
> Map 3
> Map Operator Tree:
> TableScan
> alias: store_returns
> filterExpr: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112 Basic
> stats: COMPLETE Column stats: COMPLETE
> Filter Operator
> predicate: sr_ticket_number is not null (type: boolean)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: sr_ticket_number (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> key expressions: _col0 (type: bigint)
> sort order: +
> Map-reduce partition columns: _col0 (type: bigint)
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Select Operator
> expressions: _col0 (type: bigint)
> outputColumnNames: _col0
> Statistics: Num rows: 287999764 Data size: 2303998112
> Basic stats: COMPLETE Column stats: COMPLETE
> Group By Operator
> aggregations: min(_col0), max(_col0),
> bloom_filter(_col0, expectedEntries=16725060)
> mode: hash
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 1 Data size: 24 Basic stats:
> COMPLETE Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 24 Basic
> stats: COMPLETE Column stats: COMPLETE
> value expressions: _col0 (type: bigint), _col1
> (type: bigint), _col2 (type: binary)
> Execution mode: vectorized, llap
> Reducer 2
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: count(VALUE._col0)
> mode: mergepartial
> outputColumnNames: _col0
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> File Output Operator
> compressed: false
> Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE
> Column stats: COMPLETE
> table:
> input format:
> org.apache.hadoop.mapred.SequenceFileInputFormat
> output format:
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
> serde:
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> Reducer 4
> Execution mode: vectorized, llap
> Reduce Operator Tree:
> Group By Operator
> aggregations: min(VALUE._col0), max(VALUE._col1),
> bloom_filter(VALUE._col2, expectedEntries=16725060)
> mode: final
> outputColumnNames: _col0, _col1, _col2
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE
> Column stats: COMPLETE
> Reduce Output Operator
> sort order:
> Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE
> Column stats: COMPLETE
> value expressions: _col0 (type: bigint), _col1 (type:
> bigint), _col2 (type: binary)
> Stage: Stage-0
> Fetch Operator
> limit: -1
> Processor Tree:
> ListSink
> {noformat}
> Related ticket: HIVE-16260

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
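
Note on the TezCompiler figures quoted in the debug log: they can be reproduced from the column statistics with simple arithmetic. The sketch below is inferred only from the logged numbers, not taken from the Hive source, so the formulas are an assumption; the function and parameter names are hypothetical.

```python
def semijoin_estimate(sel_ndv, ts_ndv, ts_rows, sel_rows):
    """Recompute the logged semijoin estimate from column stats.

    All formulas are inferred from the values in the debug log
    (an assumption), not copied from TezCompiler.
    """
    # Key domain: distinct keys on the filter side plus the scanned side.
    key_domain_cardinality = sel_ndv + ts_ndv
    # Fraction of the key domain covered by the filter-side keys.
    selectivity = sel_ndv / key_domain_cardinality
    # Rows of the big table the bloom filter is expected to drop.
    benefit = ts_rows * (1.0 - selectivity)
    # Cost taken as the number of rows fed into the bloom filter.
    cost = float(sel_rows)
    net_benefit = benefit - cost
    return {
        "key_domain_cardinality": key_domain_cardinality,
        "selectivity": selectivity,
        "benefit": benefit,
        "cost": cost,
        "net_benefit": net_benefit,
        # The last logged netBenefit is the ratio to the big-table size.
        "net_benefit_ratio": net_benefit / ts_rows,
    }


# Values copied from the log: countDistincts of sr_ticket_number and
# ss_ticket_number, and the row counts of store_sales and store_returns.
estimate = semijoin_estimate(
    sel_ndv=8362530,
    ts_ndv=86758883,
    ts_rows=2879987999,
    sel_rows=287999764,
)
for name, value in estimate.items():
    print(f"{name} = {value}")
```

With these inputs the key domain cardinality comes out at 95121413, the selectivity at roughly 0.0879 and the net benefit ratio at roughly 0.812, in line with the log. The estimate is strongly positive, yet the semijoin is still dropped: the final log line shows the removal is a structural rule ("parallel edge to map join"), not a cost decision.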