[ 
https://issues.apache.org/jira/browse/HIVE-17082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gopal V resolved HIVE-17082.
----------------------------
    Resolution: Not A Problem

[~rajesh.balamohan]: the semi-join can be removed in this case, because there 
is no shuffle between the map-join and the semi-join operators.
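
A rough way to picture the point above, as a toy model and not Hive code: when the probe into the broadcast hash table happens in the same vertex as the scan, a min/max plus bloom-filter pre-check on the same key only rejects rows the probe would reject anyway. The function name `map_join_count` and the use of a Python `set` as a stand-in for the bloom filter (no false positives) are assumptions made for illustration only.

```python
def map_join_count(big_keys, small_keys, prefilter=False):
    """Toy inner map join on a single key; returns (joined_rows, probes)."""
    # Build side: the small table is broadcast and turned into a hash table
    # mapping each key to its number of occurrences.
    hash_table = {}
    for k in small_keys:
        hash_table[k] = hash_table.get(k, 0) + 1
    if not hash_table:
        return 0, 0

    # Hypothetical semijoin filter built from the same small table:
    # min/max bounds plus a membership structure (a set stands in for
    # the bloom filter here).
    lo, hi = min(hash_table), max(hash_table)
    members = set(hash_table)

    joined = 0
    probes = 0
    for k in big_keys:  # every big-table row is still scanned either way
        if prefilter and not (lo <= k <= hi and k in members):
            continue  # row dropped before the hash-table probe
        probes += 1
        joined += hash_table.get(k, 0)
    return joined, probes


big = [1, 2, 2, 3, 5, 8]
small = [2, 3, 3, 9]
plain = map_join_count(big, small, prefilter=False)
filtered = map_join_count(big, small, prefilter=True)
```

In this model the join result is identical with and without the pre-filter; only the number of hash-table probes changes, and no shuffle is saved because the rows never leave the vertex.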

> Dynamic semi join gets turned off at compile time
> -------------------------------------------------
>
>                 Key: HIVE-17082
>                 URL: https://issues.apache.org/jira/browse/HIVE-17082
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Rajesh Balamohan
>
> With Hive-master:
> =================
> {noformat}
> 2017-07-13T08:35:55,042 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> optimizer.DynamicPartitionPruningOptimization: Initiate semijoin reduction 
> for sr_ticket_number ((sr_ticket_number is not null and (sr_ticket_number) IN 
> (RS[6]))
> 2017-07-13T08:35:55,043 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> optimizer.DynamicPartitionPruningOptimization: DynamicSemiJoinPushdown: 
> Saving RS to TS mapping: RS[28]: TS[3]
> 2017-07-13T08:35:55,398 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> optimizer.ConvertJoinMapJoin: Found semijoin optimization from the big table 
> side of a map join, which will cause a task cycle. Removing semijoin RS[28] - 
> TS[3] (store_returns)
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> parse.TezCompiler: Computing key domain cardinality, 
> keyDomainCardinality=95121413, semiJoinKeyIsPK=false, selColStat= colName: 
> _col0 colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 
> numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: 
> false, selColSourceStat= colName: sr_ticket_number colType: bigint 
> countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 
> Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, tsColStat= colName: 
> ss_ticket_number colType: bigint countDistincts: 86758883 numNulls: 0 
> avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] 
> isPrimaryKey: false
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> parse.TezCompiler: SemiJoin key selectivity=0.08791427436007496, 
> benefit=2.6267959439021907E9
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> parse.TezCompiler: BloomFilter benefit=2.6267959439021907E9, 
> cost=2.87999764E8, tsDataSize=2879987999, netBenefit=2.3387961799021907E9
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> parse.TezCompiler: netBenefit=0.8120853908815856
> 2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] 
> parse.TezCompiler: Semijoin optimization with parallel edge to map join. 
> Removing semijoin RS[23] - TS[0] (store_sales)
> > explain select count(1) from store_sales, store_returns where 
> > sr_ticket_number = ss_ticket_number;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: rbalamohan_20170713083602_0ed509c0-0311-480e-a01c-bafcb259a5fe:3
>       Edges:
>         Map 1 <- Map 3 (BROADCAST_EDGE)
>         Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
>       DagName:
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: ss_ticket_number is not null (type: boolean)
>                   Statistics: Num rows: 2879987999 Data size: 23039903992 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: ss_ticket_number is not null (type: boolean)
>                     Statistics: Num rows: 2879987999 Data size: 23039903992 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ss_ticket_number (type: bigint)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 2879987999 Data size: 23039903992 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col0 (type: bigint)
>                           1 _col0 (type: bigint)
>                         input vertices:
>                           1 Map 3
>                         Statistics: Num rows: 9560241388 Data size: 
> 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
>                         Group By Operator
>                           aggregations: count()
>                           mode: hash
>                           outputColumnNames: _col0
>                           Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                           Reduce Output Operator
>                             sort order:
>                             Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                             value expressions: _col0 (type: bigint)
>             Execution mode: vectorized, llap
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: store_returns
>                   filterExpr: sr_ticket_number is not null (type: boolean)
>                   Statistics: Num rows: 287999764 Data size: 2303998112 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: sr_ticket_number is not null (type: boolean)
>                     Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: sr_ticket_number (type: bigint)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: bigint)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: bigint)
>                         Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>             Execution mode: vectorized, llap
>         Reducer 2
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {noformat}
> Without TezCompiler::removeSemijoinsParallelToMapJoin:
> ======================================================
> Semi join gets invoked
> {noformat}
>  > explain select count(1) from store_sales, store_returns where 
> sr_ticket_number = ss_ticket_number;
> OK
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       DagId: rbalamohan_20170713082329_4c868b9a-6113-4da8-8c9a-66d9018e45c0:6
>       Edges:
>         Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
>         Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
>         Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
>       DagName:
>       Vertices:
>         Map 1
>             Map Operator Tree:
>                 TableScan
>                   alias: store_sales
>                   filterExpr: (ss_ticket_number is not null and 
> (ss_ticket_number BETWEEN 
> DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND 
> DynamicValue(RS_7_store_returns_sr_ticket_number_max) and 
> in_bloom_filter(ss_ticket_number, 
> DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: 
> boolean)
>                   Statistics: Num rows: 2879987999 Data size: 23039903992 
> Basic stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: (ss_ticket_number is not null and 
> (ss_ticket_number BETWEEN 
> DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND 
> DynamicValue(RS_7_store_returns_sr_ticket_number_max) and 
> in_bloom_filter(ss_ticket_number, 
> DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: 
> boolean)
>                     Statistics: Num rows: 2879987999 Data size: 23039903992 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: ss_ticket_number (type: bigint)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 2879987999 Data size: 23039903992 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       Map Join Operator
>                         condition map:
>                              Inner Join 0 to 1
>                         keys:
>                           0 _col0 (type: bigint)
>                           1 _col0 (type: bigint)
>                         input vertices:
>                           1 Map 3
>                         Statistics: Num rows: 9560241388 Data size: 
> 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
>                         Group By Operator
>                           aggregations: count()
>                           mode: hash
>                           outputColumnNames: _col0
>                           Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                           Reduce Output Operator
>                             sort order:
>                             Statistics: Num rows: 1 Data size: 8 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                             value expressions: _col0 (type: bigint)
>             Execution mode: vectorized, llap
>         Map 3
>             Map Operator Tree:
>                 TableScan
>                   alias: store_returns
>                   filterExpr: sr_ticket_number is not null (type: boolean)
>                   Statistics: Num rows: 287999764 Data size: 2303998112 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                   Filter Operator
>                     predicate: sr_ticket_number is not null (type: boolean)
>                     Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>                     Select Operator
>                       expressions: sr_ticket_number (type: bigint)
>                       outputColumnNames: _col0
>                       Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       Reduce Output Operator
>                         key expressions: _col0 (type: bigint)
>                         sort order: +
>                         Map-reduce partition columns: _col0 (type: bigint)
>                         Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>                       Select Operator
>                         expressions: _col0 (type: bigint)
>                         outputColumnNames: _col0
>                         Statistics: Num rows: 287999764 Data size: 2303998112 
> Basic stats: COMPLETE Column stats: COMPLETE
>                         Group By Operator
>                           aggregations: min(_col0), max(_col0), 
> bloom_filter(_col0, expectedEntries=16725060)
>                           mode: hash
>                           outputColumnNames: _col0, _col1, _col2
>                           Statistics: Num rows: 1 Data size: 24 Basic stats: 
> COMPLETE Column stats: COMPLETE
>                           Reduce Output Operator
>                             sort order:
>                             Statistics: Num rows: 1 Data size: 24 Basic 
> stats: COMPLETE Column stats: COMPLETE
>                             value expressions: _col0 (type: bigint), _col1 
> (type: bigint), _col2 (type: binary)
>             Execution mode: vectorized, llap
>         Reducer 2
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: count(VALUE._col0)
>                 mode: mergepartial
>                 outputColumnNames: _col0
>                 Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                 File Output Operator
>                   compressed: false
>                   Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                   table:
>                       input format: 
> org.apache.hadoop.mapred.SequenceFileInputFormat
>                       output format: 
> org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
>                       serde: 
> org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 4
>             Execution mode: vectorized, llap
>             Reduce Operator Tree:
>               Group By Operator
>                 aggregations: min(VALUE._col0), max(VALUE._col1), 
> bloom_filter(VALUE._col2, expectedEntries=16725060)
>                 mode: final
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                 Reduce Output Operator
>                   sort order:
>                   Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE 
> Column stats: COMPLETE
>                   value expressions: _col0 (type: bigint), _col1 (type: 
> bigint), _col2 (type: binary)
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {noformat}
> Related ticket: HIVE-16260
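
The cost/benefit figures in the TezCompiler DEBUG lines quoted above can be reproduced with simple arithmetic. The sketch below is inferred from the logged numbers only; the formulas are an assumption that happens to match the log, not a copy of the TezCompiler source, and the variable names are made up for illustration.

```python
# Values copied from the DEBUG log lines in the issue description.
sr_distinct = 8_362_530        # countDistincts of sr_ticket_number (selColStat)
ss_distinct = 86_758_883       # countDistincts of ss_ticket_number (tsColStat)
ts_data_size = 2_879_987_999   # tsDataSize (matches the store_sales row count)
cost = 287_999_764             # logged cost (matches the store_returns row count)

# Assumed relationships, chosen because they reproduce the logged values.
key_domain_cardinality = sr_distinct + ss_distinct
selectivity = sr_distinct / key_domain_cardinality
benefit = (1 - selectivity) * ts_data_size
net_benefit = benefit - cost
net_benefit_ratio = net_benefit / ts_data_size

print(key_domain_cardinality)
print(selectivity, benefit, net_benefit, net_benefit_ratio)
```

Under these assumptions the computed values agree with the logged keyDomainCardinality, selectivity, benefit and both netBenefit lines, which suggests the semijoin on store_sales was judged worthwhile on cost grounds and was removed only by the later parallel-edge check.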



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
