Rajesh Balamohan created HIVE-17082:
---------------------------------------

             Summary: Dynamic semi join gets turned off at compile time
                 Key: HIVE-17082
                 URL: https://issues.apache.org/jira/browse/HIVE-17082
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan


With Hive-master:
=================

{noformat}


2017-07-13T08:35:55,042 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.DynamicPartitionPruningOptimization: Initiate semijoin reduction for sr_ticket_number ((sr_ticket_number is not null and (sr_ticket_number) IN (RS[6]))
2017-07-13T08:35:55,043 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.DynamicPartitionPruningOptimization: DynamicSemiJoinPushdown: Saving RS to TS mapping: RS[28]: TS[3]
2017-07-13T08:35:55,398 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.ConvertJoinMapJoin: Found semijoin optimization from the big table side of a map join, which will cause a task cycle. Removing semijoin RS[28] - TS[3] (store_returns)
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: Computing key domain cardinality, keyDomainCardinality=95121413, semiJoinKeyIsPK=false, selColStat= colName: _col0 colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, selColSourceStat= colName: sr_ticket_number colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, tsColStat= colName: ss_ticket_number colType: bigint countDistincts: 86758883 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: SemiJoin key selectivity=0.08791427436007496, benefit=2.6267959439021907E9
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: BloomFilter benefit=2.6267959439021907E9, cost=2.87999764E8, tsDataSize=2879987999, netBenefit=2.3387961799021907E9
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: netBenefit=0.8120853908815856
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: Semijoin optimization with parallel edge to map join. Removing semijoin RS[23] - TS[0] (store_sales)

> explain select count(1) from store_sales, store_returns where sr_ticket_number = ss_ticket_number;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20170713083602_0ed509c0-0311-480e-a01c-bafcb259a5fe:3
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 9560241388 Data size: 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
        Map 3
            Map Operator Tree:
                TableScan
                  alias: store_returns
                  filterExpr: sr_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: sr_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: sr_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink



{noformat}
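
The TezCompiler numbers in the log above can be reproduced by hand. The sketch below is only a reading of the logged values, not a copy of the Hive source: the formulas (key domain as the sum of the two distinct counts, selectivity as the build-side distinct count over the key domain, cost as the build-side row count) are assumptions inferred from the figures, and the function and parameter names are made up for illustration.

```python
def semijoin_net_benefit(ts_rows, ts_key_ndv, sel_rows, sel_key_ndv):
    """Recompute the semijoin cost/benefit figures seen in the debug log.

    Assumed formulas (inferred from the logged values, not from Hive code):
      key domain  = NDV(big-table key) + NDV(build-side key)   (key is not a PK)
      selectivity = NDV(build-side key) / key domain
      benefit     = big-table rows * (1 - selectivity)
      cost        = build-side rows
      net ratio   = (benefit - cost) / big-table rows
    """
    key_domain = ts_key_ndv + sel_key_ndv
    selectivity = sel_key_ndv / key_domain
    benefit = ts_rows * (1.0 - selectivity)
    cost = float(sel_rows)
    net_benefit = benefit - cost
    return {
        "key_domain": key_domain,
        "selectivity": selectivity,
        "benefit": benefit,
        "cost": cost,
        "net_benefit": net_benefit,
        "net_benefit_ratio": net_benefit / ts_rows,
    }


# Values taken from the log and the EXPLAIN statistics above.
stats = semijoin_net_benefit(
    ts_rows=2879987999,    # store_sales rows (logged as tsDataSize)
    ts_key_ndv=86758883,   # countDistincts of ss_ticket_number
    sel_rows=287999764,    # store_returns rows
    sel_key_ndv=8362530,   # countDistincts of sr_ticket_number
)

if __name__ == "__main__":
    for name, value in stats.items():
        print(f"{name} = {value}")
```

Under these assumptions the ratio comes out at roughly 0.81, matching the logged netBenefit, so the semijoin for store_sales is dropped by the "parallel edge to map join" check and not for lack of estimated benefit.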

Without TezCompiler::removeSemijoinsParallelToMapJoin:
======================================================

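With that step skipped, the semijoin reduction is kept: the store_sales scan gets a min/max BETWEEN predicate and an in_bloom_filter check, both fed from store_returns through Reducer 4.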

{noformat}


> explain select count(1) from store_sales, store_returns where sr_ticket_number = ss_ticket_number;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20170713082329_4c868b9a-6113-4da8-8c9a-66d9018e45c0:6
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
        Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_ticket_number is not null and (ss_ticket_number BETWEEN DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND DynamicValue(RS_7_store_returns_sr_ticket_number_max) and in_bloom_filter(ss_ticket_number, DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: boolean)
                  Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (ss_ticket_number is not null and (ss_ticket_number BETWEEN DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND DynamicValue(RS_7_store_returns_sr_ticket_number_max) and in_bloom_filter(ss_ticket_number, DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: boolean)
                    Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 9560241388 Data size: 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
        Map 3
            Map Operator Tree:
                TableScan
                  alias: store_returns
                  filterExpr: sr_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: sr_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: sr_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: bigint)
                        outputColumnNames: _col0
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=16725060)
                          mode: hash
                          outputColumnNames: _col0, _col1, _col2
                          Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: binary)
            Execution mode: vectorized, llap
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 4
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=16725060)
                mode: final
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: binary)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{noformat}
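
For readers unfamiliar with the filter in the second plan, here is a toy model of what the BETWEEN plus in_bloom_filter predicate does to the store_sales scan. It is a sketch under simplifying assumptions, not Hive's implementation: the BloomFilter class, its sizing, the hash choice, and the helper names are all hypothetical, and the build side is assumed to contain at least one non-null key.

```python
import hashlib


class BloomFilter:
    """Tiny illustrative Bloom filter (not Hive's implementation)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def build_semijoin_filter(build_keys):
    """Small-table side: compute min, max, and a Bloom filter over the keys."""
    keys = [k for k in build_keys if k is not None]
    bloom = BloomFilter()
    for k in keys:
        bloom.add(k)
    return min(keys), max(keys), bloom


def probe_side_filter(probe_keys, lo, hi, bloom):
    """Big-table side: not null, BETWEEN lo AND hi, and in the Bloom filter."""
    return [
        k for k in probe_keys
        if k is not None and lo <= k <= hi and bloom.might_contain(k)
    ]


# Made-up key values standing in for sr_ticket_number / ss_ticket_number.
lo, hi, bloom = build_semijoin_filter([5, 7, 9])
survivors = probe_side_filter([1, 5, 6, 7, 20, None], lo, hi, bloom)
```

Keys 5 and 7 always survive, while 1, 20, and NULL are always dropped before the join. A key such as 6 is usually dropped but can slip through as a Bloom filter false positive, which is harmless because the map join still applies the exact equality.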

Related ticket: HIVE-16260


