[ https://issues.apache.org/jira/browse/HIVE-21196?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jesus Camacho Rodriguez resolved HIVE-21196. -------------------------------------------- Fix Version/s: 4.0.0 Resolution: Fixed Pushed to master, thanks [~zabetak]! > Support semijoin reduction on multiple column join > -------------------------------------------------- > > Key: HIVE-21196 > URL: https://issues.apache.org/jira/browse/HIVE-21196 > Project: Hive > Issue Type: Bug > Reporter: Deepak Jaiswal > Assignee: Stamatis Zampetakis > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 3h 40m > Remaining Estimate: 0h > > Currently for a query involving join on multiple columns creates separate > semi join edges for each key which in turn create a bloom filter for each of > them, like below, > EXPLAIN select count(*) from srcpart_date_n7 join srcpart_small_n3 on > (srcpart_date_n7.key = srcpart_small_n3.key1 and srcpart_date_n7.value = > srcpart_small_n3.value1) > {code:java} > Map 1 <- Reducer 5 (BROADCAST_EDGE) > Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE) > Reducer 3 <- Reducer 2 (CUSTOM_SIMPLE_EDGE) > Reducer 5 <- Map 4 (CUSTOM_SIMPLE_EDGE) > #### A masked pattern was here #### > Vertices: > Map 1 > Map Operator Tree: > TableScan > alias: srcpart_date_n7 > filterExpr: (key is not null and value is not null and (key > BETWEEN DynamicValue(RS_7_srcpart_small_n3_key1_min) AND > DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, > DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter)))) (type: boolean) > Statistics: Num rows: 2000 Data size: 356000 Basic stats: > COMPLETE Column stats: COMPLETE > Filter Operator > predicate: ((key BETWEEN > DynamicValue(RS_7_srcpart_small_n3_key1_min) AND > DynamicValue(RS_7_srcpart_small_n3_key1_max) and in_bloom_filter(key, > DynamicValue(RS_7_srcpart_small_n3_key1_bloom_filter))) and key is not null > and value is not null) (type: boolean) > Statistics: Num rows: 2000 Data size: 356000 Basic stats: > COMPLETE Column stats: COMPLETE > Select Operator > expressions: key (type: string), value (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 2000 Data size: 356000 Basic > stats: COMPLETE Column stats: COMPLETE > Reduce Output Operator > key expressions: _col0 (type: string), _col1 (type: > string) > sort order: ++ > Map-reduce partition columns: _col0 (type: string), > _col1 (type: string) > Statistics: Num rows: 2000 Data size: 356000 Basic > stats: COMPLETE Column stats: COMPLETE > Execution mode: vectorized, llap > LLAP IO: all inputs > Map 4 > Map Operator Tree: > TableScan > alias: srcpart_small_n3 > filterExpr: (key1 is not null and value1 is not null) > (type: boolean) > Statistics: Num rows: 20 Data size: 3560 Basic stats: > PARTIAL Column stats: PARTIAL > Filter Operator > predicate: (key1 is not null and value1 is not null) > (type: boolean) > Statistics: Num rows: 20 Data size: 3560 Basic stats: > PARTIAL Column stats: PARTIAL > Select Operator > expressions: key1 (type: string), value1 (type: string) > outputColumnNames: _col0, _col1 > Statistics: Num rows: 20 Data size: 3560 Basic stats: > PARTIAL Column stats: PARTIAL > Reduce Output Operator > key expressions: _col0 (type: string), _col1 (type: > string) > sort order: ++ > Map-reduce partition columns: _col0 (type: string), > _col1 (type: string) > Statistics: Num rows: 20 Data size: 3560 Basic stats: > PARTIAL Column stats: PARTIAL > Select Operator > expressions: _col0 (type: string) > outputColumnNames: _col0 > Statistics: Num rows: 20 Data size: 3560 Basic stats: > PARTIAL Column stats: PARTIAL > Group By Operator > aggregations: min(_col0), max(_col0), > bloom_filter(_col0, expectedEntries=20) > mode: hash > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 730 Basic stats: > PARTIAL Column stats: PARTIAL > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 730 Basic > stats: PARTIAL Column stats: PARTIAL > value expressions: _col0 (type: string), _col1 > (type: string), _col2 (type: binary) > Execution mode: vectorized, llap > LLAP IO: all inputs > Reducer 2 > Execution mode: llap > Reduce Operator Tree: > Merge Join Operator > condition map: > Inner Join 0 to 1 > keys: > 0 _col0 (type: string), _col1 (type: string) > 1 _col0 (type: string), _col1 (type: string) > Statistics: Num rows: 2200 Data size: 391600 Basic stats: > PARTIAL Column stats: NONE > Group By Operator > aggregations: count() > mode: hash > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL > Column stats: NONE > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL > Column stats: NONE > value expressions: _col0 (type: bigint) > Reducer 3 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: count(VALUE._col0) > mode: mergepartial > outputColumnNames: _col0 > Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL > Column stats: NONE > File Output Operator > compressed: false > Statistics: Num rows: 1 Data size: 8 Basic stats: PARTIAL > Column stats: NONE > table: > input format: > org.apache.hadoop.mapred.SequenceFileInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat > serde: > org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe > Reducer 5 > Execution mode: vectorized, llap > Reduce Operator Tree: > Group By Operator > aggregations: min(VALUE._col0), max(VALUE._col1), > bloom_filter(VALUE._col2, expectedEntries=20) > mode: final > outputColumnNames: _col0, _col1, _col2 > Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL > Column stats: PARTIAL > Reduce Output Operator > sort order: > Statistics: Num rows: 1 Data size: 730 Basic stats: PARTIAL > Column stats: PARTIAL > value expressions: _col0 (type: string), _col1 (type: > string), _col2 (type: binary) > {code} > Instead it should create one branch for a join with one bloom filter. > > The implementation for bloom filter requires getting a hash out of all the > key columns and converting it to a long and feeding it to bloom filter as > input. This requires a new UDF which does this. It will be called at both > bloom filter generation and lookup phases. > The min and max will stay independent as they are today for each columns. > A vectorized implementation of such UDF is also required. -- This message was sent by Atlassian Jira (v8.3.4#803005)