[ https://issues.apache.org/jira/browse/HIVE-28480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Zhihua Deng updated HIVE-28480:
-------------------------------
    Labels: hive-4.0.1-merged hive-4.0.1-must pull-request-available  (was: hive-4.0.1-must pull-request-available)

> Disable SMB on partition hash generator mismatch across join branches in previous RS
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-28480
>                 URL: https://issues.apache.org/jira/browse/HIVE-28480
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning
>            Reporter: Himanshu Mishra
>            Assignee: Himanshu Mishra
>            Priority: Critical
>              Labels: hive-4.0.1-merged, hive-4.0.1-must, pull-request-available
>             Fix For: 4.1.0, 4.0.1
>
>
> As SMB replaces the last RS op of each joining branch, and the JOIN op, with
> MERGEJOIN, we need to ensure that the RS ops preceding these RS ops, in both
> branches, partition using the same hash generator.
> The hash code generator differs depending on ReducerTraits.UNIFORM, i.e.
> [ReduceSinkOperator#computeMurmurHash() or ReduceSinkOperator#computeHashCode()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/ReduceSinkOperator.java#L340-L344],
> leading to different hash codes for the same value.
> Skip SMB join in such cases.
> h3. Replication:
> Consider the following query, where the join would get converted to SMB. Auto
> reducer parallelism is enabled, which ensures more than one reducer task.
>
> {code:java}
> CREATE TABLE t_asj_18 (k STRING, v INT);
> INSERT INTO t_asj_18 values ('a', 10), ('a', 10);
> set hive.auto.convert.join=false;
> set hive.tez.auto.reducer.parallelism=true;
> EXPLAIN SELECT * FROM (
>     SELECT k, COUNT(DISTINCT v), SUM(v)
>     FROM t_asj_18 GROUP BY k
> ) a LEFT JOIN (
>     SELECT k, COUNT(v)
>     FROM t_asj_18 GROUP BY k
> ) b ON a.k = b.k; {code}
>
> The expected result is:
>
> {code:java}
> a 1 20 a 2 {code}
> but on the master branch it results in:
>
> {code:java}
> a 1 20 NULL NULL {code}
>
> Here, for COUNT(DISTINCT), the RS key is (k, v) while the partition key is
> still k. In such a scenario the
> [reducer trait UNIFORM is not set|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/SetReducerParallelism.java#L99-L104].
> The hash code for "a" from the 2nd subquery is generated using murmurHash
> (270516725) while the one from the 1st subquery is generated using bucketHash
> (1086686554), so rows with key "a" reach different reducer tasks.
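
The effect of the two hash codes quoted above can be illustrated with a minimal sketch. It assumes a row is routed to reducer `(hash & Integer.MAX_VALUE) % numReducers`; this is a simplification for illustration and is not taken from the Hive source. The class name, the `partition` helper and the reducer count of 2 are hypothetical; only the two hash values come from the issue description.

```java
public class PartitionMismatchSketch {
    // Hash codes for key "a" as quoted in the issue description.
    static final int BUCKET_HASH_A = 1086686554; // 1st subquery (bucketHash)
    static final int MURMUR_HASH_A = 270516725;  // 2nd subquery (murmurHash)

    // Assumed routing rule (illustrative only): clear the sign bit,
    // then take the remainder modulo the number of reducers.
    static int partition(int hash, int numReducers) {
        return (hash & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 2; // hypothetical; any value > 1 can expose the mismatch

        int reducerForA = partition(BUCKET_HASH_A, numReducers);
        int reducerForB = partition(MURMUR_HASH_A, numReducers);

        System.out.println("branch a, key \"a\" -> reducer " + reducerForA);
        System.out.println("branch b, key \"a\" -> reducer " + reducerForB);

        // A merge join in one reducer task only sees rows routed to that task,
        // so if the two branches disagree, the LEFT JOIN finds no match.
        System.out.println("co-located: " + (reducerForA == reducerForB));
    }
}
```

Under these assumptions the same key lands on reducer 0 for branch a and reducer 1 for branch b, which is consistent with the `NULL NULL` on the right-hand side of the LEFT JOIN result.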