[ https://issues.apache.org/jira/browse/HIVE-17018?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16088721#comment-16088721 ]
Carter Shanklin commented on HIVE-17018: ---------------------------------------- My inputs: That particular variable can't be renamed to something spark specific since all engines use it Adding a net new variable for Spark would increase confusion rather than decrease it. It would be good to have some sort of descriptive name that applies to both Tez and MR. As pointed out there is no relation between what that variable used to do and what it does today, and the implication of changing that parameter is difficult to guess. Maybe a new variable like hive.auto.convert.join.max.hashtable.size could be introduced. Both engines switch to that variable at some point, then usage of the old variable could be deprecated and then removed. Just my inputs. /cc [~ashutoshc] > Small table is converted to map join even the total size of small tables > exceeds the threshold(hive.auto.convert.join.noconditionaltask.size) > --------------------------------------------------------------------------------------------------------------------------------------------- > > Key: HIVE-17018 > URL: https://issues.apache.org/jira/browse/HIVE-17018 > Project: Hive > Issue Type: Bug > Reporter: liyunzhang_intel > Assignee: liyunzhang_intel > Attachments: HIVE-17018_data_init.q, HIVE-17018.q, t3.txt > > > we use "hive.auto.convert.join.noconditionaltask.size" as the threshold. it > means the sum of size for n-1 of the tables/partitions for a n-way join is > smaller than it, it will be converted to a map join. for example, A join B > join C join D join E. Big table is A(100M), small tables are > B(10M),C(10M),D(10M),E(10M). If we set > hive.auto.convert.join.noconditionaltask.size=20M. In current code, E,D,B > will be converted to map join but C will not be converted to map join. In my > understanding, because hive.auto.convert.join.noconditionaltask.size can only > contain E and D, so C and B should not be converted to map join. > Let's explain more why E can be converted to map join. > in current code, > [SparkMapJoinOptimizer#getConnectedMapJoinSize|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L364] > calculates all the mapjoins in the parent path and child path. The search > stops when encountering [UnionOperator or > ReduceOperator|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L381]. > Because C is not converted to map join because {{connectedMapJoinSize + > totalSize) > maxSize}} [see > code|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#L330].The > RS before the join of C remains. When calculating whether B will be > converted to map join, {{getConnectedMapJoinSize}} returns 0 as encountering > [RS > |https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SparkMapJoinOptimizer.java#409] > and causes {{connectedMapJoinSize + totalSize) < maxSize}} matches. > [~xuefuz] or [~jxiang]: can you help see whether this is a bug or not as you > are more familiar with SparkJoinOptimizer. -- This message was sent by Atlassian JIRA (v6.4.14#64029)