Hello Team. I have a small problem with MapJoin. I have one large table, T1, of about 20 GB, and two small tables, T2 and T3, of about 800 MB each. I run the following SQL:
SET hive.auto.convert.join.noconditionaltask.size = 1000000000; -- 1GB
explain
select count(1)
from t1 --20G
join t2 --800m
  on t1.uni_order_id = t2.uni_order_id
join t3 --800m
  on t1.uni_order_id = t3.uni_order_id;

According to the documentation for hive.auto.convert.join.noconditionaltask.size, the combined size of T2 and T3 is 1.6 GB, which is larger than the noconditionaltask.size (1 GB) that I set. But the execution plan still broadcasts T2 and T3 and uses a MAP JOIN:

Explain Map 1 <- Map 3 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE)
Explain Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)

I don't quite understand why.

PS:

CODE: hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java at 89e7d4a31b32317188f91aed8ce30e0d36600acc · apache/hive
This looks like it uses the HDFS file size directly to decide whether to use a MAPJOIN.

DOC: If hive.auto.convert.join.noconditionaltask is off, this parameter does not take effect. However, if it is on, and the sum of size for n-1 of the tables/partitions for a n-way join is smaller than this size, the join is directly converted to a mapjoin (there is no conditional task). The default is 10MB.
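
To make my reading of the documented rule concrete, here is a minimal sketch of the check as I understand it, using the approximate sizes from this post. Treating 800 MB as 800,000,000 bytes is an assumption (the real byte counts will differ), and the column aliases are only illustrative. The DESCRIBE FORMATTED statements are there to compare the on-disk size with whatever statistics are recorded for the small tables.

```sql
-- Documented rule as I read it: for an n-way join, sum the sizes of the
-- n-1 smaller inputs and compare the total with noconditionaltask.size.
-- The sizes below are rough figures from this post, not measured values.
SELECT
  800000000 + 800000000                AS small_tables_bytes,  -- t2 + t3, about 1.6 GB
  1000000000                           AS threshold_bytes,     -- the value set above
  (800000000 + 800000000) < 1000000000 AS below_threshold;     -- false by this arithmetic

-- Show the sizes recorded for the small tables; totalSize and rawDataSize
-- are listed under "Table Parameters" when statistics are present.
DESCRIBE FORMATTED t2;
DESCRIBE FORMATTED t3;
```

By this arithmetic the join should not be converted directly, which is why the plan surprises me. If the recorded statistics turn out to be much smaller than the 800 MB on HDFS, that might be related, but I have not confirmed which figure the planner actually compares against the threshold.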