Hello team,
I have a small problem with MapJoin.
I have one large table, T1, of about 20G.
I have two small tables, T2 and T3, of about 800M each.
I run the following SQL:

SET hive.auto.convert.join.noconditionaltask.size = 1000000000; -- 1GB

explain
select count(1)
from t1        -- 20G
join t2        -- 800M
  on t1.uni_order_id = t2.uni_order_id
join t3        -- 800M
  on t1.uni_order_id = t3.uni_order_id;

According to the description of the parameter
hive.auto.convert.join.noconditionaltask.size, the combined size of T2 and T3
is 1.6GB, which is larger than the noconditionaltask.size (1GB) that I set.
But the execution plan still broadcasts both T2 and T3, using a MAP JOIN:

Explain  Map 1 <- Map 3 (BROADCAST_EDGE), Map 4 (BROADCAST_EDGE)

Explain  Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
I don't quite understand why.

PS:
CODE:
hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/CommonJoinTaskDispatcher.java
(apache/hive, at commit 89e7d4a31b32317188f91aed8ce30e0d36600acc)
This code appears to use the HDFS file size directly to decide whether to use
a MAPJOIN.
DOC:

If hive.auto.convert.join.noconditionaltask is off, this parameter does not
take effect. However, if it is on, and the sum of size for n-1 of the
tables/partitions for a n-way join is smaller than this size, the join is
directly converted to a mapjoin (there is no conditional task). The default is
10MB.
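
To make my reading of the doc concrete, here is a small Python sketch of the rule as it is worded above. It is only my interpretation, not Hive code: the function name is made up for illustration, and the sizes are the approximate on-disk sizes from my tables. Which size Hive actually compares against the threshold (file size, raw data size, or some other estimate) is exactly what I am unsure about.

```python
# Sketch of the documented rule only; this is NOT how Hive implements it.
# documented_rule_allows_mapjoin is a hypothetical helper for illustration.

MB = 1024 * 1024
GB = 1024 * MB


def documented_rule_allows_mapjoin(table_sizes, threshold):
    """Return True if the n-1 smallest inputs of an n-way join sum to less
    than the threshold, following the wording of the doc quoted above."""
    if len(table_sizes) < 2:
        return False
    # Every input except the largest one would have to be held in memory.
    small_side = sorted(table_sizes)[:-1]
    return sum(small_side) < threshold


# Approximate sizes from this post: T1 = 20G, T2 = 800M, T3 = 800M.
sizes = [20 * GB, 800 * MB, 800 * MB]
threshold = 1_000_000_000  # the value I set for noconditionaltask.size

print(documented_rule_allows_mapjoin(sizes, threshold))
```

With these numbers the two small tables add up to about 1.6GB, so under this reading I would not expect the join to be converted, which is why the plan above surprises me.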
