Why does SMB join generate hash table locally, even if input tables are large?

Pala M Muthaia Tue, 29 Jul 2014 13:57:13 -0700

Hi,

I am testing an SMB (sort-merge bucket) join on 2 large tables. Both
tables are bucketed and sorted on the join column. I notice that even
though both tables are large, Hive attempts to generate a hash table for
the 'small' table locally, similar to a map join. Since that table is
large in my case, the client runs out of memory and the query fails.


I am using Hive 0.12 with the following settings:

set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

My test query does a simple join and a select, with no subqueries or
nested queries.

I understand why a (bucket) map join requires hash table generation, but
why is that step included for an SMB join? Shouldn't an SMB join just spin
up one mapper for each bucket and perform a sort-merge join directly in
the mapper?
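
To make that expectation concrete, here is a minimal Python sketch of the
merge step I have in mind for one pair of matching buckets. It is only an
illustration, not Hive's actual implementation: the function name
sort_merge_join and the (key, value) row layout are made up for this
example. The point is that when both inputs are already sorted on the join
key, they can be joined in a single pass without building a hash table.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_join(left, right):
    """Inner-join two iterables of (key, value) rows sorted by key.

    Hypothetical helper for illustration only. Rows are streamed; the only
    buffering is the right-side values for the key currently being matched.
    """
    left_groups = groupby(left, key=itemgetter(0))
    right_groups = groupby(right, key=itemgetter(0))
    end = (None, None)  # sentinel returned when an input is exhausted

    lkey, lrows = next(left_groups, end)
    rkey, rrows = next(right_groups, end)
    while lrows is not None and rrows is not None:
        if lkey < rkey:
            # Left key has no match on the right; skip the whole group.
            lkey, lrows = next(left_groups, end)
        elif lkey > rkey:
            # Right key has no match on the left; skip the whole group.
            rkey, rrows = next(right_groups, end)
        else:
            # Keys match: emit every left/right combination for this key.
            right_vals = [value for _, value in rrows]
            for _, left_val in lrows:
                for right_val in right_vals:
                    yield (lkey, left_val, right_val)
            lkey, lrows = next(left_groups, end)
            rkey, rrows = next(right_groups, end)


if __name__ == "__main__":
    bucket_a = [(1, "a"), (2, "b"), (2, "c"), (4, "d")]
    bucket_b = [(2, "x"), (2, "y"), (3, "z"), (4, "w")]
    for row in sort_merge_join(bucket_a, bucket_b):
        print(row)
```

Memory use here depends on the number of rows sharing one key, not on the
size of either table, which is why I did not expect a local hash table
step at all.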


Thanks,
pala
