Hi, I am testing SMB join for 2 large tables. The tables are bucketed and sorted on the join column. I notice that even though the table is large, Hive attempts to generate hash table for the 'small' table locally, similar to map join. Since the table is large in my case, the client runs out of memory and the query fails.
I am using Hive 0.12 with the following settings: set hive.optimize.bucketmapjoin=true; set hive.optimize.bucketmapjoin.sortedmerge=true; set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat; My test query does a simple join and a select, no subqueries/nested queries etc. I understand why a (bucket) map join requires hash table generation, but why is that included for an SMB join? Shouldn't a SMB join just spin up one mapper for each bucket and perform a sort merge join directly on the mapper? Thanks, pala