Hi all,

I'm trying to understand the different Hive join optimizations. I get the general idea that we're trying to limit the shuffling of key-value pairs from mappers to reducers, but I can't grasp the idea behind SMB (sort-merge-bucket) joins.
For example: Table A has four columns (user_id, col2, col3, col4) and is bucketed into 4 buckets on user_id. Table B has three columns (user_id, col2, col3) and is also bucketed into 4 buckets on user_id. Table A is very large compared to Table B. The 4 buckets of Table A are files in HDFS, and each file is made up of multiple blocks. (I've sketched the DDL and query I have in mind below, after my signature.)

When I join Table A and B on user_id, my understanding is that Hive launches an MR job where:

- Each bucket of Table A is read by multiple mappers (one per block making up that bucket/file).
- Each of those mappers fetches the corresponding bucket of Table B (probably also made up of multiple blocks) and looks for matching records. Is that a network read?

Did I get that right? And what happens when each bucket of Table B is made up of multiple blocks?

Best Regards,
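
P.S. For concreteness, here is roughly the DDL and query I have in mind. The table and column names are just the ones from my example above, and the SET properties are the ones I believe relate to bucketed / sort-merge joins; I may have the exact properties or their defaults wrong for my Hive version, so please correct me.

    -- Properties I understand are involved in bucket map join / SMB join
    -- (older Hive versions; newer ones may enforce bucketing automatically):
    SET hive.enforce.bucketing = true;
    SET hive.optimize.bucketmapjoin = true;
    SET hive.optimize.bucketmapjoin.sortedmerge = true;
    SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

    -- Both tables bucketed (and sorted) on the join key, same bucket count:
    CREATE TABLE table_a (user_id INT, col2 STRING, col3 STRING, col4 STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;

    CREATE TABLE table_b (user_id INT, col2 STRING, col3 STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;

    -- The join I'm reasoning about:
    SELECT a.user_id, a.col2, a.col3, a.col4, b.col2, b.col3
    FROM table_a a
    JOIN table_b b ON a.user_id = b.user_id;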