Hi all,

I'm trying to understand the different Hive join optimizations. I get the general idea that we're trying to limit the shuffling of key-value pairs from mappers to reducers, but I can't grasp the idea behind SMB (sort-merge-bucket) joins.
For example: Table A has four columns (user_id, col2, col3, col4) and is bucketed into 4 buckets on user_id. Table B has three columns (user_id, col2, col3) and is also bucketed into 4 buckets on user_id. Table A is very large compared to Table B. The 4 buckets of Table A are files in HDFS, and each file is made up of multiple blocks. (I've sketched the DDL and query I have in mind below, after my signature.)

When I join Table A and B on user_id, my understanding is that Hive launches an MR job where:

- Each bucket of Table A is read by multiple mappers (one per block making up that bucket/file).
- Each of those mappers fetches the corresponding bucket of Table B (probably also made up of multiple blocks) and looks for matching records. Is that a network read?

Did I get that right? And what happens when each bucket of Table B is made up of multiple blocks?

Best Regards,
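
P.S. For concreteness, here is roughly the DDL and query I have in mind. The table and column names are just the ones from my example above, and the SET properties are the ones I believe relate to bucketed / sort-merge joins; I may have the exact properties or their defaults wrong for my Hive version, so please correct me.

    -- Properties I understand are involved in bucket map join / SMB join
    -- (older Hive versions; newer ones may enforce bucketing automatically):
    SET hive.enforce.bucketing = true;
    SET hive.optimize.bucketmapjoin = true;
    SET hive.optimize.bucketmapjoin.sortedmerge = true;
    SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

    -- Both tables bucketed (and sorted) on the join key, same bucket count:
    CREATE TABLE table_a (user_id INT, col2 STRING, col3 STRING, col4 STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;

    CREATE TABLE table_b (user_id INT, col2 STRING, col3 STRING)
    CLUSTERED BY (user_id) SORTED BY (user_id) INTO 4 BUCKETS;

    -- The join I'm reasoning about:
    SELECT a.user_id, a.col2, a.col3, a.col4, b.col2, b.col3
    FROM table_a a
    JOIN table_b b ON a.user_id = b.user_id;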