Hi, I know this question might have been beaten to death, but I could not find an answer to one particular aspect of it. I'm using Hive 0.10.x.
I have a table partitioned by day, and I would like to bucket it on a different column to take advantage of the sort-merge-bucket (SMB) join optimization. I have seen an earlier thread about choosing the number of buckets: http://mail-archives.apache.org/mod_mbox/hive-user/201204.mbox/%3c350967547.114894.1335230199385.javamail.r...@sms-zimbra-message-store-03.sms.scalar.ca%3E

For my current data volume, 4 or 8 buckets should suffice. But the volume is projected to increase significantly, and that is where I see a problem. In a multi-stage query, Hive's last stage creates as many reducers as the number of buckets defined in the table spec. When the volume grows, it still has to distribute all the data across that predefined number of reducers, and it runs into an out-of-memory error.

I could conservatively increase the bucket count to 16 or even 32, but on days when the volume is low, that creates too many small files. So now I have the conundrum of balancing small files against the join optimization, and I wanted to see if you have any suggestions. I'm also thinking that if the table is generating such small files anyway, maybe the SMB join optimization is not necessary. Is that observation correct?

Would appreciate your inputs.

Thanks,
K
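P.S. For reference, here is roughly the shape of the table definition, the daily load, and the SMB-related settings I'm describing. The table and column names below are just placeholders, 8 buckets reflects my current choice, and I may not have every knob right, so corrections are welcome.

-- bucketed, sorted table partitioned by day (names are placeholders)
CREATE TABLE events (
  user_id BIGINT,
  event_type STRING,
  payload STRING
)
PARTITIONED BY (day STRING)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 8 BUCKETS;

-- loading one day's partition; with bucketing enforced, the final stage
-- runs one reducer per bucket (8 here), which is where the OOM shows up
-- as the daily volume grows
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE events PARTITION (day = '2013-06-01')
SELECT user_id, event_type, payload
FROM events_staging
WHERE day = '2013-06-01';

-- settings I understand are needed on the query side for the SMB join
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true;
SET hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;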