A simple usage example: for retailer data that keeps 10 years of history, that's 10 * 365 = 3,650 records in the calendar dimension. If there are 8,000 stores and 8,000 products, the sales fact table will have 8,000 * 8,000 * 3,650 = 233,600,000,000 records if we have one record for each product/day/store combination.
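For concreteness, a query over such a star schema might look like the sketch below; the table and column names (sales_fact, calendar_dim, store_dim, product_dim, and so on) are hypothetical, not taken from this thread:

-- Hypothetical star-schema query: one fact table joined to three dimensions
SELECT p.product_name, s.store_name, c.calendar_date, SUM(f.sales_amount)
FROM sales_fact f
  JOIN calendar_dim c ON f.date_key = c.date_key
  JOIN store_dim s ON f.store_key = s.store_key
  JOIN product_dim p ON f.product_key = p.product_key
GROUP BY p.product_name, s.store_name, c.calendar_date;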
I am not aware of any optimization that does something like that. Anyone?
Also, your suggestion means 10 hash tables would have to be in memory.
I think that with a normal map-reduce join in Hive you can join 10 tables at
once (meaning in a single map-reduce job) if they all join on the same key.
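As an illustration (table names are made up), all of the joins below share the same key, so Hive can plan them as a single map-reduce job:

-- Every join uses f.k, so Hive can merge these into one map-reduce job
SELECT f.k, d1.a, d2.b, d3.c
FROM fact f
  JOIN dim1 d1 ON f.k = d1.k
  JOIN dim2 d2 ON f.k = d2.k
  JOIN dim3 d3 ON f.k = d3.k;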
A mapjoin does what you described: it builds hash tables for the smaller
tables. In recent versions of Hive (like the one I am using with Cloudera
CDH3u1), a mapjoin will be done for you automatically if you have your
parameters set correctly. The relevant parameter in hive-site.xml is
hive.auto.convert.join.
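A rough sketch of the settings involved is below; exact names and defaults depend on your Hive version, so treat the values as assumptions to verify against your own hive-site.xml:

-- Enable automatic conversion of eligible joins into mapjoins
SET hive.auto.convert.join=true;
-- Size threshold (in bytes) below which a table is considered "small"
SET hive.mapjoin.smalltable.filesize=25000000;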
The MAPJOIN hint syntax helps optimize the join by loading the smaller tables
specified in the hint into memory, so every small table is held in memory by
each mapper.
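For example, using the hint syntax with a hypothetical small dimension table d and a large fact table f:

-- Ask Hive to load d into memory and stream f through the mappers
SELECT /*+ MAPJOIN(d) */ f.sale_id, d.store_name
FROM f JOIN d ON f.store_key = d.store_key;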
-Ayon