A mapjoin does what you described: it builds hash tables for the smaller tables. In recent versions of hive (like the one i am using with cloudera cdh3u1) a mapjoin will be done for you automatically if you have your parameters set correctly. The relevant parameters in hive-site.xml are: hive.auto.convert.join, hive.mapjoin.maxsize and hive.mapjoin.smalltable.filesize. On the hive command line it will tell you that it is building the hashtable, and it will not run a reducer.
On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha <ayonsi...@yahoo.com> wrote: > The Mapjoin hint syntax help optimize by loading the smaller tables > specified in the Mapjoin hint into memory. Then every small table is in > memory of each mapper. > > -Ayon > See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/> > Also check out my Blog for answers to commonly asked > questions.<http://dailyadvisor.blogspot.com> > > ------------------------------ > *From:* "Daniel,Wu" <hadoop...@163.com> > *To:* hive <user@hive.apache.org> > *Sent:* Thursday, August 11, 2011 7:01 PM > *Subject:* multiple tables join with only one hug table. > > if the retailer fact table is sale_fact with 10B rows, and join with 3 > small tables: stores (10K), products(10K), period (1K). What's the best join > solution? > > In oracle, it can first build hash for stores, and hash for products, and > hash for stores. Then probe using the fact table, if the row matched in > stores, that row can go up further to map with products by hashing check, if > pass, then go up further to try to match period. In this way, the sale_fact > only needs to be scanned once which save lots of disk IO. Is this doable in > hive, if doable, what hint need to use? > > > > >