I am not aware of any optimization that does something like that. Anyone? Also your suggestion means 10 hash tables would have to be in memory.
I think that with a normal map-reduce join in hive you can join 10 tables at once (meaning in a single map-reduce) if they all join on the same key. 2011/8/13 Daniel,Wu <hadoop...@163.com> > Thanks, it works, but not as effective as possible: > > suppose we join 10 small tables (s1,s2...s10) with one huge table (big) in > a database house system (the join is between big table and small table, like > star schema), after I set the parameters as you set, it will have 10 mapside > join, after one mapside join competes, it will write huge data to file > system (as one table is huge), then the next mapside join need to read the > hug data written to do another mapside join, so totally we need to read the > huge data 11 times and write it 10 times(as the last write only return small > data volume). The best execution plan I can think of is: first build 10 > hash table: one for each small table, and loop each row in the big table, if > the row survive, just output, if not then discard, in this way we only need > to read the big data once, instead of read big data, write big data, read > big data, ... > > flow is: > 1: build 10 hash table > 2: foreach row in big table > probe the row with each of these 10 hash table > if match all these 10 hash table, go to next step (output, etc) > else discard the row. > end loop > > > At 2011-08-13 01:17:16,"Koert Kuipers" <ko...@tresata.com> wrote: > > A mapjoin does what you described: it builds hash tables for the smaller > tables. In recent versions of hive (like the one i am using with cloudera > cdh3u1) a mapjoin will be done for you automatically if you have your > parameters set correctly. The relevant parameters in hive-site.xml are: > hive.auto.convert.join, hive.mapjoin.maxsize and > hive.mapjoin.smalltable.filesize. On the hive command line it will tell you > that it is building the hashtable, and it will not run a reducer. > > On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha <ayonsi...@yahoo.com> wrote: > >> The Mapjoin hint syntax help optimize by loading the smaller tables >> specified in the Mapjoin hint into memory. Then every small table is in >> memory of each mapper. >> >> -Ayon >> See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/> >> Also check out my Blog for answers to commonly asked >> questions.<http://dailyadvisor.blogspot.com> >> >> ------------------------------ >> *From:* "Daniel,Wu" <hadoop...@163.com> >> *To:* hive <user@hive.apache.org> >> *Sent:* Thursday, August 11, 2011 7:01 PM >> *Subject:* multiple tables join with only one hug table. >> >> if the retailer fact table is sale_fact with 10B rows, and join with 3 >> small tables: stores (10K), products(10K), period (1K). What's the best join >> solution? >> >> In oracle, it can first build hash for stores, and hash for products, and >> hash for stores. Then probe using the fact table, if the row matched in >> stores, that row can go up further to map with products by hashing check, if >> pass, then go up further to try to match period. In this way, the sale_fact >> only needs to be scanned once which save lots of disk IO. Is this doable in >> hive, if doable, what hint need to use? >> >> >> >> >> > > >