Re: Re: multiple tables join with only one hug table.

Koert Kuipers Sat, 13 Aug 2011 17:58:35 -0700

I am not aware of any optimization that does something like that. Anyone?
Also your suggestion means 10 hash tables would have to be in memory.


I think that with a normal map-reduce join in hive you can join 10 tables at
once (meaning in a single map-reduce) if they all join on the same key.

2011/8/13 Daniel,Wu <hadoop...@163.com>

> Thanks, it works, but not as effective as possible:
>
> suppose we join 10 small tables (s1,s2...s10) with one huge table (big) in
> a database house system (the join is between big table and small table, like
> star schema), after I set the parameters as you set, it will have 10 mapside
> join, after one mapside join competes, it will write huge data to file
> system (as one table is huge), then the next mapside join need to read the
> hug data written to do another mapside join, so totally we need to read the
> huge data 11 times and write it 10 times（as the last write only return small
> data volume).   The best execution plan I can think of is: first build 10
> hash table: one for each small table, and loop each row in the big table, if
> the row survive, just output, if not then discard, in this way we only need
> to read the big data once, instead of read big data, write big data, read
> big data, ...
>
> flow is:
> 1: build 10 hash table
> 2: foreach row in big table
>          probe the row with each of these 10 hash table
>          if match all these 10 hash table, go to next step (output, etc)
>          else discard the row.
>     end loop
>
>
> At 2011-08-13 01:17:16,"Koert Kuipers" <ko...@tresata.com> wrote:
>
> A mapjoin does what you described: it builds hash tables for the smaller
> tables. In recent versions of hive (like the one i am using with cloudera
> cdh3u1) a mapjoin will be done for you automatically if you have your
> parameters set correctly. The relevant parameters in hive-site.xml are:
> hive.auto.convert.join, hive.mapjoin.maxsize and
> hive.mapjoin.smalltable.filesize. On the hive command line it will tell you
> that it is building the hashtable, and it will not run a reducer.
>
> On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha <ayonsi...@yahoo.com> wrote:
>
>> The Mapjoin hint syntax help optimize by loading the smaller tables
>> specified in the Mapjoin hint into memory. Then every small table is in
>> memory of each mapper.
>>
>> -Ayon
>> See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
>> Also check out my Blog for answers to commonly asked 
>> questions.<http://dailyadvisor.blogspot.com>
>>
>> ------------------------------
>> *From:* "Daniel,Wu" <hadoop...@163.com>
>> *To:* hive <user@hive.apache.org>
>> *Sent:* Thursday, August 11, 2011 7:01 PM
>> *Subject:* multiple tables join with only one hug table.
>>
>> if the retailer fact table is sale_fact with 10B rows, and join with 3
>> small tables: stores (10K), products(10K), period (1K). What's the best join
>> solution?
>>
>> In oracle, it can first build hash for stores, and hash for products, and
>> hash for stores. Then probe using the fact table, if the row matched in
>> stores, that row can go up further to map with products by hashing check, if
>> pass, then go up further to try to match period. In this way, the sale_fact
>> only needs to be scanned once which save lots of disk IO.  Is this doable in
>> hive, if doable, what hint need to use?
>>
>>
>>
>>
>>
>
>
>

Re: Re: multiple tables join with only one hug table.

Reply via email to