Hi everyone, I'd like to create a change request (or a JIRA issue, I'm not sure which) -- do you think it's feasible? I searched the contribution documentation but couldn't find anything about how to file such a request. Could anyone point me to the right document?
At 2011-08-14 17:08:26, "Daniel,Wu" <hadoop...@163.com> wrote:

A simple usage: for retailer data that keeps 10 years of history, that's 10 * 365 = 3650 records in the calendar dimension. If there are 8000 stores and 8000 products, the sales table will hold 8000 * 8000 * 3650 = 233,600,000,000 records if we keep one record per product/day/store combination. To join the fact data (sales) with product, day, and store, it would be much more efficient to hash product, day, and store and then probe with the sales rows, since the sales table would only be scanned once. If this is not possible in the current implementation, how do you think we should go about creating a request to implement it in the future?

At 2011-08-14 08:58:03, "Koert Kuipers" <ko...@tresata.com> wrote:

I am not aware of any optimization that does something like that. Anyone? Also, your suggestion means 10 hash tables would have to be in memory. I think that with a normal map-reduce join in Hive you can join 10 tables at once (meaning in a single map-reduce job) if they all join on the same key.

2011/8/13 Daniel,Wu <hadoop...@163.com>

Thanks, it works, but not as efficiently as it could: suppose we join 10 small tables (s1, s2, ..., s10) with one huge table (big) in a data warehouse system, where each join is between the big table and a small table, as in a star schema. After I set the parameters as you described, the query runs as 10 map-side joins. After one map-side join completes, it writes a huge amount of data to the file system (since one side is huge), and the next map-side join then has to read that huge intermediate data to do its work. So in total we read the huge data 11 times and write it 10 times (only the last write produces a small data volume).

The best execution plan I can think of is: first build 10 hash tables, one for each small table, then loop over each row in the big table; if the row survives, output it, and if not, discard it. This way we only need to read the big data once, instead of read big data, write big data, read big data, ... The flow is:

1: build 10 hash tables
2: for each row in the big table:
       probe the row against each of the 10 hash tables
       if it matches all 10 hash tables, go to the next step (output, etc.)
       else discard the row
   end loop

At 2011-08-13 01:17:16, "Koert Kuipers" <ko...@tresata.com> wrote:

A mapjoin does what you described: it builds hash tables for the smaller tables. In recent versions of Hive (like the one I am using with Cloudera CDH3u1) a mapjoin will be done for you automatically if you have your parameters set correctly. The relevant parameters in hive-site.xml are hive.auto.convert.join, hive.mapjoin.maxsize, and hive.mapjoin.smalltable.filesize. On the Hive command line it will tell you that it is building the hashtable, and it will not run a reducer.
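(For concreteness, a minimal sketch of the settings Koert mentions, as they would be entered on the Hive CLI; the property names come from his message, while the values and the comments are only illustrative and may differ by Hive version:)

    -- or set the equivalent properties in hive-site.xml
    SET hive.auto.convert.join=true;                -- let Hive rewrite eligible joins as mapjoins
    SET hive.mapjoin.smalltable.filesize=25000000;  -- byte threshold below which a table counts as "small"
    SET hive.mapjoin.maxsize=100000;                -- cap on the in-memory side (exact semantics vary by version)

With these set, a plain JOIN against a sufficiently small table should be converted to a map-side join without any hint.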
On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha <ayonsi...@yahoo.com> wrote:

The MAPJOIN hint syntax helps optimize the query by loading the smaller tables specified in the hint into memory, so that every small table sits in the memory of each mapper.

-Ayon

From: "Daniel,Wu" <hadoop...@163.com>
To: hive <user@hive.apache.org>
Sent: Thursday, August 11, 2011 7:01 PM
Subject: multiple tables join with only one huge table

If the retailer fact table is sale_fact with 10B rows, and it joins with 3 small tables, stores (10K), products (10K), and period (1K), what is the best join solution? In Oracle, the optimizer can first build a hash table for stores, one for products, and one for period, then probe them using the fact table: if a row matches in stores, it moves on to be checked against products by hash lookup, and if it passes, it is then matched against period. This way sale_fact only needs to be scanned once, which saves a lot of disk IO. Is this doable in Hive, and if so, what hint needs to be used?
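(To make the question concrete, here is a hedged sketch of the MAPJOIN hint Ayon describes, applied to the star schema above; the join-key columns store_id, product_id, and period_id are invented for illustration and are not from the original message:)

    SELECT /*+ MAPJOIN(s, p, d) */
           f.*
    FROM sale_fact f
    JOIN stores   s ON f.store_id   = s.store_id    -- stores (~10K rows) hashed in mapper memory
    JOIN products p ON f.product_id = p.product_id  -- products (~10K rows) hashed in mapper memory
    JOIN period   d ON f.period_id  = d.period_id;  -- period (~1K rows) hashed in mapper memory

Whether Hive collapses all three map-side joins into a single scan of sale_fact, rather than chaining them through intermediate files, depends on the version and the generated plan, which is exactly what this thread is asking about.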