A mapjoin does what you described: it builds hash tables for the smaller
tables. In recent versions of hive (like the one i am using with cloudera
cdh3u1) a mapjoin will be done for you automatically if you have your
parameters set correctly. The relevant parameters in hive-site.xml are:
hive.auto.convert.join, hive.mapjoin.maxsize and
hive.mapjoin.smalltable.filesize. On the hive command line it will tell you
that it is building the hashtable, and it will not run a reducer.

On Thu, Aug 11, 2011 at 10:25 PM, Ayon Sinha <ayonsi...@yahoo.com> wrote:

> The Mapjoin hint syntax help optimize by loading the smaller tables
> specified in the Mapjoin hint into memory. Then every small table is in
> memory of each mapper.
>
> -Ayon
> See My Photos on Flickr <http://www.flickr.com/photos/ayonsinha/>
> Also check out my Blog for answers to commonly asked 
> questions.<http://dailyadvisor.blogspot.com>
>
> ------------------------------
> *From:* "Daniel,Wu" <hadoop...@163.com>
> *To:* hive <user@hive.apache.org>
> *Sent:* Thursday, August 11, 2011 7:01 PM
> *Subject:* multiple tables join with only one hug table.
>
> if the retailer fact table is sale_fact with 10B rows, and join with 3
> small tables: stores (10K), products(10K), period (1K). What's the best join
> solution?
>
> In oracle, it can first build hash for stores, and hash for products, and
> hash for stores. Then probe using the fact table, if the row matched in
> stores, that row can go up further to map with products by hashing check, if
> pass, then go up further to try to match period. In this way, the sale_fact
> only needs to be scanned once which save lots of disk IO.  Is this doable in
> hive, if doable, what hint need to use?
>
>
>
>
>

Reply via email to