Re: Question on bucketed map join

Bejoy Ks Wed, 25 Jan 2012 09:28:38 -0800

Hi Avrilia
        Ideally map join is used when one of the table involved in join has 
only small amount of data that is good to be loaded in memory as a hash table 
(uses distributed cache).

Bucketed map join can be used even if none of the tables are small enough to be 
loaded on to memory. 

Bucketed map join can be used only if both the tables are bucketed on the same 
set of columns involved in join and the number of buckets in each table should 
be multiples of other.

On the execution front both executes differently. If you enable a map join, as 
mentioned the data is loaded into hash table and execution goes in that 
direction. 
set hive.auto.convert.join = true;

In the hive console you can see something similar to

Total MapReduce jobs = 3
Ended Job = 1109753214, job is filtered out (removed at runtime).
Ended Job = 1182417955, job is filtered out (removed at runtime).
Execution log at: 
/tmp/bejoy/bejoy_20120125121010_854cecf7-89cd-4de2-8d06-fced7e2a611a.log
2012-01-25 12:11:01     Starting to launch local task to process map join;      
maximum memory = 932118528
2012-01-25 12:11:01     Processing rows:        4       Hashtable size: 4       
Memory usage:   2028816 rate:   0.002
2012-01-25 12:11:01     Dump the hashtable into file: 
file:/tmp/bejoy/hive_2012-01-25_12-10-59_457_5718846433993780287/-local-10002/HashTable-Stage-3/MapJoin-1--.hashtable
2012-01-25 12:11:01     Upload 1 File to: 
file:/tmp/bejoy/hive_2012-01-25_12-10-59_457_5718846433993780287/-local-10002/HashTable-Stage-3/MapJoin-1--.hashtable
 File size: 585
2012-01-25 12:11:01     End of local task; Time Taken: 0.514 sec.
Mapred Local Task Succeeded . Convert the Join into MapJoin

Instead if you are enabling a bucketed map join, you won't see the following in 
hive console like the filtering of MR jobs at runtime,hash map creation, status 
of local map task etc. It simply executes a Map reduce job.
set hive.optimize.bucketmapjoin = true;

As explained the EXPLAIN would definitely give different Query plan for both.

Now if you enable both the joins

set hive.auto.convert.join = true;

set hive.optimize.bucketmapjoin = true;

Hive would execute the same as a map join only not as a bucketed map join.

Hope it helps!...

Regards
Bejoy.K.S

________________________________
 From: Avrilia Floratou <flora...@cs.wisc.edu>
To: user@hive.apache.org; bejoy...@yahoo.com 
Sent: Tuesday, January 24, 2012 10:39 PM
Subject: Re: Question on bucketed map join

Hi Bejoy,

Thanks a lot for your help:)  I'm still a little confused.

In my script I always have the
set.auto.convert.join = true.

Then I did what you suggested:

1st -> set hive.optimize.bucketmapjoin=true
2nd ->  set hive.optimize.bucketmapjoin=false

I ran explain with these 2 options and the output is the same. No difference in 
the plan….

Regarding my 2nd comment about hive output in the 2 cases being same or 
different, I was not referring to the actual result which should be the same in 
both cases
but the logs that hive prints. For example when a normal map-side join is 
executed I can see that first a hash table is created, the time it took to 
create and distribute it and so on. I was
asking if the log structure should be the exact same when a bucketed map join 
is executed.

Thanks a lot,
Avrilia

On Jan 19, 2012, at 11:34 AM, bejoy...@yahoo.com wrote:

Corrected a few typos in previous mail
>
>Hi Avrila
>Hi Avrila
>       AFAIK the bucketed map join is not default in hive and it happens only 
>when the configuration parameter hive.optimize.bucketmapjoin  is set to true. 
>You may be getting the same execution plan because hive.optimize.bucketmapjoin 
> is set to true  in the hive configuration xml file. To cross confirm the same 
>could you explicitly set this to false
>(set hive.optimize.bucketmapjoin = false;
>) in your hive session and get the query execution plan from explain command. 
>Please find some pointers in line
>1. Should I see sth different in the explain extended output if I set and 
>unset the hive.optimize.bucketmapjoin option?
>[Bejoy]Yes, you should be seeing different plans for both.
>Try EXPLAIN your join query after setting this
>set hive.optimize.bucketmapjoin = false;
>
>2. Should I see something different in the output of hive while running the 
>query if again I set and unset the hive.optimize.bucketmapjoin?
>[Bejoy] No,Hive output should be the same. What ever is the execution plan for 
>an join, optimally the end result should be same.
>
>3. Is it possible that even though I set bucketmapjoin to true, Hive will 
>still perform a normal map-side join for some reason? How can I check if this 
>has actually happened?
>[Bejoy] Hive would perform a plain map side join only if the following 
>parameter is enabled. (default it is disabled)
>set hive.auto.convert.join = true; you need to check this value in your 
>configurations.
>If it is enabled irrespective of the table size hive would always try a map 
>join, it would come to a normal join only after the map join attempt fails.
>AFAIK, if the number of buckets are same or multiples between the two tables 
>involved in a join and if the join is on the same columns that are bucketed, 
>with bucketmapjoin enabled it shouldn't execute a plain mapside join but a 
>bucketed map side join would be triggered.
>
>Hope it helps!..
>
>
>Regards
>Bejoy K S
>________________________________
>
>From:  Bejoy Ks <bejoy...@yahoo.com> 
>Date: Thu, 19 Jan 2012 09:22:08 -0800 (PST)
>To: user@hive.apache.org<user@hive.apache.org>
>ReplyTo:  user@hive.apache.org 
>Subject: Re: Question on bucketed map join
>
>
>Hi Avrila
>       AFAIK the bucketed map join is not default in hive and it happens only 
>when the values is set to true. It could be because the same value is already 
>set in the hive configuration xml file. To cross confirm the same could you 
>explicitly set this to false 
>
>(set hive.optimize.bucketmapjoin = false;)and get the query execution plan 
>from explain command. 
>
>
>
>Please some pointers in line
>
>
>1. Should I see sth different in the explain extended output if I set and 
>unset the hive.optimize.bucketmapjoin option?
>[Bejoy] you should be seeing the same
>Try EXPLAIN your join query after setting this
>set hive.optimize.bucketmapjoin = false;
>
>
>
>2. Should I see something different in the output of hive while running 
the query if again I set and unset the hive.optimize.bucketmapjoin?
>[Bejoy] No,Hive output should be the same. What ever is the execution plan for 
>an join, optimally the end result should be same. 
>
>
>3.
 Is it possible that even though I set bucketmapjoin to true, Hive will 
still perform a normal map-side join for some reason? How can I check if
 this has actually happened?
>[Bejoy] Hive would perform a plain map side join only if the following 
>parameter is enabled. (default it is disabled)
>
>set hive.auto.convert.join = true; you need to check this value in your 
>configurations.
>If it is enabled irrespective of the table size hive would always try a map 
>join, it would come to a normal join only after the map join attempt fails.
>AFAIK, if the number of buckets are same or multiples between the two tables 
>involved in a join and if the join is on the same columns that are bucketed, 
>with bucketmapjoin enabled it shouldn't execute a plain mapside join a 
>bucketed map side join would be triggered.
>
>
>Hope it helps!..
>
>
>Regards
>Bejoy.K.S
>
>
>
>
>________________________________
> From: Avrilia Floratou <flora...@cs.wisc.edu>
>To: user@hive.apache.org 
>Sent: Thursday, January 19, 2012 9:23 PM
>Subject: Question on bucketed map join
> 
>Hi,
>
>I have two tables with 8 buckets each on the same key and want to join them.
>I ran "explain extended" and get the plan produced by HIVE which shows that a 
>map-side join is a possible plan.
>
>I then set in my script the hive.optimize.bucketmapjoin option to true and 
>reran the "explain extended" query. I get the exact same plans as output.
>
>I ran the query with and without the bucketmapjoin optimization and saw no 
>difference in the running time.
>
>I have the following questions:
>
>1. Should I see sth different in the explain extended output if I set and 
>unset the hive.optimize.bucketmapjoin option?
>
>2. Should I see something different in the output of hive while running the 
>query if again I set and unset the hive.optimize.bucketmapjoin?
>
>3. Is it possible that even though I set bucketmapjoin to true, Hive will 
>still perform a normal map-side join for some reason? How can I check if this 
>has actually
 happened?
>
>Thanks,
>Avrilia 
>
>

Re: Question on bucketed map join

Reply via email to