Hi Avrilia Ideally map join is used when one of the table involved in join has only small amount of data that is good to be loaded in memory as a hash table (uses distributed cache).
Bucketed map join can be used even if none of the tables are small enough to be loaded on to memory. Bucketed map join can be used only if both the tables are bucketed on the same set of columns involved in join and the number of buckets in each table should be multiples of other. On the execution front both executes differently. If you enable a map join, as mentioned the data is loaded into hash table and execution goes in that direction. set hive.auto.convert.join = true; In the hive console you can see something similar to Total MapReduce jobs = 3 Ended Job = 1109753214, job is filtered out (removed at runtime). Ended Job = 1182417955, job is filtered out (removed at runtime). Execution log at: /tmp/bejoy/bejoy_20120125121010_854cecf7-89cd-4de2-8d06-fced7e2a611a.log 2012-01-25 12:11:01 Starting to launch local task to process map join; maximum memory = 932118528 2012-01-25 12:11:01 Processing rows: 4 Hashtable size: 4 Memory usage: 2028816 rate: 0.002 2012-01-25 12:11:01 Dump the hashtable into file: file:/tmp/bejoy/hive_2012-01-25_12-10-59_457_5718846433993780287/-local-10002/HashTable-Stage-3/MapJoin-1--.hashtable 2012-01-25 12:11:01 Upload 1 File to: file:/tmp/bejoy/hive_2012-01-25_12-10-59_457_5718846433993780287/-local-10002/HashTable-Stage-3/MapJoin-1--.hashtable File size: 585 2012-01-25 12:11:01 End of local task; Time Taken: 0.514 sec. Mapred Local Task Succeeded . Convert the Join into MapJoin Instead if you are enabling a bucketed map join, you won't see the following in hive console like the filtering of MR jobs at runtime,hash map creation, status of local map task etc. It simply executes a Map reduce job. set hive.optimize.bucketmapjoin = true; As explained the EXPLAIN would definitely give different Query plan for both. Now if you enable both the joins set hive.auto.convert.join = true; set hive.optimize.bucketmapjoin = true; Hive would execute the same as a map join only not as a bucketed map join. Hope it helps!... Regards Bejoy.K.S ________________________________ From: Avrilia Floratou <flora...@cs.wisc.edu> To: user@hive.apache.org; bejoy...@yahoo.com Sent: Tuesday, January 24, 2012 10:39 PM Subject: Re: Question on bucketed map join Hi Bejoy, Thanks a lot for your help:) I'm still a little confused. In my script I always have the set.auto.convert.join = true. Then I did what you suggested: 1st -> set hive.optimize.bucketmapjoin=true 2nd -> set hive.optimize.bucketmapjoin=false I ran explain with these 2 options and the output is the same. No difference in the plan…. Regarding my 2nd comment about hive output in the 2 cases being same or different, I was not referring to the actual result which should be the same in both cases but the logs that hive prints. For example when a normal map-side join is executed I can see that first a hash table is created, the time it took to create and distribute it and so on. I was asking if the log structure should be the exact same when a bucketed map join is executed. Thanks a lot, Avrilia On Jan 19, 2012, at 11:34 AM, bejoy...@yahoo.com wrote: Corrected a few typos in previous mail > >Hi Avrila >Hi Avrila > AFAIK the bucketed map join is not default in hive and it happens only >when the configuration parameter hive.optimize.bucketmapjoin is set to true. >You may be getting the same execution plan because hive.optimize.bucketmapjoin > is set to true in the hive configuration xml file. To cross confirm the same >could you explicitly set this to false >(set hive.optimize.bucketmapjoin = false; >) in your hive session and get the query execution plan from explain command. >Please find some pointers in line >1. Should I see sth different in the explain extended output if I set and >unset the hive.optimize.bucketmapjoin option? >[Bejoy]Yes, you should be seeing different plans for both. >Try EXPLAIN your join query after setting this >set hive.optimize.bucketmapjoin = false; > >2. Should I see something different in the output of hive while running the >query if again I set and unset the hive.optimize.bucketmapjoin? >[Bejoy] No,Hive output should be the same. What ever is the execution plan for >an join, optimally the end result should be same. > >3. Is it possible that even though I set bucketmapjoin to true, Hive will >still perform a normal map-side join for some reason? How can I check if this >has actually happened? >[Bejoy] Hive would perform a plain map side join only if the following >parameter is enabled. (default it is disabled) >set hive.auto.convert.join = true; you need to check this value in your >configurations. >If it is enabled irrespective of the table size hive would always try a map >join, it would come to a normal join only after the map join attempt fails. >AFAIK, if the number of buckets are same or multiples between the two tables >involved in a join and if the join is on the same columns that are bucketed, >with bucketmapjoin enabled it shouldn't execute a plain mapside join but a >bucketed map side join would be triggered. > >Hope it helps!.. > > >Regards >Bejoy K S >________________________________ > >From: Bejoy Ks <bejoy...@yahoo.com> >Date: Thu, 19 Jan 2012 09:22:08 -0800 (PST) >To: user@hive.apache.org<user@hive.apache.org> >ReplyTo: user@hive.apache.org >Subject: Re: Question on bucketed map join > > >Hi Avrila > AFAIK the bucketed map join is not default in hive and it happens only >when the values is set to true. It could be because the same value is already >set in the hive configuration xml file. To cross confirm the same could you >explicitly set this to false > >(set hive.optimize.bucketmapjoin = false;)and get the query execution plan >from explain command. > > > >Please some pointers in line > > >1. Should I see sth different in the explain extended output if I set and >unset the hive.optimize.bucketmapjoin option? >[Bejoy] you should be seeing the same >Try EXPLAIN your join query after setting this >set hive.optimize.bucketmapjoin = false; > > > >2. Should I see something different in the output of hive while running the query if again I set and unset the hive.optimize.bucketmapjoin? >[Bejoy] No,Hive output should be the same. What ever is the execution plan for >an join, optimally the end result should be same. > > >3. Is it possible that even though I set bucketmapjoin to true, Hive will still perform a normal map-side join for some reason? How can I check if this has actually happened? >[Bejoy] Hive would perform a plain map side join only if the following >parameter is enabled. (default it is disabled) > >set hive.auto.convert.join = true; you need to check this value in your >configurations. >If it is enabled irrespective of the table size hive would always try a map >join, it would come to a normal join only after the map join attempt fails. >AFAIK, if the number of buckets are same or multiples between the two tables >involved in a join and if the join is on the same columns that are bucketed, >with bucketmapjoin enabled it shouldn't execute a plain mapside join a >bucketed map side join would be triggered. > > >Hope it helps!.. > > >Regards >Bejoy.K.S > > > > >________________________________ > From: Avrilia Floratou <flora...@cs.wisc.edu> >To: user@hive.apache.org >Sent: Thursday, January 19, 2012 9:23 PM >Subject: Question on bucketed map join > >Hi, > >I have two tables with 8 buckets each on the same key and want to join them. >I ran "explain extended" and get the plan produced by HIVE which shows that a >map-side join is a possible plan. > >I then set in my script the hive.optimize.bucketmapjoin option to true and >reran the "explain extended" query. I get the exact same plans as output. > >I ran the query with and without the bucketmapjoin optimization and saw no >difference in the running time. > >I have the following questions: > >1. Should I see sth different in the explain extended output if I set and >unset the hive.optimize.bucketmapjoin option? > >2. Should I see something different in the output of hive while running the >query if again I set and unset the hive.optimize.bucketmapjoin? > >3. Is it possible that even though I set bucketmapjoin to true, Hive will >still perform a normal map-side join for some reason? How can I check if this >has actually happened? > >Thanks, >Avrilia > >