If anybody is interested: to enable SMB join, in addition to the config values listed earlier in this thread, I had to set the following as well:

set hive.auto.convert.sortmerge.join = true;

By default, the value was false. After the above, I saw a map-only job as expected.

Thanks.

On Mon, Aug 4, 2014 at 6:10 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:

> Thanks for the response Navis.
>
> I tried the repro again from the beginning, and it doesn't result in hash
> table generation. I may have had some setting that enforced map join. The
> plan generated shows a conditional stage pointing to a simple map and
> reduce stage.
>
> At runtime, however, the query results in a MR job with a reduce stage
> that performs the join.
>
> Shouldn't SMB join result in a map only job for a table bucketed and
> sorted on join column? Is there size restriction on SMB join (i.e. SMB join
> kicks in only if bucket sizes are below some limit?)
>
> Thanks.
>
> On Sun, Aug 3, 2014 at 7:20 PM, Navis류승우 <navis....@nexr.com> wrote:
>
>> I don't think hash table generation is needed for SMB joins. Could you
>> check the result of explain extended?
>>
>> Thanks,
>> Navis
>>
>> 2014-07-31 4:08 GMT+09:00 Pala M Muthaia <mchett...@rocketfuelinc.com>:
>>
>> > +hive-users
>> >
>> > On Tue, Jul 29, 2014 at 1:56 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I am testing SMB join for 2 large tables. The tables are bucketed and
>> > > sorted on the join column. I notice that even though the table is large,
>> > > Hive attempts to generate hash table for the 'small' table locally,
>> > > similar to map join. Since the table is large in my case, the client runs
>> > > out of memory and the query fails.
>> > >
>> > > I am using Hive 0.12 with the following settings:
>> > >
>> > > set hive.optimize.bucketmapjoin=true;
>> > > set hive.optimize.bucketmapjoin.sortedmerge=true;
>> > > set hive.input.format = org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>> > >
>> > > My test query does a simple join and a select, no subqueries/nested
>> > > queries etc.
>> > >
>> > > I understand why a (bucket) map join requires hash table generation, but
>> > > why is that included for an SMB join? Shouldn't a SMB join just spin up one
>> > > mapper for each bucket and perform a sort merge join directly on the mapper?
>> > >
>> > > Thanks,
>> > > pala
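
For reference, the settings mentioned in this thread can be collected into a single session roughly as in the sketch below. The table and column names (orders_bucketed, customers_bucketed, customer_id) and the bucket count of 32 are hypothetical and only illustrate the shape of the setup; this is not the exact repro from the thread. It assumes both tables are bucketed and sorted on the join column with the same number of buckets.

```sql
-- Hypothetical tables, both bucketed and sorted on the join column.
CREATE TABLE orders_bucketed (customer_id BIGINT, amount DOUBLE)
CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS;

CREATE TABLE customers_bucketed (customer_id BIGINT, name STRING)
CLUSTERED BY (customer_id) SORTED BY (customer_id ASC) INTO 32 BUCKETS;

-- Settings from the original message in this thread.
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- Additional setting from the follow-up (reported as false by default).
set hive.auto.convert.sortmerge.join=true;

-- Inspect the plan before running the join, as suggested in the thread.
EXPLAIN EXTENDED
SELECT o.customer_id, c.name, o.amount
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;
```

If the conversion applies, the plan should show the join inside a map stage with no reduce stage, which matches the map-only job reported above.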