Re: Why does SMB join generate hash table locally, even if input tables are large?

Pala M Muthaia Thu, 14 Aug 2014 16:13:23 -0700

If anybody is interested:

To enable SMB join, in addition to the config values listed in my original
message (quoted below), I had to set the following as well:


set hive.auto.convert.sortmerge.join = true;

By default, the value was false. After setting it to true, I saw a map-only
job as expected.
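
Putting the settings from this thread together, a minimal sketch of a session
might look like the following. The table and column names (big_a, big_b, id,
val) are hypothetical placeholders, not the tables from the actual test, and
both tables are assumed to be bucketed and sorted on the join column. If the
conversion applies, the EXPLAIN output would be expected to show a sorted
merge bucket map join operator in a map-only stage rather than a
reduce-side join.

```sql
-- Settings from the original post (Hive 0.12)
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;

-- The additional setting described in this reply (was false by default)
set hive.auto.convert.sortmerge.join=true;

-- Hypothetical tables, assumed bucketed and sorted on id.
-- Inspect the plan before running the query itself.
EXPLAIN
SELECT a.id, b.val
FROM big_a a
JOIN big_b b ON a.id = b.id;
```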


Thanks.


On Mon, Aug 4, 2014 at 6:10 PM, Pala M Muthaia <mchett...@rocketfuelinc.com>
wrote:

> Thanks for the response Navis.
>
> I tried the repro again from the beginning, and it doesn't result in hash
> table generation. I may have had some setting that enforced map join. The
> plan generated shows a conditional stage pointing to a simple map and
> reduce stage.
>
> At runtime, however, the query results in a MR job with a reduce stage
> that performs the join.
>
> Shouldn't an SMB join result in a map-only job for tables bucketed and
> sorted on the join column? Is there a size restriction on SMB join (i.e.
> does SMB join kick in only if bucket sizes are below some limit)?
>
> Thanks.
>
>
>
> On Sun, Aug 3, 2014 at 7:20 PM, Navis류승우 <navis....@nexr.com> wrote:
>
>> I don't think hash table generation is needed for SMB joins. Could you
>> check the result of explain extended?
>>
>> Thanks,
>> Navis
>>
>>
>> 2014-07-31 4:08 GMT+09:00 Pala M Muthaia <mchett...@rocketfuelinc.com>:
>>
>> > +hive-users
>> >
>> >
>> > On Tue, Jul 29, 2014 at 1:56 PM, Pala M Muthaia <mchett...@rocketfuelinc.com> wrote:
>> >
>> > > Hi,
>> > >
>> > > I am testing SMB join for 2 large tables. The tables are bucketed and
>> > > sorted on the join column. I notice that even though the table is large,
>> > > Hive attempts to generate hash table for the 'small' table locally,
>> > > similar to map join. Since the table is large in my case, the client runs
>> > > out of memory and the query fails.
>> > >
>> > > I am using Hive 0.12 with the following settings:
>> > >
>> > > set hive.optimize.bucketmapjoin=true;
>> > > set hive.optimize.bucketmapjoin.sortedmerge=true;
>> > > set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
>> > >
>> > > My test query does a simple join and a select, no subqueries/nested
>> > > queries etc.
>> > >
>> > > I understand why a (bucket) map join requires hash table generation, but
>> > > why is that included for an SMB join? Shouldn't a SMB join just spin up one
>> > > mapper for each bucket and perform a sort merge join directly on the mapper?
>> > >
>> > >
>> > > Thanks,
>> > > pala
>> > >
>> > >
>> > >
>> > >
>> >
>>
>
>
