If the join is a reduce-side join,
https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query
and generate a single MR job. The optimizer introduced by HIVE-2206 is in
trunk. Currently, it only handles the case where the join and the group by
use the same column(s).
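
As a rough sketch of the reduce-side case: in builds that include the
HIVE-2206 optimizer, it is switched on with the hive.optimize.correlation
flag (off by default, as far as I know). The query is the one from the
original question, assuming the join condition was meant to be
A.id = B.id. EXPLAIN is only there to check how many MR stages the plan
contains.

```sql
-- Assumption: a Hive build that includes HIVE-2206 (the correlation optimizer).
set hive.optimize.correlation=true;
-- Keep the join on the reduce side so this optimizer is the one that applies.
set hive.auto.convert.join=false;

-- The join key and the group by key are both A.id, which is the case
-- the optimizer currently handles.
explain
select max(A.value), A.id from A join B on A.id = B.id group by A.id;
```

If the optimization applies, the plan should show the join and the
group by in the same MR stage instead of two.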

If the join is a MapJoin, Hive 0.11 can generate a single MR job (in this
case, it does not matter whether the join and the group by use the same
column(s)). To enable it, you need the following settings:
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.optimize.mapjoin.mapreduce=true;
and also make sure hive.auto.convert.join.noconditionaltask.size is larger
than the size of the small table.
For Hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the
"hive.optimize.mapjoin.mapreduce" flag. So, in future releases, you will
not need to set hive.optimize.mapjoin.mapreduce.
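
Put together as one session, the MapJoin settings might look like the
sketch below. The size value is only an illustrative assumption (it is in
bytes and has to exceed the size of the small table, B here), and the
join condition is assumed to be A.id = B.id.

```sql
-- Settings from the message above, for Hive 0.11.
set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.optimize.mapjoin.mapreduce=true;
-- Illustrative value only (bytes); pick something larger than table B.
set hive.auto.convert.join.noconditionaltask.size=10000000;

-- Inspect the plan to confirm that the join and the group by
-- end up in a single MR job.
explain
select max(A.value), A.id from A join B on A.id = B.id group by A.id;
```

On a build that includes HIVE-4827, the hive.optimize.mapjoin.mapreduce
line should no longer be needed.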

Thanks,

Yin


On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> and what version of hive are you running your test on?  i do believe - not
> certain - that hive 0.11 includes the optimization you seek.
>
>
> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song <chen.song...@gmail.com> wrote:
>
>> Suppose we have 2 simple tables
>>
>> A
>> id int
>> value string
>>
>> B
>> id
>>
>> When hive translates the following query
>>
>> select max(A.value), A.id from A join B on A.id = B.id group by A.id;
>>
>> It launches 2 stages, one for the join and one for the group by.
>>
>> My understanding is that if the join key set is a subset of the group by
>> key set, it can be achieved in the same map reduce job. If that is correct
>> in theory, could it be a feature in hive?
>>
>> Chen
>>
>>
>
