If the join is a reduce-side join, https://issues.apache.org/jira/browse/HIVE-2206 will optimize this query and generate a single MR job. The optimizer introduced by HIVE-2206 is in trunk. Currently, it only handles the case where the join and the group by use the same column(s).
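As a rough sketch of the reduce-side case, here is the query from the question quoted below, written so that the join and the group by share the same key. Two things here are assumptions rather than something stated in this thread: the join condition is written as A.id = B.id (the original has A.id = A.id, which looks like a typo), and the flag name hive.optimize.correlation is what I believe the HIVE-2206 optimizer uses, so check it against your build.

```sql
-- Assumption: flag name for the HIVE-2206 optimizer; verify it exists in your build.
set hive.optimize.correlation=true;

-- The join key and the group-by key are both A.id, so rows are already
-- partitioned by id after the join's shuffle and the aggregation can run
-- in the same reduce phase.
-- Assumption: the join condition is A.id = B.id, not A.id = A.id.
select max(A.value), A.id
from A join B on A.id = B.id
group by A.id;
```

Running explain on the query before and after setting the flag should show whether the plan drops from two MR stages to one.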
If the join is a MapJoin, Hive 0.11 can generate a single MR job (in this case, it does not matter whether the join and the group by use the same column(s)). To enable it, you need to ...

set hive.auto.convert.join=true;
set hive.auto.convert.join.noconditionaltask=true;
set hive.optimize.mapjoin.mapreduce=true;

and also make sure hive.auto.convert.join.noconditionaltask.size is larger than the size of the small table.

For Hive trunk, https://issues.apache.org/jira/browse/HIVE-4827 drops the "hive.optimize.mapjoin.mapreduce" flag. So, in future releases, you will not need to set hive.optimize.mapjoin.mapreduce.

Thanks,

Yin

On Thu, Aug 1, 2013 at 5:32 PM, Stephen Sprague <sprag...@gmail.com> wrote:

> and what version of hive are you running your test on? i do believe - not
> certain - that hive 0.11 includes the optimization you seek.
>
>
> On Thu, Aug 1, 2013 at 10:19 AM, Chen Song <chen.song...@gmail.com> wrote:
>
>> Suppose we have 2 simple tables
>>
>> A
>> id int
>> value string
>>
>> B
>> id
>>
>> When hive translates the following query
>>
>> select max(A.value), A.id from A join B on A.id = A.id group by A.id;
>>
>> It launches 2 stages, one for the join and one for the group by.
>>
>> My understanding is that if the join key set is a sub set of the group by
>> key set, it can be achieved in the same map reduce job. If that is correct
>> in theory, could it be a feature in hive?
>>
>> Chen
>>
>>
>