I have a silly question on how Hive interprets a simple query with both a map-side join and a group by.
The query below translates into two jobs: the first is a map-only job that does the join and stores its output in an intermediate location, and the second is a map-reduce job that takes the output of the first job as input and does the group by.

    SELECT /*+ MAPJOIN(table2) */ table.a, sum(table2.b)
    FROM table LEFT OUTER JOIN table2 ON table.id = table2.id
    WHERE hour = '2012-12-11 11'
    GROUP BY table.a

Why can't this be done within a single map-reduce job? From what I can see in the query plan, all the mappers of the second job do is read the output of the first job's mappers.

--
Chen Song
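
To make the question concrete, the single job being asked about can be sketched roughly as follows. This is a plain-Python illustration of the idea, not Hive code and not how Hive actually plans the query. The sample rows, the function names (`build_hash_table`, `mapper`, `reducer`, `run_single_job`) and the assumption that `hour` is a column of `table` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for `table` and `table2`; not real data.
big_rows = [
    {"id": 1, "a": "x", "hour": "2012-12-11 11"},
    {"id": 2, "a": "x", "hour": "2012-12-11 11"},
    {"id": 3, "a": "y", "hour": "2012-12-11 11"},  # no match in table2
    {"id": 4, "a": "y", "hour": "2012-12-11 12"},  # filtered out by WHERE
]
small_rows = [
    {"id": 1, "b": 10},
    {"id": 2, "b": 5},
]


def build_hash_table(rows):
    """Load the small table into memory, keyed by the join column."""
    table = defaultdict(list)
    for row in rows:
        table[row["id"]].append(row["b"])
    return table


def mapper(row, hash_table, hour):
    """Filter, join against the in-memory table, emit (group key, value)."""
    # Assumption: `hour` belongs to the big table.
    if row["hour"] != hour:
        return
    # LEFT OUTER JOIN: an unmatched row still yields one output with b = None.
    for b in hash_table.get(row["id"], [None]):
        yield row["a"], b


def reducer(key, values):
    """SUM skips NULLs and is NULL when every input is NULL."""
    present = [v for v in values if v is not None]
    return key, (sum(present) if present else None)


def run_single_job(big, small, hour):
    """One pass: map-side join in the mapper, group by in the reducer."""
    hash_table = build_hash_table(small)
    shuffled = defaultdict(list)  # stands in for the shuffle/sort phase
    for row in big:
        for key, value in mapper(row, hash_table, hour):
            shuffled[key].append(value)
    return dict(reducer(k, vs) for k, vs in shuffled.items())


print(run_single_job(big_rows, small_rows, "2012-12-11 11"))
```

With the sample rows above this prints `{'x': 15, 'y': None}`. The join happens inside the mapper and the mapper's output goes straight to the shuffle, which is the single-job shape the question has in mind.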