I think Chen wanted to know why this is two phased query if I understood it correctly
When you run a mapside join .. it just performs the join query .. after that to execute the group by part it launches the second job. I may be wrong but this is how I saw it whenever I executed group by queries On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <grover.markgro...@gmail.com>wrote: > Hi Chen, > I think we would need some more information. > > The query is referring to a table called "d" in the MAPJOIN hint but > there is not such table in the query. Moreover, Map joins only make > sense when the right table is the one being "mapped" (in other words, > being kept in memory) in case of a Left Outer Join, similarly if the > left table is the one being "mapped" in case of a Right Outer Join. > Let me know if this is not clear, I'd be happy to offer a better > explanation. > > In your query, the where clause on a column called "hour", at this > point I am unsure if that's a column of table1 or table2. If it's > column on table1, that predicate would get pushed up (if you have > hive.optimize.ppd property set to true), so it could possibly be done > in 1 MR job (I am not sure if that's presently the case, you will have > to check the explain plan). If however, the where clause is on a > column in the right table (table2 in your example), it can't be pushed > up since a column of the right table can have different values before > and after the LEFT OUTER JOIN. Therefore, the where clause would need > to be applied in a separate MR job. > > This is just my understanding, the full proof answer would lie in > checking out the explain plans and the Semantic Analyzer code. > > And for completeness, there is a conditional task (starting Hive 0.7) > that will convert your joins automatically to map joins where > applicable. This can be enabled by enabling hive.auto.convert.join > property. > > Mark > > On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song...@gmail.com> wrote: > > I have a silly question on how Hive interpretes a simple query with both > map > > side join and group by. > > > > Below query will translate into two jobs, with the 1st one as a map only > job > > doing the join and storing the output in a intermediary location, and the > > 2nd one as a map-reduce job taking the output of the 1st job as input and > > doing the group by. > > > > SELECT > > /*+ MAPJOIN(d) */ > > table.a, sum(table2.b) > > from table > > LEFT OUTER JOIN table2 > > ON table.id = table2.id > > where hour = '2012-12-11 11' > > group by table.a > > > > Why can't this be done within a single map reduce job? As what I can see > > from the query plan is that all 2nd job mapper do is taking the 1st job's > > mapper output. > > > > -- > > Chen Song > > > > > -- Nitin Pawar