I think Chen wanted to know why this is two phased query if I understood it
correctly

When you run a mapside join .. it just performs the join query .. after
that to execute the group by part it launches the second job.
I may be wrong but this is how I saw it whenever I executed group by
queries


On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <grover.markgro...@gmail.com>wrote:

> Hi Chen,
> I think we would need some more information.
>
> The query is referring to a table called "d" in the MAPJOIN hint but
> there is not such table in the query. Moreover, Map joins only make
> sense when the right table is the one being "mapped" (in other words,
> being kept in memory) in case of a Left Outer Join, similarly if the
> left table is the one being "mapped" in case of a Right Outer Join.
> Let me know if this is not clear, I'd be happy to offer a better
> explanation.
>
> In your query, the where clause on a column called "hour", at this
> point I am unsure if that's a column of table1 or table2. If it's
> column on table1, that predicate would get pushed up (if you have
> hive.optimize.ppd property set to true), so it could possibly be done
> in 1 MR job (I am not sure if that's presently the case, you will have
> to check the explain plan). If however, the where clause is on a
> column in the right table (table2 in your example), it can't be pushed
> up since a column of the right table can have different values before
> and after the LEFT OUTER JOIN. Therefore, the where clause would need
> to be applied in a separate MR job.
>
> This is just my understanding, the full proof answer would lie in
> checking out the explain plans and the Semantic Analyzer code.
>
> And for completeness, there is a conditional task (starting Hive 0.7)
> that will convert your joins automatically to map joins where
> applicable. This can be enabled by enabling hive.auto.convert.join
> property.
>
> Mark
>
> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song...@gmail.com> wrote:
> > I have a silly question on how Hive interpretes a simple query with both
> map
> > side join and group by.
> >
> > Below query will translate into two jobs, with the 1st one as a map only
> job
> > doing the join and storing the output in a intermediary location, and the
> > 2nd one as a map-reduce job taking the output of the 1st job as input and
> > doing the group by.
> >
> > SELECT
> > /*+ MAPJOIN(d) */
> > table.a, sum(table2.b)
> > from table
> > LEFT OUTER JOIN table2
> > ON table.id = table2.id
> > where hour = '2012-12-11 11'
> > group by table.a
> >
> > Why can't this be done within a single map reduce job? As what I can see
> > from the query plan is that all 2nd job mapper do is taking the 1st job's
> > mapper output.
> >
> > --
> > Chen Song
> >
> >
>



-- 
Nitin Pawar

Reply via email to