Re: map side join with group by

Chen Song Thu, 13 Dec 2012 11:51:08 -0800

Thanks Nitin. This is all I want to clarify :)

Chen


On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote:

> to improve the speed of the job they created map only joins so that all
> the records associated with a key fall to a map .. reducers slows it down.
> If the reducer has to do some more job then they launch another job.
>
> bear in mind, when we say map only join we are absolutely sure that speed
> will increase in case data in one of the tables is in the few hundred MB
> ranges. If this has to do with reduce in hand, the processing logic
> completely changes and it also slows down.
>
> Launching a new job for group by is a neat way to measure how much time
> you spent on just join and another on group by so you can easily see two
> different things.
>
> There is no way you can ask a mapjoin to launch a reducer as it is not
> supposed to do.
>
> If you have such case (may be if you think that it will improve
> performance), please feel free to raise a jira and get it reviewed. if its
> valid I think people will provide more ideas
>
>
> On Fri, Dec 14, 2012 at 12:42 AM, Chen Song <chen.song...@gmail.com>wrote:
>
>> Nitin
>>
>> Yeah. My original question is that is there a way to force Hive (or
>> rather to say, is it possible) to execute map side join at mapper phase and
>> group by in reduce phase. So instead of launching a map only job (join) and
>> map reduce job (group by), doing it altogether in a single MR job. This is
>> obviously not what Hive does but I am wondering if it is a nice feature to
>> have.
>>
>> The point you made (different keys in join and group by) only matters
>> when it is the time in reduce phase, right? As map side join takes care of
>> join at mapper phase, it sounds to me natural that group by can be done in
>> the reduce phase in the same job. The only hassle that I can think of is
>> that map output have to be resorted (based on group by keys).
>>
>> Chen
>>
>> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote:
>>
>>> chen in mapside join .. there are no reducers .. its MAP ONLY job
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <chen.song...@gmail.com>wrote:
>>>
>>>> Understood that fact that it is impossible in the same MR job if both
>>>> join and group by are gonna happen in the reduce phase (because the join
>>>> keys and group by keys are different). But for map side join, the joins
>>>> would be complete by the end of the map phase, and outputs should be ready
>>>> to be distributed to reducers based on group by keys.
>>>>
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar 
>>>> <nitinpawar...@gmail.com>wrote:
>>>>
>>>>> Thats because for the first job the join keys are different and second
>>>>> job group by keys are different, you just cant assume join keys and group
>>>>> keys will be same so they are two different jobs
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <chen.song...@gmail.com>wrote:
>>>>>
>>>>>> Yeah, my abridged version of query might be a little broken but my
>>>>>> point is that when a query has a map join and group by, even in its
>>>>>> simplified incarnation, it will launch two jobs. I was just wondering why
>>>>>> map join and group by cannot be accomplished in one MR job.
>>>>>>
>>>>>> Best,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <
>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> I think Chen wanted to know why this is two phased query if I
>>>>>>> understood it correctly
>>>>>>>
>>>>>>> When you run a mapside join .. it just performs the join query ..
>>>>>>> after that to execute the group by part it launches the second job.
>>>>>>> I may be wrong but this is how I saw it whenever I executed group by
>>>>>>> queries
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>> I think we would need some more information.
>>>>>>>>
>>>>>>>> The query is referring to a table called "d" in the MAPJOIN hint but
>>>>>>>> there is not such table in the query. Moreover, Map joins only make
>>>>>>>> sense when the right table is the one being "mapped" (in other
>>>>>>>> words,
>>>>>>>> being kept in memory) in case of a Left Outer Join, similarly if the
>>>>>>>> left table is the one being "mapped" in case of a Right Outer Join.
>>>>>>>> Let me know if this is not clear, I'd be happy to offer a better
>>>>>>>> explanation.
>>>>>>>>
>>>>>>>> In your query, the where clause on a column called "hour", at this
>>>>>>>> point I am unsure if that's a column of table1 or table2. If it's
>>>>>>>> column on table1, that predicate would get pushed up (if you have
>>>>>>>> hive.optimize.ppd property set to true), so it could possibly be
>>>>>>>> done
>>>>>>>> in 1 MR job (I am not sure if that's presently the case, you will
>>>>>>>> have
>>>>>>>> to check the explain plan). If however, the where clause is on a
>>>>>>>> column in the right table (table2 in your example), it can't be
>>>>>>>> pushed
>>>>>>>> up since a column of the right table can have different values
>>>>>>>> before
>>>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would
>>>>>>>> need
>>>>>>>> to be applied in a separate MR job.
>>>>>>>>
>>>>>>>> This is just my understanding, the full proof answer would lie in
>>>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>>>
>>>>>>>> And for completeness, there is a conditional task (starting Hive
>>>>>>>> 0.7)
>>>>>>>> that will convert your joins automatically to map joins where
>>>>>>>> applicable. This can be enabled by enabling hive.auto.convert.join
>>>>>>>> property.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > I have a silly question on how Hive interpretes a simple query
>>>>>>>> with both map
>>>>>>>> > side join and group by.
>>>>>>>> >
>>>>>>>> > Below query will translate into two jobs, with the 1st one as a
>>>>>>>> map only job
>>>>>>>> > doing the join and storing the output in a intermediary location,
>>>>>>>> and the
>>>>>>>> > 2nd one as a map-reduce job taking the output of the 1st job as
>>>>>>>> input and
>>>>>>>> > doing the group by.
>>>>>>>> >
>>>>>>>> > SELECT
>>>>>>>> > /*+ MAPJOIN(d) */
>>>>>>>> > table.a, sum(table2.b)
>>>>>>>> > from table
>>>>>>>> > LEFT OUTER JOIN table2
>>>>>>>> > ON table.id = table2.id
>>>>>>>> > where hour = '2012-12-11 11'
>>>>>>>> > group by table.a
>>>>>>>> >
>>>>>>>> > Why can't this be done within a single map reduce job? As what I
>>>>>>>> can see
>>>>>>>> > from the query plan is that all 2nd job mapper do is taking the
>>>>>>>> 1st job's
>>>>>>>> > mapper output.
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Chen Song
>>>>>>>> >
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chen Song
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>>
>>
>> --
>> Chen Song
>>
>>
>>
>
>
> --
> Nitin Pawar
>



-- 
Chen Song

Re: map side join with group by

Reply via email to