Thanks Nitin. This is all I want to clarify :)

Chen

On Thu, Dec 13, 2012 at 2:30 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:

> To improve the speed of the job, they created map-only joins so that all
> the records associated with a key fall to one map; reducers slow it down.
> If a reducer has to do some more work, then they launch another job.
>
> Bear in mind, when we say map-only join, we are absolutely sure that speed
> will increase in case the data in one of the tables is in the few-hundred-MB
> range. If it also has to handle a reduce phase, the processing logic
> completely changes and it also slows down.
>
> Launching a new job for the group by is a neat way to measure how much time
> you spent on just the join and how much on the group by, so you can easily
> see the two different things.
>
> There is no way you can ask a mapjoin to launch a reducer, as it is not
> supposed to do that.
>
> If you have such a case (maybe you think that it will improve
> performance), please feel free to raise a JIRA and get it reviewed. If it's
> valid, I think people will provide more ideas.
>
>
> On Fri, Dec 14, 2012 at 12:42 AM, Chen Song <chen.song...@gmail.com> wrote:
>
>> Nitin
>>
>> Yeah. My original question is whether there is a way to force Hive (or
>> rather, whether it is possible) to execute the map-side join in the mapper
>> phase and the group by in the reduce phase. So instead of launching a
>> map-only job (join) and a map-reduce job (group by), doing it all together
>> in a single MR job. This is obviously not what Hive does, but I am
>> wondering if it would be a nice feature to have.
>>
>> The point you made (different keys in join and group by) only matters
>> in the reduce phase, right? As the map-side join takes care of the
>> join in the mapper phase, it sounds natural to me that the group by can be
>> done in the reduce phase of the same job. The only hassle that I can think
>> of is that the map output has to be re-sorted (based on the group-by keys).
>>
>> Chen
>>
>> On Thu, Dec 13, 2012 at 1:42 PM, Nitin Pawar <nitinpawar...@gmail.com> wrote:
>>
>>> Chen, in a map-side join there are no reducers; it's a MAP-ONLY job.
>>>
>>>
>>> On Thu, Dec 13, 2012 at 11:54 PM, Chen Song <chen.song...@gmail.com> wrote:
>>>
>>>> Understood the fact that it is impossible in the same MR job if both
>>>> the join and the group by are going to happen in the reduce phase
>>>> (because the join keys and group-by keys are different). But for a
>>>> map-side join, the joins would be complete by the end of the map phase,
>>>> and the outputs should be ready to be distributed to reducers based on
>>>> the group-by keys.
>>>>
>>>> Chen
>>>>
>>>>
>>>> On Thu, Dec 13, 2012 at 11:04 AM, Nitin Pawar
>>>> <nitinpawar...@gmail.com> wrote:
>>>>
>>>>> That's because the first job works on the join keys and the second
>>>>> job works on the group-by keys; you just can't assume the join keys and
>>>>> group-by keys will be the same, so they are two different jobs.
>>>>>
>>>>>
>>>>> On Thu, Dec 13, 2012 at 8:26 PM, Chen Song <chen.song...@gmail.com> wrote:
>>>>>
>>>>>> Yeah, my abridged version of the query might be a little broken, but my
>>>>>> point is that when a query has a map join and a group by, even in its
>>>>>> simplified incarnation, it will launch two jobs. I was just wondering
>>>>>> why the map join and group by cannot be accomplished in one MR job.
>>>>>>
>>>>>> Best,
>>>>>> Chen
>>>>>>
>>>>>>
>>>>>> On Thu, Dec 13, 2012 at 12:30 AM, Nitin Pawar <
>>>>>> nitinpawar...@gmail.com> wrote:
>>>>>>
>>>>>>> I think Chen wanted to know why this is a two-phase query, if I
>>>>>>> understood it correctly.
>>>>>>>
>>>>>>> When you run a map-side join, it just performs the join query;
>>>>>>> after that, to execute the group by part, it launches the second job.
>>>>>>> I may be wrong, but this is how I saw it whenever I executed group by
>>>>>>> queries.
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Dec 13, 2012 at 7:11 AM, Mark Grover <
>>>>>>> grover.markgro...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>> I think we would need some more information.
>>>>>>>>
>>>>>>>> The query refers to a table called "d" in the MAPJOIN hint, but
>>>>>>>> there is no such table in the query. Moreover, map joins only make
>>>>>>>> sense when the right table is the one being "mapped" (in other words,
>>>>>>>> being kept in memory) in the case of a Left Outer Join, and similarly
>>>>>>>> when the left table is the one being "mapped" in the case of a Right
>>>>>>>> Outer Join. Let me know if this is not clear, I'd be happy to offer a
>>>>>>>> better explanation.
>>>>>>>>
>>>>>>>> In your query, the where clause is on a column called "hour"; at this
>>>>>>>> point I am unsure whether that's a column of table1 or table2. If it's
>>>>>>>> a column of table1, that predicate would get pushed up (if you have the
>>>>>>>> hive.optimize.ppd property set to true), so it could possibly be done
>>>>>>>> in 1 MR job (I am not sure if that's presently the case, you will have
>>>>>>>> to check the explain plan). If, however, the where clause is on a
>>>>>>>> column in the right table (table2 in your example), it can't be pushed
>>>>>>>> up, since a column of the right table can have different values before
>>>>>>>> and after the LEFT OUTER JOIN. Therefore, the where clause would need
>>>>>>>> to be applied in a separate MR job.
>>>>>>>>
>>>>>>>> This is just my understanding; the foolproof answer would lie in
>>>>>>>> checking out the explain plans and the Semantic Analyzer code.
>>>>>>>>
>>>>>>>> And for completeness, there is a conditional task (starting with Hive
>>>>>>>> 0.7) that will convert your joins automatically to map joins where
>>>>>>>> applicable.
>>>>>>>> This can be turned on by enabling the hive.auto.convert.join
>>>>>>>> property.
>>>>>>>>
>>>>>>>> Mark
>>>>>>>>
>>>>>>>> On Wed, Dec 12, 2012 at 3:32 PM, Chen Song <chen.song...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> > I have a silly question on how Hive interprets a simple query
>>>>>>>> > with both a map-side join and a group by.
>>>>>>>> >
>>>>>>>> > The query below will translate into two jobs, with the 1st one as a
>>>>>>>> > map-only job doing the join and storing the output in an
>>>>>>>> > intermediary location, and the 2nd one as a map-reduce job taking
>>>>>>>> > the output of the 1st job as input and doing the group by.
>>>>>>>> >
>>>>>>>> > SELECT
>>>>>>>> > /*+ MAPJOIN(d) */
>>>>>>>> > table.a, sum(table2.b)
>>>>>>>> > from table
>>>>>>>> > LEFT OUTER JOIN table2
>>>>>>>> > ON table.id = table2.id
>>>>>>>> > where hour = '2012-12-11 11'
>>>>>>>> > group by table.a
>>>>>>>> >
>>>>>>>> > Why can't this be done within a single map-reduce job? What I can
>>>>>>>> > see from the query plan is that all the 2nd job's mappers do is take
>>>>>>>> > the 1st job's mapper output.
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Chen Song
>>>>>>>> >
>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Nitin Pawar
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Chen Song
>>>>>>
>>>>>
>>>>> --
>>>>> Nitin Pawar
>>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>
>>> --
>>> Nitin Pawar
>>>
>>
>> --
>> Chen Song
>>
>
> --
> Nitin Pawar
>

--
Chen Song
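
For anyone reading this thread later, here is a minimal HiveQL sketch of how the points raised above could be checked against the explain plan. It is only an illustration under stated assumptions, not a tested recipe: the table names fact_t and dim_t and the aliases f and d are hypothetical stand-ins for "table" and "table2" in the original query, dim_t is assumed to be the small table that fits in memory, and "hour" is assumed to be a column of the left table. The hint names the right-hand table of the LEFT OUTER JOIN, as Mark describes, and the two properties are the ones mentioned in his reply.

```sql
-- Hypothetical tables: fact_t (large, left side) and dim_t (small, right side).
-- Assumption: hour is a column of fact_t, so the filter applies to the left table.

-- Variant 1: explicit hint that names the small (right) table by its alias.
EXPLAIN
SELECT /*+ MAPJOIN(d) */
       f.a, SUM(d.b)
FROM fact_t f
LEFT OUTER JOIN dim_t d
  ON f.id = d.id
WHERE f.hour = '2012-12-11 11'
GROUP BY f.a;

-- Variant 2: no hint; let Hive convert the join automatically where applicable.
SET hive.auto.convert.join=true;
SET hive.optimize.ppd=true;

EXPLAIN
SELECT f.a, SUM(d.b)
FROM fact_t f
LEFT OUTER JOIN dim_t d
  ON f.id = d.id
WHERE f.hour = '2012-12-11 11'
GROUP BY f.a;
```

Comparing the stages listed by the two EXPLAIN outputs should show whether the join and the group by end up in separate jobs on a given installation; the exact plan depends on the Hive version and configuration, so no particular output is claimed here.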