Any views on the problem?

From: saurabhmishra.i...@outlook.com
To: user@hive.apache.org; navis....@nexr.com
Subject: RE: Hive Query Unable to distribute load evenly in reducers
Date: Tue, 16 Oct 2012 11:23:29 +0530

If by using MapJoin you mean setting

set hive.auto.convert.join=true;

then I am already using this configuration, but to no avail... :(

Date: Tue, 16 Oct 2012 14:17:47 +0900
Subject: Re: Hive Query Unable to distribute load evenly in reducers
From: navis....@nexr.com
To: user@hive.apache.org

How about using MapJoin?

2012/10/16 Saurabh Mishra <saurabhmishra.i...@outlook.com>

No, there is apparently no heavy skewing. Another statistic I wanted to point out: the following are the approximate table contents in this 4 table join query:

tableA: 170 million (actual number; I am also exploding these records, so the number could be much, much higher)
tableB: 15
tableC: 45
tableD: 45
tableE: 45
tableF: 14000

Also, I cannot put any filter condition on tableA; the situation does not permit it. :(
Kindly suggest some alternative solution, or some Hive configuration to distribute the load better across the reducers.

> Date: Mon, 15 Oct 2012 16:29:56 +0100
> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> From: philip.j.trom...@gmail.com
> To: user@hive.apache.org
>
> Is your data heavily skewed towards certain values of a.x etc?
>
> On 15 October 2012 15:23, Saurabh Mishra <saurabhmishra.i...@outlook.com> wrote:
> > The queries are simple joins, something along the lines of
> > select a, b, c, count(D) from tableA join tableB on a.x=b.y join.... group by a, b,c;
> >
> >> From: liy...@gmail.com
> >> Date: Mon, 15 Oct 2012 21:10:39 +0800
> >> Subject: Re: Hive Query Unable to distribute load evenly in reducers
> >> To: user@hive.apache.org
> >>
> >> And your queries were?
> >>
> >> On Mon, Oct 15, 2012 at 8:09 PM, Saurabh Mishra
> >> <saurabhmishra.i...@outlook.com> wrote:
> >> > Hi,
> >> > I am firing some Hive queries joining tables containing up to 30 million
> >> > records each. Since the load on the reducers is very significant in these
> >> > cases, I specifically set the following parameters before executing the
> >> > queries:
> >> >
> >> > set mapred.reduce.tasks=100;
> >> > set hive.exec.reducers.bytes.per.reducer=500000000;
> >> > set hive.optimize.cp=true;
> >> >
> >> > The number of reducers the job spawns is now 160, but despite the high
> >> > number, most of the load remains on 1 or 2 reducers. Hence, in the final
> >> > statistics, 158 reducers completed within 2-3 minutes of starting and 2
> >> > reducers took 2 hrs to run.
> >> > Is there any way to overcome this load distribution disparity?
> >> > Any help in this regard will be highly appreciated.
> >> >
> >> > Sincerely
> >> > Saurabh Mishra
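
The two suggestions made in this thread (checking whether the data is skewed on the join key, and using MapJoin) can be sketched in HiveQL roughly as follows. This is only an illustration under stated assumptions, not the poster's actual query: the full join was elided in the thread, so the two-table query below is hypothetical, `x` and `y` are taken from the `a.x=b.y` fragment, the aliases `a` and `b` and the `LIMIT 20` are arbitrary, and `100000` is simply Hive's documented default for `hive.skewjoin.key`. Whether the skew settings help depends on the Hive version and on where the skew actually is.

```sql
-- 1. Diagnostic: count tableA rows per join-key value.
--    A few keys with very large counts would explain 1-2 overloaded reducers.
SELECT a.x, COUNT(*) AS cnt
FROM tableA a
GROUP BY a.x
ORDER BY cnt DESC
LIMIT 20;

-- 2. Skew-related settings (real Hive properties; values are illustrative).
SET hive.optimize.skewjoin=true;   -- handle heavily repeated join keys in a follow-up job
SET hive.skewjoin.key=100000;      -- rows per key above which a key is treated as skewed
SET hive.map.aggr=true;            -- partial aggregation on the map side
SET hive.groupby.skewindata=true;  -- spread GROUP BY work over two stages

-- 3. Explicit MapJoin hint (hypothetical two-table version of the join):
--    the small table b is loaded into memory on each mapper, so the join
--    itself needs no reduce phase.
SELECT /*+ MAPJOIN(b) */ a.x, COUNT(*) AS cnt
FROM tableA a
JOIN tableB b ON (a.x = b.y)
GROUP BY a.x;
```

One reason the reducer count alone may not help: a reduce-side join partitions rows by join key, so a key with only a handful of distinct values (tableB has 15 rows) can keep at most that many reducers busy for that join stage, however high mapred.reduce.tasks is set.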