Do the joins share the same key?

2012/3/13 Bruce Bian <weidong....@gmail.com>
> Yes, it's in my hive-default.xml and Hive figured to use one reducer only,
> so I thought increasing it to 5 might help, which it doesn't.
> Anyway, scanning the largest table 6 times isn't efficient, hence my
> question.
>
>
> On Wed, Mar 14, 2012 at 12:37 AM, Jagat <jagatsi...@gmail.com> wrote:
> >
> > Hello Weidong Bian,
> >
> > Did you see the following configuration properties in the conf directory?
> >
> > <property>
> >   <name>mapred.reduce.tasks</name>
> >   <value>-1</value>
> >   <description>The default number of reduce tasks per job. Typically set
> >   to a prime close to the number of available hosts. Ignored when
> >   mapred.job.tracker is "local". Hadoop sets this to 1 by default,
> >   whereas Hive uses -1 as its default value. By setting this property
> >   to -1, Hive will automatically figure out what the number of reducers
> >   should be.
> >   </description>
> > </property>
> >
> > <property>
> >   <name>hive.exec.reducers.max</name>
> >   <value>999</value>
> >   <description>Max number of reducers that will be used. If the one
> >   specified in the configuration parameter mapred.reduce.tasks is
> >   negative, Hive will use this one as the max number of reducers when
> >   automatically determining the number of reducers.</description>
> > </property>
> >
> > Thanks and Regards,
> >
> > Jagat
> >
> >
> > On Tue, Mar 13, 2012 at 9:54 PM, Bruce Bian <weidong....@gmail.com> wrote:
> >>
> >> Hi there,
> >> When I run the following query in Hive, 6 Map/Reduce jobs are launched,
> >> one for each join, and it processes ~460M of data in ~950 seconds, which
> >> I think is way too slow for a cluster with 5 slaves and 24GB memory/12
> >> disks each.
> >> set mapred.reduce.tasks=5;
> >> SELECT a.*, e.code_name AS is_internet_flg, f.code_name AS wb_access_tp_desc, g.code_name AS free_tp_desc,
> >>        b.acnt_no, b.addr_id, b.postcode, b.acnt_rmnd_tp, b.print_tp, b.media_type,
> >>        c.cust_code, c.root_cust_code,
> >>        d.mdf_name, d.sub_bureau_code, d.bureau_cd, d.adm_sub_bureau_name, d.bureau_name
> >> FROM prc_idap_pi_root a
> >> LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id = b.acnt_id
> >> LEFT OUTER JOIN idap_pi_root_cust c ON a.cust_id = c.cust_id
> >> LEFT OUTER JOIN ocrm_vt_area d ON a.dev_area_id = d.area_id
> >> LEFT OUTER JOIN osor_code e ON a.data_internet_flg = e.code_val AND e.code_tp = 'IS_INTERNET_FLG'
> >> LEFT OUTER JOIN osor_code f ON a.wb_access_tp = f.code_val AND f.code_tp = 'WEB_ACCESS_TP'
> >> LEFT OUTER JOIN osor_code g ON a.free_tp = g.code_val AND g.code_tp = 'FREE_TP';
> >>
> >> For each job, most of the time is spent in the reduce phase. As
> >> idap_pi_root is very large, scanning it 6 times is quite inefficient.
> >> Is it possible to reduce the Map/Reduce jobs to only one?
> >> Thanks,
> >> Weidong Bian
> >
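[Editor's note, not part of the original thread.] Hive merges consecutive shuffle joins into one Map/Reduce job only when they share the same join key; since each join above uses a different key of `a`, one job per join is expected. A common workaround, assuming the six joined tables (`idap_pi_root_acnt`, `idap_pi_root_cust`, `ocrm_vt_area`, `osor_code`) are small enough to fit in memory, is to let Hive convert them to map-side joins, so the large table is streamed once with no shuffle. A sketch under that size assumption:

```sql
-- Assumption: only prc_idap_pi_root is large; the joined tables fit in memory.
-- If so, Hive can broadcast them and convert the shuffle joins to map joins.
set hive.auto.convert.join=true;                -- let Hive choose map joins
set hive.mapjoin.smalltable.filesize=25000000;  -- "small table" threshold, in bytes

-- On older Hive versions the same effect can be forced with an explicit hint
-- (abbreviated to two joins here; the pattern extends to all six):
SELECT /*+ MAPJOIN(b, e) */
       a.*, e.code_name AS is_internet_flg, b.acnt_no
FROM prc_idap_pi_root a
LEFT OUTER JOIN idap_pi_root_acnt b ON a.acnt_id = b.acnt_id
LEFT OUTER JOIN osor_code e
  ON a.data_internet_flg = e.code_val AND e.code_tp = 'IS_INTERNET_FLG';
```

If any of the joined tables is too large to broadcast, this sketch does not apply to that join, and the shuffle join for it will remain a separate job.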