Did you restart the cluster after the reconfiguration?
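For reference: changes to conf/mapred-site.xml only take effect once the MapReduce daemons are restarted. A minimal restart sketch, assuming a standard Hadoop 0.20 tarball layout (adjust paths to your install):

  $HADOOP_HOME/bin/stop-all.sh
  $HADOOP_HOME/bin/start-all.sh

Run this on the master, then check on the jobtracker web UI that all tasktrackers have re-registered before re-running the job.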
On Fri, Oct 8, 2010 at 9:59 PM, Vincent <vincent.hervi...@gmail.com> wrote:
> I've tried these mapred.child.java.opts values:
>
>   -Xmx512m  --> still memory errors in the reduce phase
>   -Xmx1024m --> still memory errors in the reduce phase
>
> I am now trying -Xmx1536m, but I'm afraid my nodes will start to swap...
>
> Should I continue in this direction? Or is that already too much, and
> should I look for the problem somewhere else?
>
> Thanks
>
> -Vincent
>
>
> On 10/08/2010 03:04 PM, Jeff Zhang wrote:
>>
>> Try increasing the heap size of each task by setting
>> mapred.child.java.opts in mapred-site.xml. The default value is
>> -Xmx200m in mapred-default.xml, which may be too small for you.
>>
>>
>> On Fri, Oct 8, 2010 at 6:55 PM, Vincent <vincent.hervi...@gmail.com> wrote:
>>>
>>> Thanks to Dmitriy and Jeff, I've set
>>>
>>>   set default_parallel 20;
>>>
>>> at the beginning of my script, and updated 8 JOINs to the form
>>>
>>>   JOIN big BY id, small BY id USING 'replicated';
>>>
>>> Unfortunately this didn't improve the script's speed (at least it has
>>> been running for more than one hour now).
>>>
>>> But looking in the jobtracker at one of the jobs that reduces, I can
>>> see this for the map:
>>>
>>> Hadoop map task list for job_201010081314_0010 on prog7
>>> <http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081314_0010>
>>>
>>>   task_201010081314_0010_m_000000   100.00%
>>>     8-Oct-2010 14:07:44 to 14:23:11 (15mins, 27sec)
>>>     Errors: Too many fetch-failures (reported twice); counters: 8
>>>
>>> And this for the reduce:
>>>
>>> Hadoop reduce task list for job_201010081314_0010 on prog7
>>>
>>>   task_201010081314_0010_r_000000   9.72%   reduce > copy (7 of 24 at 0.01 MB/s)
>>>     started 8-Oct-2010 14:14:49
>>>     Error: GC overhead limit exceeded; counters: 7
>>>
>>>   task_201010081314_0010_r_000001   0.00%
>>>     started 8-Oct-2010 14:14:52
>>>     Error: Java heap space; counters: 0
>>>
>>>   task_201010081314_0010_r_000002   0.00%
>>>     started 8-Oct-2010 14:15:58
>>>     Error: java.io.IOException: Task process exit with nonzero status of 1.
>>>       at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>     counters: 0
>>>
>>>   task_201010081314_0010_r_000003   9.72%   reduce > copy (7 of 24 at 0.01 MB/s)
>>>     started 8-Oct-2010 14:16:58
>>>     counters: 7
>>>
>>>   task_201010081314_0010_r_000004   0.00%
>>>     started 8-Oct-2010 14:18:11
>>>     Error: GC overhead limit exceeded; counters: 0
>>>
>>>   task_201010081314_0010_r_000005   0.00%
>>>     started 8-Oct-2010 14:18:56
>>>     Error: GC overhead limit exceeded
>>>
>>> It seems like it runs out of memory... Which parameter should be
>>> increased?
>>>
>>> -Vincent
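For concreteness, the parameter under discussion lives in conf/mapred-site.xml on every node. A minimal sketch using the first value Vincent tried (-Xmx512m); the property name and default come from mapred-default.xml in Hadoop 0.20:

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <!-- JVM options for each spawned map/reduce task; the 0.20 default
         is -Xmx200m. Every task slot runs its own JVM with this heap. -->
  </property>

One caveat on pushing this higher: with the stock 2 map + 2 reduce slots per tasktracker, -Xmx1536m allows up to 4 x 1.5GB = 6GB of task heap per node, more than either of these machines has, so the swapping worry above is well founded.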
>>>
>>> On 10/08/2010 01:12 PM, Jeff Zhang wrote:
>>>>
>>>> BTW, you can look at the jobtracker web UI to see which part of the
>>>> job costs the most time.
>>>>
>>>>
>>>> On Fri, Oct 8, 2010 at 5:11 PM, Jeff Zhang <zjf...@gmail.com> wrote:
>>>>>
>>>>> No, I mean whether your MapReduce job's reduce task number is 1.
>>>>>
>>>>> And could you share your Pig script, so that others can really
>>>>> understand your problem?
>>>>>
>>>>>
>>>>> On Fri, Oct 8, 2010 at 5:04 PM, Vincent <vincent.hervi...@gmail.com> wrote:
>>>>>>
>>>>>> You are right, I didn't change this parameter, so the default from
>>>>>> src/mapred/mapred-default.xml is used:
>>>>>>
>>>>>>   <property>
>>>>>>     <name>mapred.reduce.tasks</name>
>>>>>>     <value>1</value>
>>>>>>     <description>The default number of reduce tasks per job.
>>>>>>     Typically set to 99% of the cluster's reduce capacity, so that
>>>>>>     if a node fails the reduces can still be executed in a single
>>>>>>     wave. Ignored when mapred.job.tracker is "local".
>>>>>>     </description>
>>>>>>   </property>
>>>>>>
>>>>>> It's not clear to me what the reduce capacity of my cluster is :)
>>>>>>
>>>>>> On 10/08/2010 01:00 PM, Jeff Zhang wrote:
>>>>>>>
>>>>>>> I guess your reduce number may be 1, which would make the reduce
>>>>>>> phase very slow.
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 8, 2010 at 4:44 PM, Vincent <vincent.hervi...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Well, I can see from the jobtracker that all the jobs finish
>>>>>>>> quite quickly except 2, for which the reduce phase goes really,
>>>>>>>> really slowly.
>>>>>>>>
>>>>>>>> But how can I relate a job in the Hadoop jobtracker (example:
>>>>>>>> job_201010072150_0045) to the corresponding part of the Pig
>>>>>>>> script execution?
>>>>>>>>
>>>>>>>> And what is more efficient: several small Pig scripts, or one
>>>>>>>> big Pig script? I wrote one big one to avoid loading the same
>>>>>>>> logs several times in different scripts. Maybe that is not such
>>>>>>>> a good design...
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>>
>>>>>>>> - Vincent
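On the open question of reduce capacity: it is simply the total number of reduce slots in the cluster, i.e. the number of tasktrackers times mapred.tasktracker.reduce.tasks.maximum (default 2 in 0.20). For the two machines in this thread that gives 2 x 2 = 4 reduce slots, so something like mapred.reduce.tasks=4 (or "set default_parallel 4;" in the Pig script) would be a more sensible starting point than the default of 1. The per-node slot count is configured as below; the value 2 shown is just the stock default:

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
    <!-- Reduce slots offered by this tasktracker; the cluster's reduce
         capacity is the sum of this value over all tasktrackers. -->
  </property>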
>>>>>>>>
>>>>>>>> On 10/08/2010 11:31 AM, Vincent wrote:
>>>>>>>>>
>>>>>>>>> I'm using pig-0.7.0 on hadoop-0.20.2.
>>>>>>>>>
>>>>>>>>> As for the script, well, it's more than 500 lines; if I posted
>>>>>>>>> it here, I'm not sure anybody would read it to the end :-)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/08/2010 11:26 AM, Dmitriy Ryaboy wrote:
>>>>>>>>>>
>>>>>>>>>> What version of Pig, and what does your script look like?
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 7, 2010 at 11:48 PM, Vincent <vincent.hervi...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I'm quite new to Pig/Hadoop, so maybe my cluster size will
>>>>>>>>>>> make you laugh.
>>>>>>>>>>>
>>>>>>>>>>> I wrote a Pig script that handles 1.5GB of logs in less than
>>>>>>>>>>> one hour in Pig local mode on an Intel Core 2 Duo with 3GB
>>>>>>>>>>> of RAM.
>>>>>>>>>>>
>>>>>>>>>>> Then I tried this script on a simple 2-node cluster. These 2
>>>>>>>>>>> nodes are not servers but ordinary computers:
>>>>>>>>>>> - Intel Core 2 Duo with 3GB of RAM.
>>>>>>>>>>> - Intel Quad with 4GB of RAM.
>>>>>>>>>>>
>>>>>>>>>>> Well, I was aware that Hadoop has overhead and that the job
>>>>>>>>>>> wouldn't finish in half an hour (the local-mode time divided
>>>>>>>>>>> by the number of nodes). But I was surprised to see this
>>>>>>>>>>> morning that it took 7 hours to complete!!!
>>>>>>>>>>>
>>>>>>>>>>> My configuration was made according to this link:
>>>>>>>>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>>>>>>>>>>>
>>>>>>>>>>> My question is simple: is this normal?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>> Vincent

-- 
Best Regards

Jeff Zhang
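A footnote on Vincent's earlier question about relating jobtracker jobs to the script: Pig's EXPLAIN prints the logical, physical, and MapReduce plans for an alias, which shows how the script's operators are grouped into MR jobs. A minimal sketch in the grunt shell; the aliases and input path are hypothetical, not from Vincent's script:

  grunt> logs    = LOAD 'logs' AS (id, msg);
  grunt> grouped = GROUP logs BY id PARALLEL 4;  -- per-operator reducer count
  grunt> counts  = FOREACH grouped GENERATE group, COUNT(logs);
  grunt> EXPLAIN counts;  -- dumps the plans; the MapReduce plan section
                          -- shows which operators land in which job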