Did you restart the cluster after the reconfiguration?
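
If not, note that changes in mapred-site.xml only take effect after the
MapReduce daemons are restarted. A minimal sketch, assuming a stock
0.20.2 layout under $HADOOP_HOME on the master node:

  $HADOOP_HOME/bin/stop-mapred.sh
  $HADOOP_HOME/bin/start-mapred.sh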

On Fri, Oct 8, 2010 at 9:59 PM, Vincent <vincent.hervi...@gmail.com> wrote:
>  I've tried with mapred.child.java.opts value:
> -Xmx512m --> still memory errors in reduce phase
> -Xmx1024m --> still memory errors in reduce phase
> I am now trying with -Xmx1536m, but I'm afraid that my nodes will
> start swapping...
>
> Should I continue in this direction? Or is it already too much, and
> should I look for the problem somewhere else?
>
> Thanks
>
> -Vincent
>
>
> On 10/08/2010 03:04 PM, Jeff Zhang wrote:
>>
>> Try to increase the heap size of each task by setting
>> mapred.child.java.opts in mapred-site.xml. The default value is
>> -Xmx200m (from mapred-default.xml), which may be too small for you.
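>>
>> For example, in mapred-site.xml on every node (just a sketch; the
>> 512m value is a guess, pick a heap your machines can actually spare):
>>
>> <property>
>>   <name>mapred.child.java.opts</name>
>>   <value>-Xmx512m</value>
>> </property>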
>>
>>
>>
>> On Fri, Oct 8, 2010 at 6:55 PM, Vincent<vincent.hervi...@gmail.com>
>>  wrote:
>>>
>>>
>>> Thanks to Dmitriy and Jeff, I've set:
>>>
>>> set default_parallel 20; at the beginning of my script.
>>>
>>> I also updated 8 JOINs to the form:
>>>
>>> JOIN big BY id, small BY id USING 'replicated';
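>>>
>>> For reference, the full fragment-replicate join pattern looks like
>>> this (the aliases, paths and schemas below are made up):
>>>
>>> big    = LOAD 'big_logs'  AS (id:chararray, value:chararray);
>>> small  = LOAD 'small_dim' AS (id:chararray, name:chararray);
>>> joined = JOIN big BY id, small BY id USING 'replicated';
>>>
>>> Every relation after the first is loaded into memory on each map
>>> task, so 'small' really has to be small.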
>>>
>>> Unfortunately this didn't improve the script speed (it has already
>>> been running for more than one hour now).
>>>
>>> But looking in the jobtracker at one of the jobs that has a reduce
>>> phase, I can see this for the map:
>>>
>>>
>>> Hadoop map task list for job_201010081314_0010 on prog7
>>> (http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081314_0010)
>>>
>>> All Tasks
>>>
>>> Task                             Complete  Start Time           Finish Time                          Errors                        Counters
>>> task_201010081314_0010_m_000000  100.00%   8-Oct-2010 14:07:44  8-Oct-2010 14:23:11 (15mins, 27sec)  Too many fetch-failures (x2)  8
>>>
>>> And I can see this for the reduce:
>>>
>>> Hadoop reduce task list for job_201010081314_0010 on prog7
>>> (http://prog7.lan:50030/jobdetails.jsp?jobid=job_201010081314_0010)
>>>
>>> All Tasks
>>>
>>> Task                             Complete  Status                                Start Time           Errors                             Counters
>>> task_201010081314_0010_r_000000  9.72%     reduce > copy (7 of 24 at 0.01 MB/s)  8-Oct-2010 14:14:49  Error: GC overhead limit exceeded  7
>>> task_201010081314_0010_r_000001  0.00%                                           8-Oct-2010 14:14:52  Error: Java heap space             0
>>> task_201010081314_0010_r_000002  0.00%                                           8-Oct-2010 14:15:58  java.io.IOException (see below)    0
>>> task_201010081314_0010_r_000003  9.72%     reduce > copy (7 of 24 at 0.01 MB/s)  8-Oct-2010 14:16:58                                     7
>>> task_201010081314_0010_r_000004  0.00%                                           8-Oct-2010 14:18:11  Error: GC overhead limit exceeded  0
>>> task_201010081314_0010_r_000005  0.00%                                           8-Oct-2010 14:18:56  Error: GC overhead limit exceeded
>>>
>>> The full error of task_201010081314_0010_r_000002 was:
>>>
>>> java.io.IOException: Task process exit with nonzero status of 1.
>>>        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
>>>
>>> Seems like it runs out of memory... Which parameter should be increased?
>>>
>>> -Vincent
>>>
>>>
>>> On 10/08/2010 01:12 PM, Jeff Zhang wrote:
>>>>
>>>> BTW, you can look at the jobtracker web UI to see which part of the
>>>> job costs the most time.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 8, 2010 at 5:11 PM, Jeff Zhang<zjf...@gmail.com>    wrote:
>>>>>
>>>>> No, I mean: is your MapReduce job's reduce task number 1?
>>>>>
>>>>> And could you share your Pig script, so that others can really
>>>>> understand your problem?
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 8, 2010 at 5:04 PM, Vincent<vincent.hervi...@gmail.com>
>>>>>  wrote:
>>>>>>
>>>>>> You are right, I didn't change this parameter, so the default from
>>>>>> src/mapred/mapred-default.xml is used:
>>>>>>
>>>>>> <property>
>>>>>>   <name>mapred.reduce.tasks</name>
>>>>>>   <value>1</value>
>>>>>>   <description>The default number of reduce tasks per job.
>>>>>>   Typically set to 99% of the cluster's reduce capacity, so that
>>>>>>   if a node fails the reduces can still be executed in a single
>>>>>>   wave. Ignored when mapred.job.tracker is "local".
>>>>>>   </description>
>>>>>> </property>
>>>>>>
>>>>>> It's not clear to me what the reduce capacity of my cluster is :)
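>>>>>>
>>>>>> (If I understand correctly, it is the number of TaskTrackers times
>>>>>> mapred.tasktracker.reduce.tasks.maximum. In Pig the reducer count
>>>>>> can also be forced per statement with the PARALLEL clause; a
>>>>>> sketch with a made-up alias:
>>>>>>
>>>>>> grouped = GROUP logs BY id PARALLEL 20;
>>>>>>
>>>>>> where "logs" stands for whatever relation is being grouped.)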
>>>>>>
>>>>>> On 10/08/2010 01:00 PM, Jeff Zhang wrote:
>>>>>>>
>>>>>>> I guess maybe your reduce number is 1, which makes the reduce
>>>>>>> phase very slow.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 8, 2010 at 4:44 PM, Vincent<vincent.hervi...@gmail.com>
>>>>>>>  wrote:
>>>>>>>>
>>>>>>>> Well, I can see from the jobtracker that all the jobs finish
>>>>>>>> quite quickly except 2, for which the reduce phase goes really,
>>>>>>>> really slowly.
>>>>>>>>
>>>>>>>> But how can I correlate a job in the Hadoop jobtracker (example:
>>>>>>>> job_201010072150_0045) with the Pig script execution?
>>>>>>>>
>>>>>>>> And which is more efficient: several small Pig scripts, or one
>>>>>>>> big Pig script? I wrote one big one to avoid loading the same
>>>>>>>> logs several times in different scripts. Maybe it is not such a
>>>>>>>> good design...
>>>>>>>>
>>>>>>>> Thanks for your help.
>>>>>>>>
>>>>>>>> - Vincent
>>>>>>>>
>>>>>>>>
>>>>>>>> On 10/08/2010 11:31 AM, Vincent wrote:
>>>>>>>>>
>>>>>>>>>  I'm using pig-0.7.0 on hadoop-0.20.2.
>>>>>>>>>
>>>>>>>>> As for the script, well, it's more than 500 lines; I'm not sure
>>>>>>>>> anybody would read it to the end if I posted it here :-)
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 10/08/2010 11:26 AM, Dmitriy Ryaboy wrote:
>>>>>>>>>>
>>>>>>>>>> What version of Pig, and what does your script look like?
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 7, 2010 at 11:48 PM,
>>>>>>>>>> Vincent<vincent.hervi...@gmail.com>
>>>>>>>>>>  wrote:
>>>>>>>>>>
>>>>>>>>>>>  Hi All,
>>>>>>>>>>>
>>>>>>>>>>> I'm quite new to Pig/Hadoop. So maybe my cluster size will make
>>>>>>>>>>> you
>>>>>>>>>>> laugh.
>>>>>>>>>>>
>>>>>>>>>>> I wrote a Pig script that handles 1.5GB of logs in less than
>>>>>>>>>>> one hour in Pig local mode on an Intel Core 2 Duo with 3GB of
>>>>>>>>>>> RAM.
>>>>>>>>>>>
>>>>>>>>>>> Then I tried this script on a simple 2-node cluster. These 2
>>>>>>>>>>> nodes are not servers but plain desktop machines:
>>>>>>>>>>> - an Intel Core 2 Duo with 3GB of RAM
>>>>>>>>>>> - an Intel Quad with 4GB of RAM
>>>>>>>>>>>
>>>>>>>>>>> Well, I was aware that Hadoop has overhead and that the job
>>>>>>>>>>> wouldn't be done in half an hour (the local-mode time divided
>>>>>>>>>>> by the number of nodes). But I was surprised to see this
>>>>>>>>>>> morning that it took 7 hours to complete!!!
>>>>>>>>>>>
>>>>>>>>>>> My configuration was made according to this guide:
>>>>>>>>>>>
>>>>>>>>>>> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>>>>>>>>>>>
>>>>>>>>>>> My question is simple: is this normal?
>>>>>>>>>>>
>>>>>>>>>>> Cheers
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Vincent
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>
>>>>> --
>>>>> Best Regards
>>>>>
>>>>> Jeff Zhang
>>>>>
>>>>
>>>
>>
>>
>
>



-- 
Best Regards

Jeff Zhang
