I agree about the timeout period, Bryan.
Reporter has a progress() method the task can call to tell the
tasktracker that it's still working, so there's no need for the task to
be killed.
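
A minimal sketch using the old org.apache.hadoop.mapred API (the class
name and key/value types here are illustrative assumptions, not taken
from this thread):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class LongRunningReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      // Heartbeat: tells the tasktracker this attempt is still alive,
      // even if it goes a long time without emitting any output.
      reporter.progress();
    }
    output.collect(key, new LongWritable(sum));
  }
}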


2009/2/21 Bryan Duxbury <br...@rapleaf.com>

> We didn't customize this value, to my knowledge, so I'd suspect it's the
> default.
> -Bryan
>
>
> On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote:
>
>  How often do your reduce tasks report status?
>>
>> On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:
>>
>>  (Repost from the dev list)
>>>
>>>
>>> I noticed some really odd behavior today while reviewing the job
>>> history of some of our jobs. Our Ganglia graphs showed really long
>>> periods of inactivity across the entire cluster, which should
>>> definitely not be the case - we have a really long string of jobs in
>>> our workflow that should execute one after another. I figured out
>>> which jobs were running during those periods of inactivity, and
>>> discovered that almost all of them had 4-5 failed reduce tasks, with
>>> the reason for failure being something like:
>>>
>>> Task attempt_200902061117_3382_r_000038_0 failed to report status
>>> for 1282 seconds. Killing!
>>>
>>> The actual timeout reported varies from 700-5000 seconds. Virtually
>>> all of our longer-running jobs were affected by this problem. The
>>> period of inactivity on the cluster seems to correspond to the amount
>>> of time the job waited for these reduce tasks to fail.
>>>
>>> I checked out the tasktracker log for the machines with timed-out
>>> reduce tasks looking for something that might explain the problem,
>>> but the only thing I came up with that actually referenced the failed
>>> task was this log message, which was repeated many times:
>>>
>>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
>>> in any of the configured local directories
>>>
>>> I'm not sure what this means; can anyone shed some light on this message?
>>>
>>> Further confusing the issue, on the affected machines, I looked in
>>> logs/userlogs/<task id>, and to my surprise, the directory and log files
>>> existed, and the syslog file seemed to contain logs of a perfectly good
>>> reduce task!
>>>
>>> Overall, this seems like a pretty critical bug. It's consuming up to
>>> 50% of the runtime of our jobs in some instances, killing our
>>> throughput. At the very least, it seems like the reduce task timeout
>>> period should be MUCH shorter than the current 10-20 minutes.
>>>
>>> -Bryan
>>>
>>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>> 111 West Evelyn Ave. Ste. 202
>> Sunnyvale, CA 94086
>> www.deepdyve.com
>> 408-773-0110 ext. 738
>> 858-414-0013 (m)
>> 408-773-0220 (fax)
>>
>
>
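
For reference, the timeout Bryan mentions is controlled by the
mapred.task.timeout property (in milliseconds; the stock default is
600000, i.e. 10 minutes). A hedged sketch of lowering it through
JobConf, with a placeholder class and job name:

import org.apache.hadoop.mapred.JobConf;

public class TimeoutExample {
  public static JobConf configure() {
    // mapred.task.timeout is in milliseconds; 600000 (10 minutes) is the
    // stock default. Lowering it makes hung attempts fail sooner, at the
    // risk of killing reducers that legitimately go quiet for a while.
    JobConf conf = new JobConf(TimeoutExample.class);
    conf.setJobName("example-job");               // placeholder name
    conf.setLong("mapred.task.timeout", 300000L); // e.g. 5 minutes
    return conf;
  }
}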


-- 
M. Raşit ÖZDAŞ
