Sorry, I should add the TaskTracker log messages I'm seeing relating to such a hung task:
2009-03-26 08:16:32,152 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200903111131_0766_r_000007_0
2009-03-26 08:16:32,986 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,146 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,361 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903111131_0766_r_000007_0: Task attempt_200903111131_0766_r_000007_0 failed to report status for 2127 seconds. Killing!
2009-03-26 08:51:59,374 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,375 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903111131_0766_r_000007_0 done; removing files.
2009-03-26 08:51:59,464 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: attempt_200903111131_0766_r_000007_0. Ignored.

And the relevant JobTracker logs:

2009-03-26 08:16:32,150 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903111131_0766_r_000007_0' to tip task_200903111131_0766_r_000007, for tracker 'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'
2009-03-26 08:52:04,365 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200903111131_0766_r_000007_0: Task attempt_200903111131_0766_r_000007_0 failed to report status for 2127 seconds. Killing!
2009-03-26 08:52:04,367 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903111131_0766_r_000007_0' from 'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'

On Thu, Mar 26, 2009 at 11:51 AM, John Lee <j.benlin....@gmail.com> wrote:
> I have the same issue re: lots of failed reduce tasks.
>
> From the web UI, it looks like the jobs are failing in the shuffle
> phase. The shuffle phase for the failed attempts took about a third
> of the time of the successful attempts.
>
> I have also noted that in 0.19.0, my reduces often get started but
> then remain in the "unassigned" state for a long time before timing
> out. There is no evidence of these tasks in the local taskTracker dir.
>
> The latter problem sounds like HADOOP-5407, but is the former problem
> (reduces timing out) just a secondary symptom of HADOOP-5407? My
> TaskTrackers aren't hanging, though (other reduce tasks in the same
> job run to completion).
>
> - John
>
> On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das <d...@yahoo-inc.com> wrote:
>> Bryan, the message
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
>> configured local directories
>>
>> is spurious. It was reported in
>> https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is in
>> trunk. I guess I should commit that fix to the 0.20 and 0.19 branches
>> too. Meanwhile, please apply the patch to your repository if you can.
>>
>> Regarding the tasks timing out, do you know whether the reduce tasks
>> were in the shuffle phase or the reducer phase? You can deduce that by
>> looking at the task web UI for the failed tasks, or at the task logs.
>> Also, in your reduce method, do you ensure that progress reports are
>> sent every so often?
>> By default, progress reports are sent for every record-group that the
>> reducer method is invoked with, and for every record that the reducer
>> emits. If the timeout is not happening in the shuffle, then the
>> problematic part is the reduce method, where the timeout could happen
>> because a lot of time is spent processing a particular record-group,
>> or because the write of the output record to HDFS is taking a long
>> time.
>>
>> On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:
>>
>> (Repost from the dev list)
>>
>> I noticed some really odd behavior today while reviewing the job
>> history of some of our jobs. Our Ganglia graphs showed really long
>> periods of inactivity across the entire cluster, which should
>> definitely not be the case - we have a really long string of jobs in
>> our workflow that should execute one after another. I figured out
>> which jobs were running during those periods of inactivity and
>> discovered that almost all of them had 4-5 failed reduce tasks, with
>> the reason for failure being something like:
>>
>> Task attempt_200902061117_3382_r_000038_0 failed to report status for
>> 1282 seconds. Killing!
>>
>> The actual timeout reported varies from 700-5000 seconds. Virtually
>> all of our longer-running jobs were affected by this problem. The
>> period of inactivity on the cluster seems to correspond to the amount
>> of time the job waited for these reduce tasks to fail.
>>
>> I checked the TaskTracker log on the machines with timed-out reduce
>> tasks, looking for something that might explain the problem, but the
>> only thing I came up with that actually referenced the failed task
>> was this log message, which was repeated many times:
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
>> configured local directories
>>
>> I'm not sure what this means; can anyone shed some light on this
>> message?
>>
>> Further confusing the issue, on the affected machines I looked in
>> logs/userlogs/<task id>, and to my surprise the directory and log
>> files existed, and the syslog file seemed to contain the logs of a
>> perfectly good reduce task!
>>
>> Overall, this seems like a pretty critical bug. It's consuming up to
>> 50% of the runtime of our jobs in some instances, killing our
>> throughput. At the very least, it seems like the reduce task timeout
>> period should be MUCH shorter than the current 10-20 minutes.
>>
>> -Bryan
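
For anyone who lands on this thread with the timeout happening in the reduce phase rather than the shuffle: Devaraj's point about progress reports usually comes down to pinging the Reporter explicitly from inside a long reduce loop. Here is a rough, untested sketch against the old org.apache.hadoop.mapred API; the class name, key/value types, and the 10,000-record interval are only illustrative and not from the jobs discussed above.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SlowGroupReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    long seen = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      // Expensive per-record work would go here. Ping the framework
      // periodically so the attempt is not killed for failing to
      // report status while a single large record-group is processed.
      if (++seen % 10000 == 0) {
        reporter.progress();
        reporter.setStatus("processed " + seen + " values for " + key);
      }
    }
    output.collect(key, new LongWritable(sum));
  }
}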
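On Bryan's last point about waiting 10-20 minutes: as far as I know the "failed to report status" kill is driven by the mapred.task.timeout property (milliseconds, default 600000 in the 0.19/0.20 era), so it can at least be shortened per job while debugging. A hypothetical driver snippet, with the 5-minute value chosen purely as an example:

import org.apache.hadoop.mapred.JobConf;

public class ShortTimeoutDriver {
  public static void main(String[] args) {
    // Hypothetical driver: lower the task timeout so hung reduces are
    // killed after 5 minutes instead of the default 10.
    JobConf conf = new JobConf(ShortTimeoutDriver.class);
    conf.setLong("mapred.task.timeout", 5 * 60 * 1000L);
    System.out.println("mapred.task.timeout = " + conf.get("mapred.task.timeout"));
    // ... set mapper/reducer/paths here and submit with JobClient.runJob(conf)
  }
}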