I have this same issue re: lots of failed reduce tasks.

From the WebUI, it looks like the jobs are failing in the shuffle
phase. The shuffle phase for the failed attempts took about a third
of the time of the successful attempts.

I have also noticed that in 0.19.0, my reduces often get started but
then remain in the "unassigned" state for a long time before timing
out. There is no evidence of these tasks in the local taskTracker dir.

The latter problem sounds like HADOOP-5407, but is the former problem
(reduces timing out) just a secondary symptom of the same bug? My
TaskTrackers aren't hanging, though (other reduce tasks in the same
job run to completion).

- John

On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das <d...@yahoo-inc.com> wrote:
> Bryan, the message
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/
> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
> configured local directories
>
> is spurious. That was reported in
> https://issues.apache.org/jira/browse/HADOOP-4963, and the fix is in
> trunk. I guess I should commit that fix to the 0.20 and 0.19 branches too.
> In the meantime, please apply the patch to your tree if you can.
> Regarding the tasks timing out, do you know whether the reduce tasks were in
> the shuffle phase or the reduce phase? You can deduce that from the task web
> UI for the failed tasks, or from the task logs.
> Also, does your reduce method ensure that progress reports are sent every so
> often? By default, a progress report is sent for every record-group that the
> reduce method is invoked with, and for every record that the reducer emits.
> If the timeout is not happening in the shuffle, then the problematic part is
> the reduce method itself: the timeout could be happening because a lot of
> time is spent processing a particular record-group, or because the write of
> an output record to HDFS is taking a long time.
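
For reference, explicit progress reporting from a reduce method looks
roughly like this (old org.apache.hadoop.mapred API; the summing logic
below is just a placeholder for your own processing):

    // Imports assumed: java.io.IOException, java.util.Iterator,
    // org.apache.hadoop.io.*, org.apache.hadoop.mapred.*
    public static class SumReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output,
                         Reporter reporter) throws IOException {
        int sum = 0;
        long seen = 0;
        while (values.hasNext()) {
          sum += values.next().get();
          // If a single record-group takes a long time to process,
          // ping the TaskTracker periodically so the attempt isn't
          // killed for inactivity.
          if (++seen % 10000 == 0) {
            reporter.progress();
          }
        }
        output.collect(key, new IntWritable(sum));
      }
    }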
>
>
> On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:
>
> (Repost from the dev list)
>
> I noticed some really odd behavior today while reviewing the job
> history of some of our jobs. Our Ganglia graphs showed long periods
> of inactivity across the entire cluster, which should definitely not
> be the case - we have a long string of jobs in our workflow that
> should execute one after another. I figured out which jobs were
> running during those periods of inactivity and discovered that almost
> all of them had 4-5 failed reduce tasks, with the reason for failure
> being something like:
>
> Task attempt_200902061117_3382_r_000038_0 failed to report status for
> 1282 seconds. Killing!
>
> The actual timeout reported varies from 700 to 5000 seconds.
> Virtually all of our longer-running jobs were affected by this
> problem. The period of inactivity on the cluster seems to correspond
> to the amount of time the job waited for these reduce tasks to fail.
>
> I checked the TaskTracker logs on the machines with timed-out reduce
> tasks, looking for something that might explain the problem, but the
> only thing I found that actually referenced the failed task was this
> log message, which was repeated many times:
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/
> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
> configured local directories
>
> I'm not sure what this means; can anyone shed some light on this
> message?
>
> Further confusing the issue: on the affected machines, I looked in
> logs/userlogs/<task id>, and to my surprise the directory and log
> files existed, and the syslog file seemed to contain the logs of a
> perfectly good reduce task!
>
> Overall, this seems like a pretty critical bug. It's consuming up to
> 50% of the runtime of our jobs in some instances, killing our
> throughput. At the very least, it seems like the reduce task timeout
> period should be MUCH shorter than the current 10-20 minutes.
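
For what it's worth, the inactivity timeout is configurable: it is
controlled by mapred.task.timeout, in milliseconds (the stock default
is 600000, i.e. 10 minutes). A sketch of lowering it for a single job
with the old JobConf API (MyJob here is a placeholder for your own
job class):

    // Kill task attempts that report no progress for 5 minutes
    // instead of the default 10. Setting the value to 0 disables
    // the timeout entirely.
    JobConf conf = new JobConf(MyJob.class);
    conf.setLong("mapred.task.timeout", 300000L);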
>
> -Bryan
>
>
