I have this same issue re: lots of failed reduce tasks. From the web UI, it looks like the jobs are failing in the shuffle phase. The shuffle phase for the failed attempts took about a third of the time of the successful attempts.
I have also noted that in 0.19.0, my reduces often get started but then
remain in the "unassigned" state for a long time before timing out. There is
no evidence of these tasks in the local taskTracker dir. The latter problem
sounds like HADOOP-5407, but is the former problem (reduces timing out) just
a secondary symptom of HADOOP-5407? My TaskTrackers aren't hanging, though
(other reduce tasks in the same job run to completion).

- John

On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das <d...@yahoo-inc.com> wrote:

> Bryan, the message
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
> in any of the configured local directories
>
> is spurious. It was reported in
> https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is in trunk.
> I guess I should commit that fix to the 0.20 and 0.19 branches too.
> Meanwhile, please apply the patch to your repository if you can.
>
> Regarding the tasks timing out, do you know whether the reduce tasks were
> in the shuffle phase or the reducer phase? You can deduce that by looking
> at the task web UI for the failed tasks, or at the task logs.
>
> Also, in your reduce method, do you ensure that progress reports are sent
> every so often? By default, progress reports are sent for every
> record-group that the reduce method is invoked with, and for every record
> that the reducer emits. If the timeout is not happening in the shuffle,
> then the problematic part is the reduce method, where the timeout could be
> happening because a lot of time is spent processing a particular
> record-group, or because the write of the output record to HDFS is taking
> a long time.
>
> On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:
>
> (Repost from the dev list)
>
> I noticed some really odd behavior today while reviewing the job history
> of some of our jobs. Our Ganglia graphs showed really long periods of
> inactivity across the entire cluster, which should definitely not be the
> case - we have a really long string of jobs in our workflow that should
> execute one after another. I figured out which jobs were running during
> those periods of inactivity, and discovered that almost all of them had
> 4-5 failed reduce tasks, with the reason for failure being something like:
>
> Task attempt_200902061117_3382_r_000038_0 failed to report status for
> 1282 seconds. Killing!
>
> The actual timeout reported varies from 700-5000 seconds. Virtually all of
> our longer-running jobs were affected by this problem. The period of
> inactivity on the cluster seems to correspond to the amount of time the
> job waited for these reduce tasks to fail.
>
> I checked the tasktracker log for the machines with timed-out reduce
> tasks, looking for something that might explain the problem, but the only
> thing I came up with that actually referenced the failed task was this log
> message, repeated many times:
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
> in any of the configured local directories
>
> I'm not sure what this means; can anyone shed some light on this message?
>
> Further confusing the issue, on the affected machines, I looked in
> logs/userlogs/<task id>, and to my surprise, the directory and log files
> existed, and the syslog file seemed to contain logs of a perfectly good
> reduce task!
>
> Overall, this seems like a pretty critical bug. It's consuming up to 50%
> of the runtime of our jobs in some instances, killing our throughput. At
> the very least, it seems like the reduce task timeout period should be
> MUCH shorter than the current 10-20 minutes.
>
> -Bryan
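
For anyone else who lands on this thread: Devaraj's point above about
explicit progress reports is the usual fix when a single record-group keeps
the reduce method busy for longer than the task timeout. Here is a minimal
sketch against the 0.19-era mapred API; the class name, key/value types, and
the per-value "work" are placeholders I made up, not anything from this job:

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SlowGroupReducer extends MapReduceBase
        implements Reducer<Text, LongWritable, Text, LongWritable> {

      public void reduce(Text key, Iterator<LongWritable> values,
                         OutputCollector<Text, LongWritable> output,
                         Reporter reporter) throws IOException {
        long sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();  // stand-in for expensive per-value work
          // Tell the TaskTracker this attempt is still alive, even though
          // nothing has been emitted yet for this (possibly huge) key group.
          reporter.progress();
        }
        output.collect(key, new LongWritable(sum));
      }
    }

Calling reporter.setStatus(...) instead also keeps the attempt alive and has
the side benefit of being visible in the task web UI.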
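On Bryan's last point: the timeout is not fixed at 10-20 minutes; it is the
mapred.task.timeout property (in milliseconds; the stock default is 600000,
i.e. 10 minutes, if I remember right). A hedged per-job example, with a
made-up driver class name:

    import org.apache.hadoop.mapred.JobConf;

    public class TimeoutExample {
      public static void main(String[] args) {
        JobConf conf = new JobConf(TimeoutExample.class);
        // mapred.task.timeout is in milliseconds. Lowering it makes hung
        // attempts fail and get rescheduled sooner, at the risk of killing
        // slow-but-live reduces that never report progress.
        conf.setLong("mapred.task.timeout", 5 * 60 * 1000L);  // 5 minutes
        // ... set mapper, reducer, input/output paths, then submit the job.
      }
    }

The same property can also be set cluster-wide in the site configuration if
you want the shorter timeout everywhere.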