Bryan, the message

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200902061117_3388/
attempt_200902061117_3388_r_000066_0/output/file.out in any of the
configured local directories

is spurious. That was reported in
https://issues.apache.org/jira/browse/HADOOP-4963, and the fix is already in
trunk. I guess I should commit that fix to the 0.20 and 0.19 branches too.
Meanwhile, please apply the patch to your repository if you can.
Regarding the tasks timing out: do you know whether the reduce tasks were in
the shuffle phase or the reduce phase? You can deduce that from the task web UI
for the failed tasks, or from the task logs.
Also, does your reduce method ensure that progress reports are sent every so
often? By default, a progress report is sent for every record group the reduce
method is invoked with and for every record the reducer emits. If the timeout
is not happening in the shuffle, then the problematic part is the reduce method
itself, where the timeout could occur because a lot of time is spent processing
a particular record group, or because the write of an output record to HDFS is
taking a long time.
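To make that concrete, here is a minimal sketch of calling Reporter.progress()
from a long-running reduce, against the old org.apache.hadoop.mapred API. The
class name, key/value types, and the "expensive work" are made up for
illustration, not taken from your job:

// Hypothetical reducer showing explicit progress reporting (old mapred API).
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SlowGroupReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long sum = 0;
    int seen = 0;
    while (values.hasNext()) {
      sum += values.next().get();   // stand-in for expensive per-record work
      if (++seen % 1000 == 0) {
        reporter.progress();        // tell the TaskTracker we are still alive
      }
    }
    output.collect(key, new LongWritable(sum)); // emitting also reports progress
  }
}

Calling progress() every few hundred or thousand records is cheap and resets
the no-progress timer, so even a group that takes many minutes to process
should not get the task killed.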


On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:

(Repost from the dev list)

I noticed some really odd behavior today while reviewing the job
history of some of our jobs. Our Ganglia graphs showed really long
periods of inactivity across the entire cluster, which should
definitely not be the case - we have a really long string of jobs in
our workflow that should execute one after another. I figured out
which jobs were running during those periods of inactivity, and
discovered that almost all of them had 4-5 failed reduce tasks, with
the reason for failure being something like:

Task attempt_200902061117_3382_r_000038_0 failed to report status for
1282 seconds. Killing!

The actual timeout reported varies from 700-5000 seconds. Virtually
all of our longer-running jobs were affected by this problem. The
period of inactivity on the cluster seems to correspond to the amount
of time the job waited for these reduce tasks to fail.

I checked the TaskTracker logs on the machines with timed-out reduce
tasks, looking for something that might explain the problem, but the
only thing I found that actually referenced the failed task was this
log message, repeated many times:

2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
taskTracker/jobcache/job_200902061117_3388/
attempt_200902061117_3388_r_000066_0/output/file.out in any of the
configured local directories

I'm not sure what this means; can anyone shed some light on this
message?

Further confusing the issue, on the affected machines, I looked in
logs/userlogs/<task id>, and to my surprise, the directory and log
files existed, and the syslog file seemed to contain logs of a
perfectly good reduce task!

Overall, this seems like a pretty critical bug. It's consuming up to
50% of the runtime of our jobs in some instances, killing our
throughput. At the very least, it seems like the reduce task timeout
period should be MUCH shorter than the current 10-20 minutes.
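For what it's worth, the timeout looks tunable per job; a minimal sketch,
assuming the 0.19/0.20-era property name mapred.task.timeout (value in
milliseconds), rather than anything specific to our jobs:

// Hypothetical job setup lowering the task timeout.
import org.apache.hadoop.mapred.JobConf;

public class TimeoutConfig {
  public static JobConf withShortTimeout(JobConf conf) {
    // Kill tasks that report no progress for 5 minutes instead of the default 10.
    conf.setLong("mapred.task.timeout", 5 * 60 * 1000L);
    return conf;
  }
}

That would at least cap how long a stuck reducer can stall the rest of the
workflow, though it doesn't explain why the tasks stop reporting in the first
place.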

-Bryan
