Sorry, I should add the TaskTracker log messages I'm seeing relating to such a hung task:
2009-03-26 08:16:32,152 INFO org.apache.hadoop.mapred.TaskTracker: LaunchTaskAction (registerTask): attempt_200903111131_0766_r_000007_0
2009-03-26 08:16:32,986 INFO org.apache.hadoop.mapred.TaskTracker: Trying to launch : attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,146 INFO org.apache.hadoop.mapred.TaskTracker: In TaskLauncher, current free slots : 1 and trying to launch attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,361 INFO org.apache.hadoop.mapred.TaskTracker: attempt_200903111131_0766_r_000007_0: Task attempt_200903111131_0766_r_000007_0 failed to report status for 2127 seconds. Killing!
2009-03-26 08:51:59,374 INFO org.apache.hadoop.mapred.TaskTracker: About to purge task: attempt_200903111131_0766_r_000007_0
2009-03-26 08:51:59,375 INFO org.apache.hadoop.mapred.TaskRunner: attempt_200903111131_0766_r_000007_0 done; removing files.
2009-03-26 08:51:59,464 WARN org.apache.hadoop.mapred.TaskTracker: Unknown child task finshed: attempt_200903111131_0766_r_000007_0. Ignored.

And the relevant JobTracker logs:

2009-03-26 08:16:32,150 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'attempt_200903111131_0766_r_000007_0' to tip task_200903111131_0766_r_000007, for tracker 'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'
2009-03-26 08:52:04,365 INFO org.apache.hadoop.mapred.TaskInProgress: Error from attempt_200903111131_0766_r_000007_0: Task attempt_200903111131_0766_r_000007_0 failed to report status for 2127 seconds. Killing!
2009-03-26 08:52:04,367 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200903111131_0766_r_000007_0' from 'tracker_dev-01.33across.com:localhost/127.0.0.1:58102'

On Thu, Mar 26, 2009 at 11:51 AM, John Lee <j.benlin....@gmail.com> wrote:
> I have the same issue re: lots of failed reduce tasks.
>
> From the web UI, it looks like the jobs are failing in the shuffle
> phase. The shuffle phase for the failed attempts took about a third
> of the time of the successful attempts.
>
> I have also noted that in 0.19.0, my reduces often get started but
> then remain in the "unassigned" state for a long time before timing
> out. There is no evidence of these tasks in the local taskTracker dir.
>
> The latter problem sounds like HADOOP-5407, but is the former problem
> (reduces timing out) just a secondary symptom of HADOOP-5407? My
> TaskTrackers aren't hanging, though (other reduce tasks in the same
> job run to completion).
>
> - John
>
> On Sun, Feb 22, 2009 at 3:04 AM, Devaraj Das <d...@yahoo-inc.com> wrote:
>> Bryan, the message
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
>> configured local directories
>>
>> is spurious. It was reported in
>> https://issues.apache.org/jira/browse/HADOOP-4963 and the fix is in
>> trunk. I guess I should commit that fix to the 0.20 and 0.19 branches
>> too. Meanwhile, please apply the patch to your repository if you can.
>>
>> Regarding the tasks timing out, do you know whether the reduce tasks
>> were in the shuffle phase or the reducer phase? You can deduce that by
>> looking at the task web UI for the failed tasks, or at the task logs.
>> Also, in your reduce method, do you ensure that progress reports are
>> sent every so often?
>> By default, progress reports are sent for every record-group that the
>> reducer method is invoked with, and for every record that the reducer
>> emits. If the timeout is not happening in the shuffle, then the
>> problematic part is the reduce method, where the timeout could happen
>> because a lot of time is spent processing a particular record-group,
>> or because the write of the output record to HDFS is taking a long
>> time.
>>
>> On 2/21/09 5:28 AM, "Bryan Duxbury" <br...@rapleaf.com> wrote:
>>
>> (Repost from the dev list)
>>
>> I noticed some really odd behavior today while reviewing the job
>> history of some of our jobs. Our Ganglia graphs showed really long
>> periods of inactivity across the entire cluster, which should
>> definitely not be the case - we have a really long string of jobs in
>> our workflow that should execute one after another. I figured out
>> which jobs were running during those periods of inactivity and
>> discovered that almost all of them had 4-5 failed reduce tasks, with
>> the reason for failure being something like:
>>
>> Task attempt_200902061117_3382_r_000038_0 failed to report status for
>> 1282 seconds. Killing!
>>
>> The actual timeout reported varies from 700-5000 seconds. Virtually
>> all of our longer-running jobs were affected by this problem. The
>> period of inactivity on the cluster seems to correspond to the amount
>> of time the job waited for these reduce tasks to fail.
>>
>> I checked the TaskTracker log on the machines with timed-out reduce
>> tasks, looking for something that might explain the problem, but the
>> only thing I came up with that actually referenced the failed task
>> was this log message, which was repeated many times:
>>
>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>> taskTracker/jobcache/job_200902061117_3388/
>> attempt_200902061117_3388_r_000066_0/output/file.out in any of the
>> configured local directories
>>
>> I'm not sure what this means; can anyone shed some light on this
>> message?
>>
>> Further confusing the issue, on the affected machines I looked in
>> logs/userlogs/<task id>, and to my surprise the directory and log
>> files existed, and the syslog file seemed to contain the logs of a
>> perfectly good reduce task!
>>
>> Overall, this seems like a pretty critical bug. It's consuming up to
>> 50% of the runtime of our jobs in some instances, killing our
>> throughput. At the very least, it seems like the reduce task timeout
>> period should be MUCH shorter than the current 10-20 minutes.
>>
>> -Bryan
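
For anyone who lands on this thread with the timeout happening in the reduce phase rather than the shuffle: Devaraj's point about progress reports usually comes down to pinging the Reporter explicitly from inside a long reduce loop. Here is a rough, untested sketch against the old org.apache.hadoop.mapred API; the class name, key/value types, and the 10,000-record interval are only illustrative and not from the jobs discussed above.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SlowGroupReducer extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    long seen = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      // Expensive per-record work would go here. Ping the framework
      // periodically so the attempt is not killed for failing to
      // report status while a single large record-group is processed.
      if (++seen % 10000 == 0) {
        reporter.progress();
        reporter.setStatus("processed " + seen + " values for " + key);
      }
    }
    output.collect(key, new LongWritable(sum));
  }
}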
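On Bryan's last point about waiting 10-20 minutes: as far as I know the "failed to report status" kill is driven by the mapred.task.timeout property (milliseconds, default 600000 in the 0.19/0.20 era), so it can at least be shortened per job while debugging. A hypothetical driver snippet, with the 5-minute value chosen purely as an example:

import org.apache.hadoop.mapred.JobConf;

public class ShortTimeoutDriver {
  public static void main(String[] args) {
    // Hypothetical driver: lower the task timeout so hung reduces are
    // killed after 5 minutes instead of the default 10.
    JobConf conf = new JobConf(ShortTimeoutDriver.class);
    conf.setLong("mapred.task.timeout", 5 * 60 * 1000L);
    System.out.println("mapred.task.timeout = " + conf.get("mapred.task.timeout"));
    // ... set mapper/reducer/paths here and submit with JobClient.runJob(conf)
  }
}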