I agree about the timeout period, Bryan. Reporter has a progress() method to tell the framework (the tasktracker, not the namenode) that the task is still working, so there's no need for it to kill the job.
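
As a rough illustration (not from your thread; the class and the doExpensiveWork() helper below are made up), a reduce that does a lot of work per record with the old org.apache.hadoop.mapred API can call reporter.progress() inside its loop so the tasktracker keeps seeing activity. The timeout it resets is mapred.task.timeout, which defaults to 600000 ms (10 minutes):

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical reducer doing expensive per-value work. It calls
// reporter.progress() inside the loop so the tasktracker sees activity
// and doesn't kill the attempt after mapred.task.timeout milliseconds.
public class SlowReduceExample extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
                     OutputCollector<Text, LongWritable> output,
                     Reporter reporter) throws IOException {
    long sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      doExpensiveWork();     // stand-in for the slow part of the reduce
      reporter.progress();   // heartbeat: task is still alive
    }
    output.collect(key, new LongWritable(sum));
  }

  private void doExpensiveWork() {
    // placeholder for whatever takes longer than the task timeout
  }
}

Incrementing a counter or calling reporter.setStatus(...) should have the same effect of marking the task as live, if that fits your reduce better.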
2009/2/21 Bryan Duxbury <br...@rapleaf.com>

> We didn't customize this value, to my knowledge, so I'd suspect it's the
> default.
> -Bryan
>
> On Feb 20, 2009, at 5:00 PM, Ted Dunning wrote:
>
>> How often do your reduce tasks report status?
>>
>> On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:
>>
>>> (Repost from the dev list)
>>>
>>> I noticed some really odd behavior today while reviewing the job history
>>> of some of our jobs. Our Ganglia graphs showed really long periods of
>>> inactivity across the entire cluster, which should definitely not be the
>>> case - we have a really long string of jobs in our workflow that should
>>> execute one after another. I figured out which jobs were running during
>>> those periods of inactivity, and discovered that almost all of them had
>>> 4-5 failed reduce tasks, with the reason for failure being something like:
>>>
>>> Task attempt_200902061117_3382_r_000038_0 failed to report status for
>>> 1282 seconds. Killing!
>>>
>>> The actual timeout reported varies from 700-5000 seconds. Virtually all
>>> of our longer-running jobs were affected by this problem. The period of
>>> inactivity on the cluster seems to correspond to the amount of time the
>>> job waited for these reduce tasks to fail.
>>>
>>> I checked out the tasktracker log for the machines with timed-out reduce
>>> tasks looking for something that might explain the problem, but the only
>>> thing I came up with that actually referenced the failed task was this
>>> log message, which was repeated many times:
>>>
>>> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
>>> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
>>> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
>>> in any of the configured local directories
>>>
>>> I'm not sure what this means; can anyone shed some light on this message?
>>>
>>> Further confusing the issue, on the affected machines, I looked in
>>> logs/userlogs/<task id>, and to my surprise, the directory and log files
>>> existed, and the syslog file seemed to contain logs of a perfectly good
>>> reduce task!
>>>
>>> Overall, this seems like a pretty critical bug. It's consuming up to 50%
>>> of the runtime of our jobs in some instances, killing our throughput. At
>>> the very least, it seems like the reduce task timeout period should be
>>> MUCH shorter than the current 10-20 minutes.
>>>
>>> -Bryan
>>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>>
>> 111 West Evelyn Ave. Ste. 202
>> Sunnyvale, CA 94086
>> www.deepdyve.com
>> 408-773-0110 ext. 738
>> 858-414-0013 (m)
>> 408-773-0220 (fax)
>

--
M. Raşit ÖZDAŞ