How often do your reduce tasks report status? On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:
> (Repost from the dev list) > > > I noticed some really odd behavior today while reviewing the job history of > some of our jobs. Our Ganglia graphs showed really long periods of > inactivity across the entire cluster, which should definitely not be the > case - we have a really long string of jobs in our workflow that should > execute one after another. I figured out which jobs were running during > those periods of inactivity, and discovered that almost all of them had 4-5 > failed reduce tasks, with the reason for failure being something like: > > Task attempt_200902061117_3382_r_000038_0 failed to report status for 1282 > seconds. Killing! > > The actual timeout reported varies from 700-5000 seconds. Virtually all of > our longer-running jobs were affected by this problem. The period of > inactivity on the cluster seems to correspond to the amount of time the job > waited for these reduce tasks to fail. > > I checked out the tasktracker log for the machines with timed-out reduce > tasks looking for something that might explain the problem, but the only > thing I came up with that actually referenced the failed task was this log > message, which was repeated many times: > > 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker: > org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find > taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out > in any of the configured local directories > > I'm not sure what this means; can anyone shed some light on this message? > > Further confusing the issue, on the affected machines, I looked in > logs/userlogs/<task id>, and to my surprise, the directory and log files > existed, and the syslog file seemed to contain logs of a perfectly good > reduce task! > > Overall, this seems like a pretty critical bug. It's consuming up to 50% of > the runtime of our jobs in some instances, killing our throughput. At the > very least, it seems like the reduce task timeout period should be MUCH > shorter than the current 10-20 minutes. > > -Bryan > -- Ted Dunning, CTO DeepDyve 111 West Evelyn Ave. Ste. 202 Sunnyvale, CA 94086 www.deepdyve.com 408-773-0110 ext. 738 858-414-0013 (m) 408-773-0220 (fax)