How often do your reduce tasks report status?
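
The "failed to report status" kill fires when a task goes longer than the task timeout without telling the TaskTracker it is alive (I believe the setting is mapred.task.timeout, in milliseconds, in Hadoop of this vintage). A reduce() call that grinds through a very large group of values without emitting output or reporting progress can trip it even though the task is healthy. Below is a minimal sketch of periodic reporting. The Progress interface is a hypothetical stand-in for the progress() method on org.apache.hadoop.mapred.Reporter so the example runs without Hadoop on the classpath, and the names ProgressSketch, sumWithProgress and interval are made up for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

public class ProgressSketch {

    /** Hypothetical stand-in for Reporter.progress() in org.apache.hadoop.mapred. */
    public interface Progress {
        void progress();
    }

    /**
     * Sums the values for one key and pings the reporter once every
     * `interval` values, so a very long group does not look hung.
     */
    public static long sumWithProgress(Iterator<Long> values, Progress reporter, int interval) {
        long sum = 0;
        long seen = 0;
        while (values.hasNext()) {
            sum += values.next();
            seen++;
            if (seen % interval == 0) {
                // In a real reducer this would be reporter.progress() on the Hadoop Reporter.
                reporter.progress();
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        final int[] pings = {0};
        long total = sumWithProgress(
                Arrays.asList(1L, 2L, 3L, 4L, 5L).iterator(),
                () -> pings[0]++,
                2);
        System.out.println("sum=" + total + " pings=" + pings[0]);
    }
}
```

In a real job the same pattern would sit inside reduce(), with the interval chosen so that pings arrive far more often than the timeout.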

On Fri, Feb 20, 2009 at 3:58 PM, Bryan Duxbury <br...@rapleaf.com> wrote:

> (Repost from the dev list)
>
>
> I noticed some really odd behavior today while reviewing the job history of
> some of our jobs. Our Ganglia graphs showed really long periods of
> inactivity across the entire cluster, which should definitely not be the
> case - we have a really long string of jobs in our workflow that should
> execute one after another. I figured out which jobs were running during
> those periods of inactivity, and discovered that almost all of them had 4-5
> failed reduce tasks, with the reason for failure being something like:
>
> Task attempt_200902061117_3382_r_000038_0 failed to report status for 1282
> seconds. Killing!
>
> The actual timeout reported varies from 700 to 5000 seconds. Virtually all of
> our longer-running jobs were affected by this problem. The period of
> inactivity on the cluster seems to correspond to the amount of time the job
> waited for these reduce tasks to fail.
>
> I checked out the tasktracker log for the machines with timed-out reduce
> tasks looking for something that might explain the problem, but the only
> thing I came up with that actually referenced the failed task was this log
> message, which was repeated many times:
>
> 2009-02-19 22:48:19,380 INFO org.apache.hadoop.mapred.TaskTracker:
> org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find
> taskTracker/jobcache/job_200902061117_3388/attempt_200902061117_3388_r_000066_0/output/file.out
> in any of the configured local directories
>
> I'm not sure what this means; can anyone shed some light on this message?
>
> Further confusing the issue, on the affected machines, I looked in
> logs/userlogs/<task id>, and to my surprise, the directory and log files
> existed, and the syslog file seemed to contain logs of a perfectly good
> reduce task!
>
> Overall, this seems like a pretty critical bug. It's consuming up to 50% of
> the runtime of our jobs in some instances, killing our throughput. At the
> very least, it seems like the reduce task timeout period should be MUCH
> shorter than the current 10-20 minutes.
>
> -Bryan
>



-- 
Ted Dunning, CTO
DeepDyve

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
www.deepdyve.com
408-773-0110 ext. 738
858-414-0013 (m)
408-773-0220 (fax)
