I'm running a job that one stage with about 60k tasks. The stage was going
pretty well until many of the executors were not running any tasks at
around 35k tasks finished. It came to the point where only 4 executors are
working on data, all 4 executors are running on the same host. With about
25k tasks still pending, running on only 32 cores (4x8) is quite a huge
slowdown (full cluster is 288 cores).

My input format is a simple CombineFileInputFormat. I'm not sure what would
cause this behavior to occur. Any thoughts on how I can figure out what is
happening?

Thanks,
Pradeep

Reply via email to