I'm running a job that one stage with about 60k tasks. The stage was going pretty well until many of the executors were not running any tasks at around 35k tasks finished. It came to the point where only 4 executors are working on data, all 4 executors are running on the same host. With about 25k tasks still pending, running on only 32 cores (4x8) is quite a huge slowdown (full cluster is 288 cores).
My input format is a simple CombineFileInputFormat. I'm not sure what would cause this behavior to occur. Any thoughts on how I can figure out what is happening? Thanks, Pradeep