Hi Jeroen, I experienced a similar issue a few weeks ago. It turned out to be caused by a combination of speculative execution and OOM issues in the YARN containers.
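For context, by speculative execution I mean Spark's own spark.speculation mechanism. Here is a minimal sketch of how it might be turned on in a Scala job; the property names are the standard Spark ones, but the app name and values are just illustrative, not what we ran:

    // Minimal sketch: enabling Spark speculative execution (values are illustrative).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("speculation-demo")                   // hypothetical app name
      .config("spark.speculation", "true")           // re-launch slow tasks elsewhere
      .config("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before speculating
      .config("spark.speculation.multiplier", "1.5") // how much slower than the median a task must be
      .getOrCreate()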
First of all, when a task takes too much time in Spark and speculative execution is enabled, Spark launches a second attempt of that task, which YARN allocates in a new container. In our case, some tasks were throwing OOM exceptions while executing, not on the executor itself *but on the YARN container.*

It turns out that YARN will try several times to run an application when something inside it fails. Specifically, it will try *yarn.resourcemanager.am.max-attempts* times before giving up; the default value is 2 and EMR does not override it (check here <https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>).

We were able to verify that, with speculative execution enabled, YARN containers that died while running speculative tasks each consumed one of those *max-attempts* retries. This wouldn't be a problem under normal behavior, but it seems that if all the retries are used up by a task that has started speculative execution, the application itself doesn't fail: it hangs on that task, expecting to reschedule it at some point. Since no attempts are left, the task is never rescheduled and the application never finishes.

I checked this theory repeatedly and always got the expected result: every time I changed that YARN setting, the job started speculative retries on the task and hung as soon as the number of broken YARN containers reached *max-attempts*. I personally think this issue should be reproducible even without speculative execution configured.

So, what would I do if I were you?

1. Check the number of tasks scheduled. If you see one (or more) tasks missing when you do the final sum, then you might be hitting this issue (see the sketch I've pasted after the quoted message below).
2. Check the *container* logs to see if anything broke. OOM is what failed for me.
3. Contact AWS EMR support, although in my experience they were of no help at all.

Hope this helps you a bit!

2017-12-28 14:57 GMT-03:00 Jeroen Miller <bluedasya...@gmail.com>:

> On 28 Dec 2017, at 17:41, Richard Qiao <richardqiao2...@gmail.com> wrote:
> > Are you able to specify which path of data filled up?
>
> I can narrow it down to a bunch of files but it's not so straightforward.
>
> > Any logs not rolled over?
>
> I have to manually terminate the cluster but there is nothing more in the
> driver's log when I check it from the AWS console when the cluster is still
> running.
>
> JM
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
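P.S. Regarding step 1 above, this is roughly how I would count scheduled vs. finished tasks. It is only a sketch, assuming a Scala job: the listener API is standard Spark, but the names and the wiring around it are illustrative.

    // Count every task the scheduler starts and every task that reports back;
    // a hanging task shows up as started > finished once everything else is done.
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("task-count-check").getOrCreate()

    val started = new AtomicLong(0)
    val finished = new AtomicLong(0)

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = started.incrementAndGet()
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = finished.incrementAndGet()
    })

    // ... run the job here ...

    println(s"tasks started=${started.get}, finished=${finished.get}")

You can of course read the same numbers off the stage pages in the Spark UI; the listener is just handy when the UI is gone because the cluster had to be terminated.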