Hi Jeroen, I experienced a similar issue a few weeks ago. It turned out to be caused by a combination of speculative execution and OOM issues in the YARN containers.
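For context, by speculative execution I mean Spark's own spark.speculation mechanism. Here is a minimal sketch of how it might be turned on in a Scala job; the property names are the standard Spark ones, but the app name and values are just illustrative, not what we ran:

    // Minimal sketch: enabling Spark speculative execution (values are illustrative).
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("speculation-demo")                   // hypothetical app name
      .config("spark.speculation", "true")           // re-launch slow tasks elsewhere
      .config("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish before speculating
      .config("spark.speculation.multiplier", "1.5") // how much slower than the median a task must be
      .getOrCreate()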
First of all, when a task takes too much time in Spark and speculative execution is enabled, Spark launches a second attempt of that task, which YARN allocates in a new container. In our case, some tasks were throwing OOM exceptions while executing, not on the executor itself *but on the YARN container.*

It turns out that YARN will try several times to run an application when something inside it fails. Specifically, it will try *yarn.resourcemanager.am.max-attempts* times before giving up; the default value is 2 and EMR does not override it (check here <https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-common/yarn-default.xml>).

We were able to verify that, with speculative execution enabled, YARN containers that died while running speculative tasks each consumed one of those *max-attempts* retries. This wouldn't be a problem under normal behavior, but it seems that if all the retries are used up by a task that has started speculative execution, the application itself doesn't fail: it hangs on that task, expecting to reschedule it at some point. Since no attempts are left, the task is never rescheduled and the application never finishes.

I checked this theory repeatedly and always got the expected result: every time I changed that YARN setting, the job started speculative retries on the task and hung as soon as the number of broken YARN containers reached *max-attempts*. I personally think this issue should be reproducible even without speculative execution configured.

So, what would I do if I were you?

1. Check the number of tasks scheduled. If you see one (or more) tasks missing when you do the final sum, then you might be hitting this issue (see the sketch I've pasted after the quoted message below).
2. Check the *container* logs to see if anything broke. OOM is what failed for me.
3. Contact AWS EMR support, although in my experience they were of no help at all.

Hope this helps you a bit!

2017-12-28 14:57 GMT-03:00 Jeroen Miller <bluedasya...@gmail.com>:

> On 28 Dec 2017, at 17:41, Richard Qiao <richardqiao2...@gmail.com> wrote:
> > Are you able to specify which path of data filled up?
>
> I can narrow it down to a bunch of files but it's not so straightforward.
>
> > Any logs not rolled over?
>
> I have to manually terminate the cluster but there is nothing more in the
> driver's log when I check it from the AWS console when the cluster is still
> running.
>
> JM
>
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
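P.S. Regarding step 1 above, this is roughly how I would count scheduled vs. finished tasks. It is only a sketch, assuming a Scala job: the listener API is standard Spark, but the names and the wiring around it are illustrative.

    // Count every task the scheduler starts and every task that reports back;
    // a hanging task shows up as started > finished once everything else is done.
    import java.util.concurrent.atomic.AtomicLong
    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd, SparkListenerTaskStart}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("task-count-check").getOrCreate()

    val started = new AtomicLong(0)
    val finished = new AtomicLong(0)

    spark.sparkContext.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = started.incrementAndGet()
      override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = finished.incrementAndGet()
    })

    // ... run the job here ...

    println(s"tasks started=${started.get}, finished=${finished.get}")

You can of course read the same numbers off the stage pages in the Spark UI; the listener is just handy when the UI is gone because the cluster had to be terminated.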