Hey Stefano,

yarn.resourcemanager.am.max-attempts is a setting for your YARN
cluster and cannot be influenced by Flink. Flink cannot set a higher
number than this for yarn.application-attempts.

The key that is set/overriden by Flink is probably only valid for the
YARN session, but I'm not too familiar with the code. Maybe someone
else can chime in.

I would recommend using a newer Hadoop version (>= 2.6), where you can
configure the failure validity interval, which counts the attempts per
time interval, e.g. it is allowed to fail 2 times within X seconds.
Per default, the failure validity interval is configured to the Akka
timeout (which is per default 10s). I actually think it would make
sense to increase this a little and leave the attempts at 1 or 2 (in
the interval).

Does this help?

– Ufuk


On Fri, Apr 1, 2016 at 3:24 PM, Stefano Baghino
<stefano.bagh...@radicalbit.io> wrote:
> Hello everybody,
>
> I was asking myself: are there any best practices regarding how to set the
> `yarn.application-attempts` configuration key when running Flink on YARN as
> a long-running session? The configuration page on the docs states that 1 is
> the default and that it is recommended to leave it like that, however in the
> case of a long running session it seems to me that the value should be
> higher in order to actually allow the session to keep running despite Job
> Managers failing.
>
> Furthermore, the HA page on the docs states the following
>
> """
> It’s important to note that yarn.resourcemanager.am.max-attempts is an upper
> bound for the application restarts. Therfore, the number of application
> attempts set within Flink cannot exceed the YARN cluster setting with which
> YARN was started.
> """
>
> However, after some tests conducted by my colleagues and after looking at
> the code (FlinkYarnClientBase:522-536) it seems to me that the
> flink-conf.yaml key, if set, overrides the yarn-site.xml, which in turn
> overrides the fallback value of 1. Is this right? Is the documentation
> wrong?
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit

Reply via email to