Hey Stefano, yarn.resourcemanager.am.max-attempts is a setting for your YARN cluster and cannot be influenced by Flink. Flink cannot set a higher number than this for yarn.application-attempts.
The key that is set/overriden by Flink is probably only valid for the YARN session, but I'm not too familiar with the code. Maybe someone else can chime in. I would recommend using a newer Hadoop version (>= 2.6), where you can configure the failure validity interval, which counts the attempts per time interval, e.g. it is allowed to fail 2 times within X seconds. Per default, the failure validity interval is configured to the Akka timeout (which is per default 10s). I actually think it would make sense to increase this a little and leave the attempts at 1 or 2 (in the interval). Does this help? – Ufuk On Fri, Apr 1, 2016 at 3:24 PM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote: > Hello everybody, > > I was asking myself: are there any best practices regarding how to set the > `yarn.application-attempts` configuration key when running Flink on YARN as > a long-running session? The configuration page on the docs states that 1 is > the default and that it is recommended to leave it like that, however in the > case of a long running session it seems to me that the value should be > higher in order to actually allow the session to keep running despite Job > Managers failing. > > Furthermore, the HA page on the docs states the following > > """ > It’s important to note that yarn.resourcemanager.am.max-attempts is an upper > bound for the application restarts. Therfore, the number of application > attempts set within Flink cannot exceed the YARN cluster setting with which > YARN was started. > """ > > However, after some tests conducted by my colleagues and after looking at > the code (FlinkYarnClientBase:522-536) it seems to me that the > flink-conf.yaml key, if set, overrides the yarn-site.xml, which in turn > overrides the fallback value of 1. Is this right? Is the documentation > wrong? > > -- > BR, > Stefano Baghino > > Software Engineer @ Radicalbit