Hi Ufuk, sorry for taking so long to reply; I fell behind with the ML over the last couple of weeks. First of all, thanks for taking the time to help me.
Yes, what I meant is that, judging from the code (and as we later confirmed with a couple of tests), the "upper bound" cited by the documentation doesn't seem to hold: the number of attempts is effectively governed by the flink-conf.yaml key and falls back to yarn-site.xml only when that key is missing. We're currently using Hadoop 2.7.1, so we'll try your suggestion, thanks. I was also wondering whether there's a way to make Flink retry indefinitely, so that a long-running streaming job can endure as many job manager failures as possible without ever needing human intervention to restart the YARN session. I've put a sketch of the configuration we're planning to test below the quoted thread.

On Sat, Apr 2, 2016 at 3:53 PM, Ufuk Celebi <u...@apache.org> wrote:
> Hey Stefano,
>
> yarn.resourcemanager.am.max-attempts is a setting for your YARN
> cluster and cannot be influenced by Flink. Flink cannot set a higher
> number than this for yarn.application-attempts.
>
> The key that is set/overridden by Flink is probably only valid for the
> YARN session, but I'm not too familiar with the code. Maybe someone
> else can chime in.
>
> I would recommend using a newer Hadoop version (>= 2.6), where you can
> configure the failure validity interval, which counts the attempts per
> time interval, e.g. it is allowed to fail 2 times within X seconds.
> Per default, the failure validity interval is configured to the Akka
> timeout (which is per default 10s). I actually think it would make
> sense to increase this a little and leave the attempts at 1 or 2 (in
> the interval).
>
> Does this help?
>
> – Ufuk
>
>
> On Fri, Apr 1, 2016 at 3:24 PM, Stefano Baghino
> <stefano.bagh...@radicalbit.io> wrote:
> > Hello everybody,
> >
> > I was asking myself: are there any best practices regarding how to set the
> > `yarn.application-attempts` configuration key when running Flink on YARN as
> > a long-running session? The configuration page in the docs states that 1 is
> > the default and that it is recommended to leave it like that; however, in
> > the case of a long-running session it seems to me that the value should be
> > higher in order to actually allow the session to keep running despite Job
> > Managers failing.
> >
> > Furthermore, the HA page in the docs states the following:
> >
> > """
> > It’s important to note that yarn.resourcemanager.am.max-attempts is an
> > upper bound for the application restarts. Therefore, the number of
> > application attempts set within Flink cannot exceed the YARN cluster
> > setting with which YARN was started.
> > """
> >
> > However, after some tests conducted by my colleagues and after looking at
> > the code (FlinkYarnClientBase:522-536) it seems to me that the
> > flink-conf.yaml key, if set, overrides the yarn-site.xml, which in turn
> > overrides the fallback value of 1. Is this right? Is the documentation
> > wrong?
> >
> > --
> > BR,
> > Stefano Baghino
> >
> > Software Engineer @ Radicalbit

--
BR,
Stefano Baghino

Software Engineer @ Radicalbit
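P.S. For concreteness, here is roughly the combination we're planning to test. I'm taking the key names from the docs and from your message, and I haven't verified the validity-interval behaviour on our cluster yet, so please treat this as a sketch rather than a confirmed recipe.

In yarn-site.xml (the cluster-wide upper bound you mentioned):

    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>10</value>
    </property>

In flink-conf.yaml (per-session attempts, within that bound):

    yarn.application-attempts: 10

If I understood your point about Hadoop >= 2.6 correctly, the attempts are then only counted within the failure validity interval, which Flink derives from the Akka ask timeout (10 s by default), so raising it slightly, e.g.

    akka.ask.timeout: 60 s

should mean the session can keep restarting indefinitely, as long as it doesn't fail more than the configured number of times within any one interval. Does that match your understanding?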