Hi Stefano,

Hadoop has supported this feature since version 2.6.0 [1]. You can define a validity interval for the maximum number of application attempts: only attempt failures that occur within that interval count towards the limit, so the application is only failed for good once it accumulates the configured number of attempt failures inside the interval. Flink activates this feature when you're running on Hadoop >= 2.6.0 and sets the failures validity interval to the akka.ask.timeout value (default: 10s).
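At the YARN API level this boils down to something like the following. This is just a sketch with illustrative values and a made-up class/method name; Flink's actual client code differs in the details:

    import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;

    class AttemptConfigSketch {
        // Sketch only: the Hadoop 2.6.0+ API from [1], with illustrative values.
        // A real client obtains the context via YarnClient#createApplication().
        static void configureAttempts(ApplicationSubmissionContext context) {
            // Allow up to 2 ApplicationMaster attempts ...
            context.setMaxAppAttempts(2);
            // ... but only count attempt failures that occur within a 10 second
            // window (Flink derives this window from akka.ask.timeout).
            context.setAttemptFailuresValidityInterval(10000L); // milliseconds
        }
    }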
[1] https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.html#setAttemptFailuresValidityInterval(long)

Cheers,
Till

On Tue, Apr 12, 2016 at 11:56 AM, Stefano Baghino <stefano.bagh...@radicalbit.io> wrote:

> Hi Ufuk, sorry for taking an awful lot of time to reply, but I fell behind
> with the ML in the last couple of weeks due to lack of time.
> First of all, thanks for taking the time to help me.
>
> Yes, what I was saying was that apparently from the code (and effectively,
> as we later found out after a couple of tests) the "upper bound" cited by
> the documentation seems invalid (meaning, the number of attempts is
> effectively regulated by the flink-conf.yaml and falls back to the
> yarn-site.xml only if missing).
>
> We're currently using Hadoop 2.7.1, so we'll try your solution, thanks.
>
> I was also wondering if there's a way to ask to retry indefinitely, so
> that a long-running streaming job can endure as many job manager failures
> as possible without ever needing human intervention to restart the
> YARN session.
>
> On Sat, Apr 2, 2016 at 3:53 PM, Ufuk Celebi <u...@apache.org> wrote:
>
>> Hey Stefano,
>>
>> yarn.resourcemanager.am.max-attempts is a setting for your YARN
>> cluster and cannot be influenced by Flink. Flink cannot set a higher
>> number than this for yarn.application-attempts.
>>
>> The key that is set/overridden by Flink is probably only valid for the
>> YARN session, but I'm not too familiar with the code. Maybe someone
>> else can chime in.
>>
>> I would recommend using a newer Hadoop version (>= 2.6), where you can
>> configure the failure validity interval, which counts the attempts per
>> time interval, e.g. it is allowed to fail 2 times within X seconds.
>> By default, the failure validity interval is configured to the Akka
>> timeout (which is 10s by default). I actually think it would make
>> sense to increase this a little and leave the attempts at 1 or 2 (in
>> the interval).
>>
>> Does this help?
>>
>> – Ufuk
>>
>>
>> On Fri, Apr 1, 2016 at 3:24 PM, Stefano Baghino
>> <stefano.bagh...@radicalbit.io> wrote:
>>> Hello everybody,
>>>
>>> I was asking myself: are there any best practices regarding how to set the
>>> `yarn.application-attempts` configuration key when running Flink on YARN as
>>> a long-running session? The configuration page on the docs states that 1 is
>>> the default and that it is recommended to leave it like that; however, in the
>>> case of a long-running session it seems to me that the value should be
>>> higher in order to actually allow the session to keep running despite Job
>>> Managers failing.
>>>
>>> Furthermore, the HA page on the docs states the following:
>>>
>>> """
>>> It's important to note that yarn.resourcemanager.am.max-attempts is an upper
>>> bound for the application restarts. Therefore, the number of application
>>> attempts set within Flink cannot exceed the YARN cluster setting with which
>>> YARN was started.
>>> """
>>>
>>> However, after some tests conducted by my colleagues and after looking at
>>> the code (FlinkYarnClientBase:522-536) it seems to me that the
>>> flink-conf.yaml key, if set, overrides the yarn-site.xml, which in turn
>>> overrides the fallback value of 1. Is this right? Is the documentation
>>> wrong?
>>>
>>> --
>>> BR,
>>> Stefano Baghino
>>>
>>> Software Engineer @ Radicalbit
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
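The attempt-count precedence Stefano describes around FlinkYarnClientBase:522-536 amounts to roughly the following. This is an illustrative sketch only, not Flink's actual code, and the class and parameter names are made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    class AttemptPrecedenceSketch {
        // Illustrative only: the flink-conf.yaml value (yarn.application-attempts)
        // wins if set, otherwise yarn-site.xml's yarn.resourcemanager.am.max-attempts,
        // otherwise the fallback of 1.
        // 'flinkConfiguredAttempts' stands in for the value read from flink-conf.yaml
        // (null when the key is not set); it is not a real Flink API.
        static int resolveAppAttempts(Integer flinkConfiguredAttempts, Configuration yarnSiteConf) {
            if (flinkConfiguredAttempts != null) {
                return flinkConfiguredAttempts;
            }
            // Fall back to the YARN cluster setting, defaulting to 1 if unset.
            return yarnSiteConf.getInt(YarnConfiguration.RM_AM_MAX_ATTEMPTS, 1);
        }
    }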