Hi Ufuk, sorry for taking so long to reply, but I fell behind with the
mailing list over the last couple of weeks due to lack of time.
First of all, thanks for taking the time to help me.

Yes, what I was saying is that, judging from the code (and as we later
confirmed with a couple of tests), the "upper bound" cited by the
documentation seems invalid: the number of attempts is effectively
regulated by flink-conf.yaml and falls back to the yarn-site.xml value
only if missing.
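
To make the observed precedence concrete, this is roughly the
combination we tested; the values below are just placeholders, not
recommendations:

    # flink-conf.yaml -- when set, this was the value that actually applied
    yarn.application-attempts: 4

    <!-- yarn-site.xml -- the cluster setting the docs call an upper bound -->
    <property>
      <name>yarn.resourcemanager.am.max-attempts</name>
      <value>2</value>
    </property>

With both in place, the session was restarted according to the
flink-conf.yaml value rather than being capped at the yarn-site.xml one.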

We're currently using Hadoop 2.7.1, so we'll try your suggestion, thanks.
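
If I understand the suggestion correctly, since in our Flink version the
failure validity interval is derived from the Akka timeout, the setup
would look something like the sketch below (the key names come from the
Flink configuration docs, but the values are guesses on my part, not a
tested setup):

    # flink-conf.yaml -- sketch of the suggested approach, untested
    yarn.application-attempts: 2   # allowed failures within the interval
    akka.ask.timeout: 60 s         # widening this should also widen the
                                   # failure validity interval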

I was also wondering if there's a way to ask YARN to retry indefinitely, so
that a long-running streaming job can survive as many job manager failures
as possible without ever needing human intervention to restart the YARN
session.
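
I don't know whether an unbounded number of attempts is supported at
all, but I imagine it could be approximated by combining the failure
validity interval with an attempt count that is unlikely to be exhausted
within a single interval. Purely a hypothetical sketch on my part:

    # flink-conf.yaml -- hypothetical "retry indefinitely" approximation
    yarn.application-attempts: 10  # high enough that the validity interval
                                   # resets the count before it runs out
    akka.ask.timeout: 60 s         # interval within which failures count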

On Sat, Apr 2, 2016 at 3:53 PM, Ufuk Celebi <u...@apache.org> wrote:

> Hey Stefano,
>
> yarn.resourcemanager.am.max-attempts is a setting for your YARN
> cluster and cannot be influenced by Flink. Flink cannot set a higher
> number than this for yarn.application-attempts.
>
> The key that is set/overridden by Flink is probably only valid for the
> YARN session, but I'm not too familiar with the code. Maybe someone
> else can chime in.
>
> I would recommend using a newer Hadoop version (>= 2.6), where you can
> configure the failure validity interval, which counts the attempts per
> time interval, e.g. it is allowed to fail 2 times within X seconds.
> By default, the failure validity interval is configured to the Akka
> timeout (which is 10s by default). I actually think it would make
> sense to increase this a little and leave the attempts at 1 or 2 (in
> the interval).
>
> Does this help?
>
> – Ufuk
>
>
> On Fri, Apr 1, 2016 at 3:24 PM, Stefano Baghino
> <stefano.bagh...@radicalbit.io> wrote:
> > Hello everybody,
> >
> > I was asking myself: are there any best practices regarding how to set
> > the `yarn.application-attempts` configuration key when running Flink on
> > YARN as a long-running session? The configuration page on the docs
> > states that 1 is the default and that it is recommended to leave it like
> > that; however, in the case of a long-running session it seems to me that
> > the value should be higher in order to actually allow the session to
> > keep running despite Job Managers failing.
> >
> > Furthermore, the HA page on the docs states the following
> >
> > """
> > It’s important to note that yarn.resourcemanager.am.max-attempts is an
> > upper bound for the application restarts. Therefore, the number of
> > application attempts set within Flink cannot exceed the YARN cluster
> > setting with which YARN was started.
> > """
> >
> > However, after some tests conducted by my colleagues and after looking at
> > the code (FlinkYarnClientBase:522-536) it seems to me that the
> > flink-conf.yaml key, if set, overrides the yarn-site.xml, which in turn
> > overrides the fallback value of 1. Is this right? Is the documentation
> > wrong?
> >
> > --
> > BR,
> > Stefano Baghino
> >
> > Software Engineer @ Radicalbit
>



-- 
BR,
Stefano Baghino

Software Engineer @ Radicalbit
