Hi Stefano,

Hadoop has supported this feature since version 2.6.0: you can define a time
interval within which a maximum number of application attempts is counted.
This means that this number of application failures has to be observed
within the interval before the application is failed for good. Flink
activates this feature if you're using Hadoop >= 2.6.0 and sets the
failures validity interval to the akka.ask.timeout value (default: 10 s) [1].
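
In case it helps to see the underlying mechanism, this is roughly how a
YARN client would set it through the Hadoop API (an illustrative sketch
only, not Flink's actual code; the class name and values are placeholders):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
  import org.apache.hadoop.yarn.client.api.YarnClient;
  import org.apache.hadoop.yarn.client.api.YarnClientApplication;

  public class ValidityIntervalSketch {
      public static void main(String[] args) throws Exception {
          YarnClient yarnClient = YarnClient.createYarnClient();
          yarnClient.init(new Configuration());
          yarnClient.start();

          YarnClientApplication app = yarnClient.createApplication();
          ApplicationSubmissionContext ctx =
              app.getApplicationSubmissionContext();

          // Allow up to 2 ApplicationMaster failures ...
          ctx.setMaxAppAttempts(2);
          // ... but only count failures observed within a 10 s window
          // (10000 ms), mirroring Flink's default of akka.ask.timeout = 10s.
          ctx.setAttemptFailuresValidityInterval(10000L);

          yarnClient.stop();
      }
  }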

[1]
https://hadoop.apache.org/docs/r2.6.0/api/org/apache/hadoop/yarn/api/records/ApplicationSubmissionContext.html#setAttemptFailuresValidityInterval(long)

Cheers,
Till

On Tue, Apr 12, 2016 at 11:56 AM, Stefano Baghino <
stefano.bagh...@radicalbit.io> wrote:

> Hi Ufuk, sorry for taking an awful lot of time to reply but I fell behind
> with the ML in the last couple of weeks due to lack of time.
> First of all, thanks for taking the time to help me.
>
> Yes, what I was saying was that apparently from the code (and effectively
> as we later found out after a couple of tests) the "upper bound" cited by
> the documentation seems invalid (meaning, the number of attempts is
> effectively regulated by the flink-conf.yaml and falls back to the
> yarn-site.xml only if missing).
>
> We're currently using Hadoop 2.7.1 so we'll try your solution, thanks.
>
> I was also wondering if there's a way to ask to retry indefinitely, so
> that a long-running streaming job can endure as many job manager failures
> as possible without ever needing human intervention to restart the
> YARN session.
>
> On Sat, Apr 2, 2016 at 3:53 PM, Ufuk Celebi <u...@apache.org> wrote:
>
>> Hey Stefano,
>>
>> yarn.resourcemanager.am.max-attempts is a setting for your YARN
>> cluster and cannot be influenced by Flink. Flink cannot set a higher
>> number than this for yarn.application-attempts.
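>>
>> To make the relationship concrete (illustrative values only; the Flink
>> value must not exceed the YARN one):
>>
>>   <!-- yarn-site.xml: cluster-wide upper bound on AM attempts -->
>>   <property>
>>     <name>yarn.resourcemanager.am.max-attempts</name>
>>     <value>4</value>
>>   </property>
>>
>>   # flink-conf.yaml: attempts requested for this YARN session (<= 4 here)
>>   yarn.application-attempts: 4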
>>
>> The key that is set/overridden by Flink is probably only valid for the
>> YARN session, but I'm not too familiar with the code. Maybe someone
>> else can chime in.
>>
>> I would recommend using a newer Hadoop version (>= 2.6), where you can
>> configure the failures validity interval, which counts the attempts per
>> time interval, e.g. the application is allowed to fail 2 times within X
>> seconds. By default, the failures validity interval is set to the Akka
>> timeout (which defaults to 10 s). I actually think it would make sense
>> to increase this a little and leave the attempts at 1 or 2 (within the
>> interval).
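>>
>> Something along these lines in flink-conf.yaml (hypothetical values, the
>> right interval depends on your setup):
>>
>>   # also used as the failures validity interval on YARN with Hadoop >= 2.6
>>   akka.ask.timeout: 60 s
>>   # attempts allowed within that interval
>>   yarn.application-attempts: 2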
>>
>> Does this help?
>>
>> – Ufuk
>>
>>
>> On Fri, Apr 1, 2016 at 3:24 PM, Stefano Baghino
>> <stefano.bagh...@radicalbit.io> wrote:
>> > Hello everybody,
>> >
>> > I was asking myself: are there any best practices regarding how to set
>> > the `yarn.application-attempts` configuration key when running Flink
>> > on YARN as a long-running session? The configuration page on the docs
>> > states that 1 is the default and that it is recommended to leave it
>> > like that; however, in the case of a long-running session it seems to
>> > me that the value should be higher in order to actually allow the
>> > session to keep running despite Job Managers failing.
>> >
>> > Furthermore, the HA page on the docs states the following
>> >
>> > """
>> > It’s important to note that yarn.resourcemanager.am.max-attempts is an
>> > upper bound for the application restarts. Therefore, the number of
>> > application attempts set within Flink cannot exceed the YARN cluster
>> > setting with which YARN was started.
>> > """
>> >
>> > However, after some tests conducted by my colleagues and after looking
>> > at the code (FlinkYarnClientBase:522-536) it seems to me that the
>> > flink-conf.yaml key, if set, overrides the yarn-site.xml, which in turn
>> > overrides the fallback value of 1. Is this right? Is the documentation
>> > wrong?
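>> >
>> > To illustrate the precedence we observed (a paraphrase only, not the
>> > actual FlinkYarnClientBase code; the variable names are made up):
>> >
>> >   // flink-conf.yaml key wins; otherwise fall back to the YARN setting;
>> >   // otherwise default to 1.
>> >   int attempts = flinkConfig.getInteger("yarn.application-attempts",
>> >       yarnConfiguration.getInt("yarn.resourcemanager.am.max-attempts", 1));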
>> >
>> > --
>> > BR,
>> > Stefano Baghino
>> >
>> > Software Engineer @ Radicalbit
>>
>
>
>
> --
> BR,
> Stefano Baghino
>
> Software Engineer @ Radicalbit
>
