If the configured ZooKeeper paths are still the same, the job should be
recovered automatically. On each submission a unique ZK namespace is used
based on the app ID. So you have in ZK: /flink/app_id/...

You would have to set that manually to resume an old application. You can do
this via the -z flag
(https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html).
Does this work?

On Fri, Nov 4, 2016 at 3:28 PM, Josh <jof...@gmail.com> wrote:
> Hi Ufuk,
>
> I see, but in my case the failure caused the YARN application to move into
> a finished/failed state - so the application itself is no longer running.
> How can I restart the application (or start a new YARN application) and
> ensure that it uses the checkpoint pointer stored in ZooKeeper?
>
> Thanks,
> Josh
>
> On Fri, Nov 4, 2016 at 1:52 PM, Ufuk Celebi <u...@apache.org> wrote:
>>
>> No, you don't need to manually trigger a savepoint. With HA, checkpoints
>> are persisted externally and a pointer is stored in ZooKeeper to recover
>> them after a JobManager failure.
>>
>> On Fri, Nov 4, 2016 at 2:27 PM, Josh <jof...@gmail.com> wrote:
>> > I have a follow-up question to this - if I'm running a job in
>> > 'yarn-cluster' mode with HA and then at some point the YARN application
>> > fails due to some hardware failure (i.e. the YARN application moves to
>> > "FINISHED"/"FAILED" state), how can I restore the job from the most
>> > recent checkpoint?
>> >
>> > I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to
>> > restore from a savepoint, but what if I haven't manually taken a
>> > savepoint recently?
>> >
>> > Thanks,
>> > Josh
>> >
>> > On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org>
>> > wrote:
>> >>
>> >> Hi Anchit,
>> >>
>> >> The documentation mentions that you need ZooKeeper in addition to
>> >> setting the application attempts. ZooKeeper is needed to retrieve the
>> >> current leader for the client and to filter out old leaders in case
>> >> multiple exist (old processes could even stay alive in YARN).
>> >> Moreover, it is needed to persist the state of the application.
>> >>
>> >> -Max
>> >>
>> >> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana
>> >> <development.anc...@gmail.com> wrote:
>> >> > Hi Maximilian,
>> >> >
>> >> > Thanks for your response. I'm running the application on a YARN
>> >> > cluster in 'yarn-cluster' mode, i.e. using the
>> >> > 'flink run -m yarn-cluster ..' command. Is there anything more that
>> >> > I need to configure apart from setting
>> >> > 'yarn.application-attempts: 10' inside conf/flink-conf.yaml?
>> >> >
>> >> > Just wished to confirm if there is anything more that I need to
>> >> > configure to set up HA in 'yarn-cluster' mode.
>> >> >
>> >> > Thank you
>> >> >
>> >> > Regards,
>> >> > Anchit
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Application-on-YARN-failed-on-losing-Job-Manager-No-recovery-Need-help-debug-the-cause-from-los-tp9839p9887.html
>> >> > Sent from the Apache Flink User Mailing List archive. mailing list
>> >> > archive at Nabble.com.
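[Editor's note: the thread mentions 'yarn.application-attempts' plus ZooKeeper but never shows the full configuration. A minimal sketch of the HA-related entries in conf/flink-conf.yaml, with key names as in the Flink 1.2 docs; the quorum hosts and storage directory are placeholder examples:]

```yaml
# conf/flink-conf.yaml -- HA on YARN (values are examples)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.storageDir: hdfs:///flink/recovery
yarn.application-attempts: 10
```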
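[Editor's note: the two recovery paths discussed in the thread can be sketched on the command line as below. This is a hedged sketch against the Flink 1.2 CLI; the application ID, savepoint path, and jar name are placeholders, not values from the thread.]

```shell
# Option 1: resume under the ZooKeeper namespace of the failed application,
# so HA recovery picks up the checkpoint pointer stored under
# /flink/<app_id>/... in ZooKeeper. <application_XXXX_YYYY> is a placeholder
# for the old YARN application ID.
flink run -m yarn-cluster -z <application_XXXX_YYYY> ./my-job.jar

# Option 2: restore explicitly from a savepoint (only works if one was
# taken; the s3:// path is an example from the thread).
flink run -m yarn-cluster -s s3://my-savepoints/id ./my-job.jar
```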