If the configured ZooKeeper paths are still the same, the job should be
recovered automatically. On each submission a unique ZK namespace is used
based on the app ID. So you have in ZK: /flink/app_id/...

You would have to set that manually to resume an old application. You can do
this via the -z flag
(https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html).
Does this work?

On Fri, Nov 4, 2016 at 3:28 PM, Josh <jof...@gmail.com> wrote:
> Hi Ufuk,
>
> I see, but in my case the failure caused the YARN application to move into
> a finished/failed state - so the application itself is no longer running.
> How can I restart the application (or start a new YARN application) and
> ensure that it uses the checkpoint pointer stored in ZooKeeper?
>
> Thanks,
> Josh
>
> On Fri, Nov 4, 2016 at 1:52 PM, Ufuk Celebi <u...@apache.org> wrote:
>>
>> No, you don't need to manually trigger a savepoint. With HA, checkpoints
>> are persisted externally and a pointer is stored in ZooKeeper to recover
>> them after a JobManager failure.
>>
>> On Fri, Nov 4, 2016 at 2:27 PM, Josh <jof...@gmail.com> wrote:
>> > I have a follow-up question to this - if I'm running a job in
>> > 'yarn-cluster' mode with HA and then at some point the YARN application
>> > fails due to some hardware failure (i.e. the YARN application moves to
>> > "FINISHED"/"FAILED" state), how can I restore the job from the most
>> > recent checkpoint?
>> >
>> > I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to
>> > restore from a savepoint, but what if I haven't manually taken a
>> > savepoint recently?
>> >
>> > Thanks,
>> > Josh
>> >
>> > On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org>
>> > wrote:
>> >>
>> >> Hi Anchit,
>> >>
>> >> The documentation mentions that you need ZooKeeper in addition to
>> >> setting the application attempts. ZooKeeper is needed to retrieve the
>> >> current leader for the client and to filter out old leaders in case
>> >> multiple exist (old processes could even stay alive in YARN).
>> >> Moreover, it is needed to persist the state of the application.
>> >>
>> >> -Max
>> >>
>> >> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana
>> >> <development.anc...@gmail.com> wrote:
>> >> > Hi Maximilian,
>> >> >
>> >> > Thanks for your response. I'm running the application on a YARN
>> >> > cluster in 'yarn-cluster' mode, i.e. using the
>> >> > 'flink run -m yarn-cluster ..' command. Is there anything more that
>> >> > I need to configure apart from setting
>> >> > 'yarn.application-attempts: 10' inside conf/flink-conf.yaml?
>> >> >
>> >> > Just wished to confirm if there is anything more that I need to
>> >> > configure to set up HA in 'yarn-cluster' mode.
>> >> >
>> >> > Thank you
>> >> >
>> >> > Regards,
>> >> > Anchit
>> >> >
>> >> > --
>> >> > View this message in context:
>> >> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Application-on-YARN-failed-on-losing-Job-Manager-No-recovery-Need-help-debug-the-cause-from-los-tp9839p9887.html
>> >> > Sent from the Apache Flink User Mailing List archive. mailing list
>> >> > archive at Nabble.com.
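[Editor's note: the thread mentions 'yarn.application-attempts' plus ZooKeeper but never shows the full configuration. A minimal sketch of the HA-related entries in conf/flink-conf.yaml, with key names as in the Flink 1.2 docs; the quorum hosts and storage directory are placeholder examples:]

```yaml
# conf/flink-conf.yaml -- HA on YARN (values are examples)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.storageDir: hdfs:///flink/recovery
yarn.application-attempts: 10
```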
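[Editor's note: the two recovery paths discussed in the thread can be sketched on the command line as below. This is a hedged sketch against the Flink 1.2 CLI; the application ID, savepoint path, and jar name are placeholders, not values from the thread.]

```shell
# Option 1: resume under the ZooKeeper namespace of the failed application,
# so HA recovery picks up the checkpoint pointer stored under
# /flink/<app_id>/... in ZooKeeper. <application_XXXX_YYYY> is a placeholder
# for the old YARN application ID.
flink run -m yarn-cluster -z <application_XXXX_YYYY> ./my-job.jar

# Option 2: restore explicitly from a savepoint (only works if one was
# taken; the s3:// path is an example from the thread).
flink run -m yarn-cluster -s s3://my-savepoints/id ./my-job.jar
```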