I have a follow-up question to this: if I'm running a job in 'yarn-cluster' mode with HA and at some point the YARN application fails due to a hardware failure (i.e. the YARN application moves to the "FINISHED"/"FAILED" state), how can I restore the job from the most recent checkpoint?
I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to restore from a savepoint, but what if I haven't manually taken a savepoint recently?

Thanks,
Josh

On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org> wrote:
> Hi Anchit,
>
> The documentation mentions that you need Zookeeper in addition to setting the application attempts. Zookeeper is needed to retrieve the current leader for the client and to filter out old leaders in case multiple exist (old processes could even stay alive in Yarn). Moreover, it is needed to persist the state of the application.
>
> -Max
>
> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana <development.anc...@gmail.com> wrote:
> > Hi Maximilian,
> >
> > Thanks for your response. Since I'm running the application on a YARN cluster using 'yarn-cluster' mode, i.e. using the 'flink run -m yarn-cluster ..' command, is there anything more that I need to configure apart from setting the 'yarn.application-attempts: 10' property inside conf/flink-conf.yaml?
> >
> > Just wished to confirm if there is anything more that I need to configure to set up HA in 'yarn-cluster' mode.
> >
> > Thank you
> >
> > Regards,
> > Anchit
> >
> > --
> > View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Application-on-YARN-failed-on-losing-Job-Manager-No-recovery-Need-help-debug-the-cause-from-los-tp9839p9887.html
> > Sent from the Apache Flink User Mailing List archive at Nabble.com.