Thanks, I didn't know about the -z flag! I haven't been able to get it to work though (using yarn-cluster, with a ZooKeeper root configured to /flink in my flink-conf.yaml).
I can see my job directory in ZK under /flink/application_1477475694024_0015 and I've tried a few ways to restore the job:

./bin/flink run -m yarn-cluster -yz /application_1477475694024_0015 ....
./bin/flink run -m yarn-cluster -yz application_1477475694024_0015 ....
./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015/ ....
./bin/flink run -m yarn-cluster -yz /flink/application_1477475694024_0015 ....

The job starts from scratch each time, without restored state.

Am I doing something wrong? I've also tried with -z instead of -yz but I'm using yarn-cluster to run a single job, so I think it should be -yz.

On Fri, Nov 4, 2016 at 2:33 PM, Ufuk Celebi <u...@apache.org> wrote:

> If the configured ZooKeeper paths are still the same, the job should
> be recovered automatically. On each submission a unique ZK namespace
> is used based on the app ID.
>
> So you have in ZK:
> /flink/app_id/...
>
> You would have to set that manually to resume an old application. You
> can do this via -z flag
> (https://ci.apache.org/projects/flink/flink-docs-release-1.2/setup/cli.html).
>
> Does this work?
>
> On Fri, Nov 4, 2016 at 3:28 PM, Josh <jof...@gmail.com> wrote:
> > Hi Ufuk,
> >
> > I see, but in my case the failure caused YARN application moved into a
> > finished/failed state - so the application itself is no longer running. How
> > can I restart the application (or start a new YARN application) and ensure
> > that it uses the checkpoint pointer stored in Zookeeper?
> >
> > Thanks,
> > Josh
> >
> > On Fri, Nov 4, 2016 at 1:52 PM, Ufuk Celebi <u...@apache.org> wrote:
> >>
> >> No you don't need to manually trigger a savepoint. With HA checkpoints
> >> are persisted externally and store a pointer in ZooKeeper to recover
> >> them after a JobManager failure.
> >>
> >> On Fri, Nov 4, 2016 at 2:27 PM, Josh <jof...@gmail.com> wrote:
> >> > I have a follow up question to this - if I'm running a job in 'yarn-cluster'
> >> > mode with HA and then at some point the YARN application fails due to some
> >> > hardware failure (i.e. the YARN application moves to "FINISHED"/"FAILED"
> >> > state), how can I restore the job from the most recent checkpoint?
> >> >
> >> > I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to
> >> > restore from a savepoint, but what if I haven't manually taken a savepoint
> >> > recently?
> >> >
> >> > Thanks,
> >> > Josh
> >> >
> >> > On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org> wrote:
> >> >>
> >> >> Hi Anchit,
> >> >>
> >> >> The documentation mentions that you need Zookeeper in addition to
> >> >> setting the application attempts. Zookeeper is needed to retrieve the
> >> >> current leader for the client and to filter out old leaders in case
> >> >> multiple exist (old processes could even stay alive in Yarn). Moreover, it
> >> >> is needed to persist the state of the application.
> >> >>
> >> >>
> >> >> -Max
> >> >>
> >> >>
> >> >> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana
> >> >> <development.anc...@gmail.com> wrote:
> >> >> > Hi Maximilian,
> >> >> >
> >> >> > Thanks for you response. Since, I'm running the application on YARN cluster
> >> >> > using 'yarn-cluster' mode i.e. using 'flink run -m yarn-cluster ..' command.
> >> >> > Is there anything more that I need to configure apart from setting up
> >> >> > 'yarn.application-attempts: 10' property inside conf/flink-conf.yaml.
> >> >> >
> >> >> > Just wished to confirm if there is anything more that I need to configure to
> >> >> > set up HA on 'yarn-cluster' mode.
> >> >> >
> >> >> > Thank you
> >> >> >
> >> >> > Regards,
> >> >> > Anchit
> >> >> >
> >> >> > --
> >> >> > View this message in context:
> >> >> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Application-on-YARN-failed-on-losing-Job-Manager-No-recovery-Need-help-debug-the-cause-from-los-tp9839p9887.html
> >> >> > Sent from the Apache Flink User Mailing List archive. mailing list
> >> >> > archive at Nabble.com.
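
For anyone following along, here is a rough sketch of how the four -yz values I tried would map to znode paths, under the assumption (not confirmed anywhere in this thread) that the namespace is simply appended to the configured root. The helper `zk_path` is hypothetical and is not Flink code; it only does string joining so the resulting paths can be compared against what is actually in ZK.

```python
def zk_path(root: str, namespace: str) -> str:
    """Join a ZooKeeper root and a namespace into one normalized path.

    Hypothetical helper: assumes the namespace is appended to the root,
    which may not match what Flink really does.
    """
    # Split on "/" and drop empty pieces so duplicate or trailing
    # slashes do not end up in the result.
    parts = [p for p in (root + "/" + namespace).split("/") if p]
    return "/" + "/".join(parts)


ROOT = "/flink"  # the ZooKeeper root from my flink-conf.yaml

# The four values I passed to -yz.
candidates = [
    "/application_1477475694024_0015",
    "application_1477475694024_0015",
    "/flink/application_1477475694024_0015/",
    "/flink/application_1477475694024_0015",
]

resolved = {ns: zk_path(ROOT, ns) for ns in candidates}

for ns, path in resolved.items():
    print(f"-yz {ns:<45} -> {path}")
```

If that assumption holds, only the first two values would point at the existing /flink/application_1477475694024_0015 directory, and the last two would point one level deeper. If it does not hold, the mapping is different and this sketch should be ignored.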