I have a follow-up question: if I'm running a job in 'yarn-cluster' mode
with HA and the YARN application later fails because of a hardware failure
(i.e. the YARN application moves to the "FINISHED"/"FAILED" state), how can
I restore the job from the most recent checkpoint?

I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to
restore from a savepoint, but what if I haven't manually taken a savepoint
recently?

Thanks,
Josh

On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org> wrote:

> Hi Anchit,
>
> The documentation mentions that you need ZooKeeper in addition to
> setting the application attempts. ZooKeeper is needed so the client can
> retrieve the current leader and to filter out old leaders in case
> multiple exist (old processes could even stay alive in YARN). Moreover, it
> is needed to persist the state of the application.
>
>
> -Max
>
>
> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana
> <development.anc...@gmail.com> wrote:
> > Hi Maximilian,
> >
> > Thanks for your response. I'm running the application on a YARN cluster
> > in 'yarn-cluster' mode, i.e. with the 'flink run -m yarn-cluster ..'
> > command. Is there anything more that I need to configure apart from
> > setting the 'yarn.application-attempts: 10' property in
> > conf/flink-conf.yaml?
> >
> > I just wanted to confirm whether anything else is needed to set up HA
> > in 'yarn-cluster' mode.
> >
> > Thank you
> >
> > Regards,
> > Anchit
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Flink-Application-on-YARN-failed-on-losing-Job-Manager-No-recovery-Need-help-debug-the-cause-from-los-tp9839p9887.html
> > Sent from the Apache Flink User Mailing List archive at Nabble.com.
>
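
For reference, the HA setup discussed in this thread (ZooKeeper plus
application attempts) might look roughly like the flink-conf.yaml fragment
below. This is only a sketch: the ZooKeeper hosts and the storage path are
placeholders, and the HA key names are an assumption that depends on the
Flink version (releases from around the time of this thread used a
`recovery.` prefix instead of `high-availability.`), so check the
documentation for the version in use.

```yaml
# conf/flink-conf.yaml (sketch; hosts and paths are hypothetical placeholders)

# Maximum number of YARN application attempts, as quoted in the thread.
yarn.application-attempts: 10

# ZooKeeper-based HA. These key names are an assumption and vary by Flink
# version; older releases spelled them with a "recovery." prefix.
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181

# Durable storage for the state that ZooKeeper only holds pointers to.
# This key was also renamed in later releases.
high-availability.zookeeper.storageDir: hdfs:///flink/recovery
```

The Flink documentation also notes that YARN's own limit on application
master attempts (yarn.resourcemanager.am.max-attempts in yarn-site.xml)
caps whatever value is set for yarn.application-attempts.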
