Re: Flink Application on YARN failed on losing Job Manager | No recovery | Need help debug the cause from logs

Josh Fri, 04 Nov 2016 07:29:31 -0700

Hi Ufuk,

I see, but in my case the failure caused the YARN application to move into
a finished/failed state, so the application itself is no longer running. How
can I restart the application (or start a new YARN application) and ensure
that it uses the checkpoint pointer stored in ZooKeeper?


Thanks,
Josh

On Fri, Nov 4, 2016 at 1:52 PM, Ufuk Celebi <u...@apache.org> wrote:

> No, you don't need to manually trigger a savepoint. With HA, checkpoints
> are persisted externally and a pointer to them is stored in ZooKeeper, so
> they can be recovered after a JobManager failure.
>
> On Fri, Nov 4, 2016 at 2:27 PM, Josh <jof...@gmail.com> wrote:
> > I have a follow up question to this - if I'm running a job in
> > 'yarn-cluster' mode with HA and then at some point the YARN application
> > fails due to some hardware failure (i.e. the YARN application moves to
> > "FINISHED"/"FAILED" state), how can I restore the job from the most
> > recent checkpoint?
> >
> > I can use `flink run -m yarn-cluster -s s3://my-savepoints/id .....` to
> > restore from a savepoint, but what if I haven't manually taken a
> > savepoint recently?
> >
> > Thanks,
> > Josh
> >
> > On Fri, Nov 4, 2016 at 10:06 AM, Maximilian Michels <m...@apache.org> wrote:
> >>
> >> Hi Anchit,
> >>
> >> The documentation mentions that you need Zookeeper in addition to
> >> setting the application attempts. Zookeeper is needed to retrieve the
> >> current leader for the client and to filter out old leaders in case
> >> multiple exist (old processes could even stay alive in Yarn). Moreover,
> >> it is needed to persist the state of the application.
> >>
> >>
> >> -Max
> >>
> >>
> >> On Thu, Nov 3, 2016 at 7:43 PM, Anchit Jatana
> >> <development.anc...@gmail.com> wrote:
> >> > Hi Maximilian,
> >> >
> >> > Thanks for your response. Since I'm running the application on a YARN
> >> > cluster in 'yarn-cluster' mode (i.e. using the
> >> > 'flink run -m yarn-cluster ..' command), is there anything more that
> >> > I need to configure apart from setting the
> >> > 'yarn.application-attempts: 10' property in conf/flink-conf.yaml?
> >> >
> >> > I just wanted to confirm whether there is anything more that I need
> >> > to configure to set up HA in 'yarn-cluster' mode.
> >> >
> >> > Thank you
> >> >
> >> > Regards,
> >> > Anchit
> >> >
> >
> >
>
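
For reference, a minimal sketch of the HA-related entries in
conf/flink-conf.yaml that the quoted discussion refers to. Only
`yarn.application-attempts: 10` comes from the thread. The ZooKeeper hosts,
the storage directory and the root path are placeholders, and the key names
are the `recovery.*` ones as I recall them from the Flink 1.1-era
documentation (later releases renamed them to `high-availability.*`), so
please check them against the docs for the version in use.

```yaml
# Sketch only: hosts and paths below are placeholders, not values from the thread.

# Enable ZooKeeper-based HA (assumed Flink 1.1-era key names).
recovery.mode: zookeeper
# ZooKeeper quorum used for leader election and the checkpoint pointer.
recovery.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181
# Durable storage for the JobManager metadata that ZooKeeper points to.
recovery.zookeeper.storageDir: hdfs:///flink/recovery
# Root ZooKeeper node under which Flink keeps its HA data.
recovery.zookeeper.path.root: /flink

# From the thread: how many times YARN may restart the application master.
yarn.application-attempts: 10
```

These settings cover restarts within the same YARN application. They do not
by themselves answer how to resume once the application has reached a
FINISHED/FAILED state, which is the open question above.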
