This is great, I will try option 3 and let you know.
Can I log some message so I know job is recovered from the latest savepoint?

On Sun, Nov 11, 2018 at 10:42 AM Ufuk Celebi <u...@apache.org> wrote:

> Hey Hao and Paul,
>
> 1) Fetch checkpoint info manually from ZK (problematic, not recommended)
> - As Paul pointed out, this is problematic as the node is a serialized
> pointer (StateHandle) to a CompletedCheckpoint in the HA storage
> directory and not a path [1].
> - I would not recommend this approach at the moment
>
> 2) Using the HTTP API to fetch the latest savepoint pointer (possible,
> but cumbersome)
> - As Paul proposed, you could use /jobs/:jobid/checkpoints to fetch
> checkpoint statistics about the latest savepoint
> - The latest savepoint is available as a separate entry under
> `latest.savepoint` (If I'm reading the docs [2] correctly)
> - You would need to manually do this before shutting down (requires
> custom tooling to automate)
>
> 3) Use cancelWithSavepoint
> - If you keep `high-availability.cluster-id` consistent between
> executions of your job cluster, using cancelWithSavepoint [3] should
> add the the savepoint to ZK before cancelling the job
> - On the next execution of your job cluster, Flink should
> automatically pick it up (no need to attach a savepoint restore path
> manually)
>
> I've not tried 3) myself yet, but believe it should work. If you have
> time to try it out, I'd be happy to hear whether it works as expected
> for you.
>
> – Ufuk
>
> [1] I believe this is overly complicated and should be simplified in the
> future.
> [2] Search /jobs/:jobid/checkpoints in
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html
> <https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html>
> [3]
> https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html#job-cancellation
> <https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html#job-cancellation>
>
> On Fri, Nov 9, 2018 at 5:03 PM Hao Sun <ha...@zendesk.com> wrote:
> >
> > Can we add an option to allow job cluster mode to start from the latest
> save point? Otherwise I have to somehow get the info from ZK, before job
> cluster's container started by K8s.
> >
> > On Fri, Nov 9, 2018, 01:29 Paul Lam <paullin3...@gmail.com> wrote:
> >>
> >> Hi Hao,
> >>
> >> The savepoint path is stored in ZK, but it’s in binary format, so in
> order to retrieve the path you have to deserialize it back to some Flink
> internal object.
> >>
> >> A better approach would be using REST api to get the path. You could
> find it here[1].
> >>
> >> [1]
> https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html
> <https://ci.apache.org/projects/flink/flink-docs-release-1.6/monitoring/rest_api.html>
> >>
> >> Best,
> >> Paul Lam
> >>
> >>
> >> 在 2018年11月9日,13:55,Hao Sun <ha...@zendesk.com> 写道:
> >>
> >> Since this save point path is very useful to application updates, where
> is this information stored? Can we keep it in ZK or S3 for retrieval?
> >>
> >> <image.png>
> >>
> >>
>

Reply via email to