Hi Dominik,
the Web UI shows you the status of each checkpoint [0], so it should be
possible to retrieve this information via REST calls as well. For planned
restarts, you should usually trigger a savepoint: if the savepoint
completes successfully, you can be sure it is safe to restart from it.
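For the REST route, here is a minimal sketch. The endpoint path and JSON field names follow the checkpoint monitoring page linked below, but they can differ between Flink versions, so please verify them against your cluster; the host/port and the sample payload here are made up for illustration:

```python
import json
from urllib.request import urlopen  # only needed when querying a live cluster

def latest_completed_checkpoint(stats):
    """Return (id, external_path) of the latest completed checkpoint, or None."""
    completed = (stats.get("latest") or {}).get("completed")
    if not completed:
        return None
    return completed.get("id"), completed.get("external_path")

def fetch_checkpoint_stats(base_url, job_id):
    # base_url would be the JobManager web endpoint, e.g. "http://localhost:8081"
    with urlopen("{}/jobs/{}/checkpoints".format(base_url, job_id)) as resp:
        return json.load(resp)

# Abbreviated sample payload in the documented shape (field names are an
# assumption to double-check; the path below is invented):
sample = {
    "latest": {
        "completed": {
            "id": 42,
            "status": "COMPLETED",
            "external_path": "hdfs:///checkpoints/chk-42",
        }
    }
}
print(latest_completed_checkpoint(sample))  # → (42, 'hdfs:///checkpoints/chk-42')
```

In a restart script you would call fetch_checkpoint_stats() before taking the job down and feed the returned external_path into the resume command.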
Otherwise, the platform from data Artisans [1] might be interesting for
you; it aims to improve deployment and lifecycle management for streaming
applications (disclaimer: I work for them).
Regards,
Timo
[0]
https://ci.apache.org/projects/flink/flink-docs-release-1.4/monitoring/checkpoint_monitoring.html
[1] https://data-artisans.com/da-platform-2
On 11/22/17 at 10:41 AM, domi...@dbruhn.de wrote:
Hey,
we are running Flink 1.3.2 with streaming jobs and run into issues when
restarting a complete job (which can happen for various reasons: upgrading
the job, restarting the cluster, failures). The problem is that there is
no automated way to find out which checkpoint metadata (i.e. which
externalized checkpoint) we should resume from. We can always end up with
multiple of those files, and we then want to use the most recent one that
was successfully written.
Is there any tooling available that picks the latest good checkpoint? Or
at least a tool or command line we can use to validate a checkpoint, so
that we can pick the latest valid one?
How are others handling this? Manually?
I would be happy to get some input on this,
Dominik
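For the "pick the latest good checkpoint" part of the question, an ad-hoc sketch could look like the following: scan the directory of externalized checkpoint metadata files and take the most recently modified non-empty one. Note that this only checks that a file exists and has content; it does not prove the checkpoint is internally consistent, so triggering a savepoint before planned restarts remains the safer option. The flat directory layout is an assumption about how the checkpoint directory is configured.

```python
from pathlib import Path

def newest_checkpoint_metadata(checkpoint_dir):
    """Return the Path of the newest non-empty metadata file, or None."""
    candidates = [
        p for p in Path(checkpoint_dir).iterdir()
        if p.is_file() and p.stat().st_size > 0
    ]
    # The most recently written file is the best guess for the latest
    # successfully externalized checkpoint.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)
```

A restart wrapper could call this, refuse to start if it returns None, and otherwise pass the resulting path to the job submission.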