Tooling for resuming from checkpoints

dominik Wed, 22 Nov 2017 01:42:13 -0800

Hey,

we are running Flink 1.3.2 with streaming jobs, and we are running into issues when we restart a complete job (which can happen for various reasons: upgrading the job, restarting the cluster, failures). The problem is that there is no automated way to find out from which checkpoint metadata file (i.e. which externalized checkpoint) we should resume. We can always end up with several of those files, and in that case we want to use the most recent one that was written successfully.

Is there any tooling available already which picks the latest good checkpoint? Or at least a tool/command line which we can use to validate that a checkpoint is valid so we can pick the latest one?


How are others handling this? Manually?

Would be happy to get some input there,
Dominik
