Hi,

Externalized checkpoints [1] seem to be exactly what you are looking for.
By default, checkpoints are not persisted externally. If you configure them to be externalized, they are not automatically cleaned up when the job fails, and they can be used to resume the job.

On the other hand, it would be interesting to understand why your savepoint restore sometimes fails. If you suspect it could be an issue with Flink, could you provide more details on the failure?

Cheers,
Gordon

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.4/ops/state/checkpoints.html#externalized-checkpoints
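
As a rough illustration of the externalized checkpoint setup mentioned above, here is a minimal sketch against the Flink 1.4 DataStream API, based on the linked documentation. The class name, job name, checkpoint interval, and the trivial pipeline are placeholders I made up for the example; only the checkpoint configuration calls matter.

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job class, for illustration only.
public class ExternalizedCheckpointExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds (the interval is an assumption).
        env.enableCheckpointing(60_000L);

        // Externalize checkpoints and keep them when the job is cancelled,
        // so they can be used later to resume the job.
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        checkpointConfig.enableExternalizedCheckpoints(
                ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Placeholder pipeline; replace with your actual sources and operators.
        env.fromElements(1, 2, 3).print();

        env.execute("externalized-checkpoint-example");
    }
}
```

According to the same documentation page, the directory for the checkpoint metadata is taken from the `state.checkpoints.dir` key in `flink-conf.yaml`, and a job can be resumed from a retained checkpoint in the same way as from a savepoint, by passing the checkpoint metadata path to `bin/flink run -s`. Please double-check both against the docs for the Flink version you are running.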