Hi Chen,

Can you tell us a bit more about the job you are running? The behaviour you are looking for can only be achieved if the Kubernetes HA services are enabled [1][2]. Without them, the jobmanager has no way to recover the latest checkpoint after a failure.
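For reference, enabling the Kubernetes HA services in a standalone deployment typically comes down to a few entries in flink-conf.yaml, as described in [1]. This is just a minimal sketch; the cluster id and the storage path are placeholders you would replace with your own values:

```yaml
# Unique id for this Flink cluster; all HA ConfigMaps are prefixed with it.
kubernetes.cluster-id: my-application-cluster

# Activate the Kubernetes-based HA services (Flink 1.13 factory class).
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory

# Durable storage where checkpoint/job metadata for recovery is persisted
# (placeholder path; any supported filesystem such as s3:// or hdfs:// works).
high-availability.storageDir: s3://my-bucket/flink/recovery
```

With these set, a restarted jobmanager will look up the latest checkpoint from the HA metadata instead of falling back to the savepoint given via --fromSavepoint.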
Best,
Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#high-availability-with-standalone-kubernetes

> On 14. May 2021, at 10:21, ChangZhuo Chen (陳昌倬) <czc...@czchen.org> wrote:
>
> Hi,
>
> Recently, we changed our deployment to a Kubernetes Standalone Application
> Cluster for reactive mode. Following [0], we use a Kubernetes Job with
> --fromSavepoint to upgrade our application without losing state. The Job
> config is identical to the one in the documentation.
>
> However, we found that in this setup, if the jobmanager fails,
> Kubernetes restarts it from the original savepoint specified in
> `--fromSavepoint` instead of the latest checkpoint. This causes
> problems for long-running jobs.
>
> Any idea how to make Flink restore from the latest checkpoint after a
> jobmanager failure in a Kubernetes Standalone Application Cluster?
>
>
> [0]
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster
>
>
> --
> ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
> http://czchen.info/
> Key fingerprint = BA04 346D C2E1 FE63 C790 8793 CC65 B0CD EC27 5D5B