Hi Chen,

Can you tell us a bit more about the job you are running?
The behaviour you are looking for can only be achieved
if the Kubernetes HA services are enabled [1][2].
Otherwise the job cannot recover from past checkpoints.
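For reference, enabling the Kubernetes HA services for a standalone deployment comes down to a few entries in flink-conf.yaml. This is only a minimal sketch; the cluster id and storage path below are placeholders you would replace with your own values:

```yaml
# Unique id for this Flink cluster (placeholder value)
kubernetes.cluster-id: my-flink-cluster

# Enable the Kubernetes-based HA services
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory

# Durable storage for HA metadata and checkpoints (placeholder bucket)
high-availability.storageDir: s3://my-flink-bucket/ha
```

With these set, a restarted jobmanager recovers the latest checkpoint from the HA storage instead of falling back to the savepoint given via --fromSavepoint.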

Best,
Fabian

[1] 
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up
[2] 
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#high-availability-with-standalone-kubernetes
> On 14. May 2021, at 10:21, ChangZhuo Chen (陳昌倬) <czc...@czchen.org> wrote:
> 
> Hi,
> 
> Recently, we changed our deployment to Kubernetes Standalone Application
> Cluster for reactive mode. According to [0], we use Kubernetes Job with
> --fromSavepoint to upgrade our application without losing state. The Job
> config is identical to the one in document.
> 
> However, we found that in this setup, if the jobmanager fails,
> Kubernetes restarts it from the original savepoint specified in
> `--fromSavepoint`, instead of from the latest checkpoint. This causes
> problems for a long-running job.
> 
> Any idea how to make Flink restore from the latest checkpoint after a
> jobmanager failure in a Kubernetes Standalone Application Cluster?
> 
> 
> [0] 
> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster
> 
> 
> -- 
> ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
> http://czchen.info/
> Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B
