Hi,

Recently, we changed our deployment to a Kubernetes Standalone Application Cluster for reactive mode. Following [0], we use a Kubernetes Job with `--fromSavepoint` to upgrade our application without losing state. The Job config is identical to the one in the document.
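For context, this is roughly the relevant part of our Job spec, following the example in [0]; the job class name and savepoint path below are illustrative placeholders, not our real values:

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: flink-jobmanager
    spec:
      template:
        spec:
          restartPolicy: OnFailure   # Kubernetes restarts the JobManager pod on failure
          containers:
            - name: jobmanager
              image: flink:1.13
              args:
                - standalone-job
                - --job-classname
                - com.example.MyJob                        # illustrative class name
                - --fromSavepoint
                - s3://my-bucket/savepoints/savepoint-xxxx # illustrative savepoint path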
However, we found that in this setup, if the JobManager fails, Kubernetes restarts it from the original savepoint specified in `--fromSavepoint` instead of from the latest checkpoint. This causes problems for long-running jobs. Any idea how to make Flink restore from the latest checkpoint after a JobManager failure in a Kubernetes Standalone Application Cluster?

[0] https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deploy-application-cluster

--
ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org http://czchen.info/
Key fingerprint = BA04 346D C2E1 FE63 C790 8793 CC65 B0CD EC27 5D5B