How do you deploy Flink on Kubernetes? Do you use the standalone <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/> or native <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/> mode?

Is it really just task managers going down? It seems unlikely that the loss of a TM could have such an effect.

Can you provide us with the JobManager logs at the time the TM crash occurred? They should contain some hints as to how Flink handled the TM failure.
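
If the full log is large, something like the following could help narrow it down before sending. This is only a rough sketch: it assumes the JobManager log has already been saved to a local file, and the keyword list and the function name `relevant_lines` are my own guesses for illustration, not anything defined by Flink.

```python
import re

# Assumed keywords (not an official list); adjust them to whatever
# your JobManager actually logs around the time of the TM failure.
KEYWORDS = re.compile(r"checkpoint|restor|restart|failover", re.IGNORECASE)


def relevant_lines(lines):
    """Return the lines that mention checkpointing, restore, or restart handling."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


# Example usage, assuming the log was saved as jobmanager.log (hypothetical name):
#
#     with open("jobmanager.log", encoding="utf-8", errors="replace") as f:
#         for line in relevant_lines(f):
#             print(line)
```

The unfiltered log around the time of the crash is still the most useful thing to share, since the filter may drop relevant context.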


On 19/08/2021 16:06, Kevin Lam wrote:
Hi all,

I've noticed that sometimes when task managers go down, the job does not appear to be restored from a checkpoint but is instead restarted from a fresh state (when I go to the job's checkpoint tab in the UI, I don't see the restore, and the numbers in the job overview all get reset). Under what circumstances does this happen?

I've been trying to debug this, and we really want the job to restore from the checkpoint at all times for our use case.

We're running Apache Flink 1.13 on Kubernetes in a high availability set-up.

Thanks in advance!

