Re: Job manager sometimes doesn't restore job from checkpoint post TaskManager failure

Chesnay Schepler Thu, 19 Aug 2021 11:04:48 -0700

How do you deploy Flink on Kubernetes? Do you use the standalone <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/> or native <https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/> mode?

Is it really just task managers going down? It seems unlikely that the loss of a TM could have such an effect.

Can you provide us with the JobManager logs from the time the TM crash occurred? They should contain some hints as to how Flink handled the TM failure.



On 19/08/2021 16:06, Kevin Lam wrote:

Hi all,
I've noticed that sometimes when task managers go down, it looks like the job is not restored from a checkpoint, but is instead restarted from a fresh state (when I go to the job's checkpoint tab in the UI, I don't see the restore, and the numbers in the job overview all get reset). Under what circumstances does this happen?
I've been trying to debug and we really want the job to restore from the checkpoint at all times for our use case.
We're running Apache Flink 1.13 on Kubernetes in a high availability set-up.
Thanks in advance!
