How do you deploy Flink on Kubernetes? Do you use the standalone
<https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/>
or native
<https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/>
mode?
Is it really just task managers going down? It seems unlikely that the
loss of a TM could have such an effect.
Can you provide us with the JobManager logs at the time the TM crash
occurred? They should contain some hints as to how Flink handled the TM
failure.
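If the full log is too large to share, a small filter along these lines could help narrow it down first. This is only a rough sketch: the keyword list and the function names are my own assumptions for illustration, not Flink's exact log messages or any Flink API, so please still attach the unfiltered log if you can.

```python
import re
from typing import Iterable, List

# Assumption: these keywords are a guess at what is relevant to recovery;
# they are not taken from Flink's actual log message templates.
KEYWORDS = re.compile(r"checkpoint|restor|restart|failover", re.IGNORECASE)


def relevant_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines that mention any of the recovery-related keywords."""
    return [line.rstrip("\n") for line in lines if KEYWORDS.search(line)]


def relevant_lines_from_file(path: str) -> List[str]:
    """Read a saved log file (hypothetical path) and filter it line by line."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return relevant_lines(handle)
```

For example, after saving the JobManager pod's output to a file (say with `kubectl logs`), calling `relevant_lines_from_file("jobmanager.log")` would return only the lines around checkpointing and restarts; the file name here is made up.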
On 19/08/2021 16:06, Kevin Lam wrote:
Hi all,
I've noticed that sometimes when task managers go down, the job is not
restored from a checkpoint but is instead restarted from a fresh state
(when I go to the job's checkpoint tab in the UI, I don't see the
restore, and the numbers in the job overview all get reset).
Under what circumstances does this happen?
I've been trying to debug this, and for our use case we really need the
job to restore from the checkpoint at all times.
We're running Apache Flink 1.13 on Kubernetes in a high-availability
setup.
Thanks in advance!