Hi all,

We run the Flink operator on GKE, deploying one Flink job per JobManager. We use org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory for high availability, so the JobManager relies on ConfigMaps for checkpoint metadata and leader election. Whenever the Kubernetes API server returns an error (5xx or 4xx), the JobManager pod is restarted. This happens sporadically, roughly every 1-2 days, for some of the 400 jobs running in the same cluster, each with its own JobManager pod.
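For context, our HA-related configuration looks roughly like the following (the cluster id and storage path below are illustrative placeholders, not our real values):

    # One Flink deployment per job; each gets its own cluster-id (placeholder below)
    kubernetes.cluster-id: <one-cluster-id-per-job>
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    # Illustrative storage path, not our real bucket
    high-availability.storageDir: gs://<ha-bucket>/flink/recovery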
What might be causing these errors from the Kubernetes API server? One possibility is that when the JM writes a ConfigMap and tries to read it back immediately afterwards, the read returns a 404. Are there any configuration options to increase heartbeats, timeouts, or retries so that a temporary disconnection from the Kubernetes API server does not cause the JobManager to restart?
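Concretely, these are the options I was thinking of tuning. The names and defaults are what I understand from the Flink Kubernetes HA documentation, so please correct me if they are wrong or if there are better knobs; the values below are just an example of relaxing the defaults, not a recommendation:

    # Leader-election timings (defaults 15 s / 15 s / 5 s, if I read the docs correctly)
    high-availability.kubernetes.leader-election.lease-duration: 60 s
    high-availability.kubernetes.leader-election.renew-deadline: 60 s
    high-availability.kubernetes.leader-election.retry-period: 10 s
    # Retries for the HA ConfigMap read-modify-write operations (default 5, I believe)
    kubernetes.transactional-operation.max-retries: 15

Thank you!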