Hi all,

We run the Flink operator on GKE, deploying one Flink job per JobManager. We use org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory for high availability, so the JobManager relies on ConfigMaps for checkpoint metadata and leader election. Whenever the Kubernetes API server returns an error (5xx or 4xx), the JobManager pod is restarted. This happens sporadically, roughly every 1-2 days, for some of the 400 jobs running in the same cluster, each with its own JobManager pod.
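For context, our HA-related configuration looks roughly like the following (the cluster id and storage path below are illustrative placeholders, not our real values):

    # One Flink deployment per job; each gets its own cluster-id (placeholder below)
    kubernetes.cluster-id: <one-cluster-id-per-job>
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    # Illustrative storage path, not our real bucket
    high-availability.storageDir: gs://<ha-bucket>/flink/recovery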
What might be causing these errors from the Kubernetes API server? One possibility is that when the JM writes a ConfigMap and tries to read it back immediately afterwards, the read returns a 404. Are there any configuration options to increase heartbeats, timeouts, or retries so that a temporary disconnection from the Kubernetes API server does not cause the JobManager to restart?
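Concretely, these are the options I was thinking of tuning. The names and defaults are what I understand from the Flink Kubernetes HA documentation, so please correct me if they are wrong or if there are better knobs; the values below are just an example of relaxing the defaults, not a recommendation:

    # Leader-election timings (defaults 15 s / 15 s / 5 s, if I read the docs correctly)
    high-availability.kubernetes.leader-election.lease-duration: 60 s
    high-availability.kubernetes.leader-election.renew-deadline: 60 s
    high-availability.kubernetes.leader-election.retry-period: 10 s
    # Retries for the HA ConfigMap read-modify-write operations (default 5, I believe)
    kubernetes.transactional-operation.max-retries: 15

Thank you!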