Hi Lavkesh,
FLINK-33998 [1] sounds quite similar to what you describe.

The solution there was to upgrade to Flink 1.14.6. I didn't have the
capacity to look into the details, considering that Flink 1.14 is no
longer officially supported by the community and a fix seems to have been
provided in a newer version.
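
In case an upgrade is not an option right away, the first knob I would
look at is the retry count for the transactional ConfigMap operations the
Kubernetes HA services perform. A minimal sketch of the relevant
flink-conf.yaml entries (the value is only an example, please verify it
against the 1.14 documentation):

    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    # Retries for transactional ConfigMap operations (e.g. check-and-update)
    # before the client gives up; the default is 5.
    kubernetes.transactional-operation.max-retries: 10

There are also timing options for the Kubernetes leader election, see the
note below the quoted mail.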

Matthias

[1] https://issues.apache.org/jira/browse/FLINK-33998

On Mon, Feb 5, 2024 at 6:18 AM Lavkesh Lahngir <lavk...@linux.com> wrote:

> Hi, a few more details:
> We are running GKE version 1.27.7-gke.1121002
> and using Flink version 1.14.3.
>
> Thanks!
>
> On Mon, 5 Feb 2024 at 12:05, Lavkesh Lahngir <lavk...@linux.com> wrote:
>
> > Hi all,
> >
> > We run a Flink operator on GKE, deploying one Flink job per JobManager.
> > We utilize
> > org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
> > for high availability. The JobManager uses ConfigMaps for checkpointing
> > and leader election. If, at any point, the Kube API server returns an
> > error (5xx or 4xx), the JM pod is restarted. This occurrence is sporadic,
> > happening every 1-2 days for some jobs among the 400 running in the same
> > cluster, each with its own JobManager pod.
> >
> > What might be causing these errors from the Kube API server? One
> > possibility is that the JM writes a ConfigMap and immediately tries to
> > read it back, which could result in a 404 error.
> > Are there any configurations to increase heartbeats or timeouts, in case
> > temporary disconnections from the Kube API server are the cause?
> >
> > Thank you!
> >
>
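
Regarding the heartbeat/timeout question: the Kubernetes-based leader
election has dedicated timing options. A sketch with illustrative values
(please double-check them against the 1.14 documentation):

    # How long an acquired leader lease is valid.
    high-availability.kubernetes.leader-election.lease-duration: 30s
    # Deadline for the current leader to renew its lease.
    high-availability.kubernetes.leader-election.renew-deadline: 30s
    # Pause between retries of the leader-election operations.
    high-availability.kubernetes.leader-election.retry-period: 10s

Raising these mainly makes the leader election more tolerant of slow or
briefly unavailable API responses; it won't by itself prevent the 5xx/4xx
errors on other ConfigMap operations.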
