Hello,

We have deployed multiple Flink clusters on Kubernetes, each with one JobManager replica and as many TaskManagers as required. Recently we have observed that when we increase the number of TaskManagers in a cluster, the JobManager becomes unresponsive. It stops sending StatsD metrics for irregular periods of time. The JobManager pod also keeps restarting because it stops responding to the liveness probe, which results in Kubernetes killing the pod. We tried increasing the resources allocated to it (CPU, RAM), but that didn't help.

Regards,
Prakhar Mathur
Product Engineer
GO-JEK