Hi,

We use a native Kubernetes session cluster with Kubernetes HA enabled to run Flink streaming jobs. Over time the number of running jobs approached 100, and we noticed that the recovery time (when the JM crashes) grows exponentially with the number of jobs. Below are graphs of when the *"Job ... switched from state CREATED to RUNNING"* messages appear in the JM logs while recovering 80 jobs in different namespaces:

[two graphs attached]
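For context, the HA- and checkpoint-related settings we use look roughly like this (a simplified sketch; the cluster id and bucket paths are placeholders, not our exact values):

    kubernetes.cluster-id: <cluster-id>                       # used to name the HA ConfigMaps
    high-availability.type: kubernetes                        # Kubernetes-based HA services
    high-availability.storageDir: s3://<bucket>/flink/ha      # serialized HA metadata (job graphs etc.)
    state.checkpoints.dir: s3://<bucket>/flink/checkpoints    # checkpoint data

With this setup, the HA pointers (job graphs, latest checkpoint handles) go through ConfigMaps plus the storageDir above, while the checkpoint data itself lives in S3.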
When manually submitting jobs (without HA recovery), we see linear growth out to about 200 jobs: starting 50 jobs takes about 15 minutes on average, 100 jobs about 30 minutes, and 200 jobs about an hour, i.e. roughly 18 seconds per job in each case. During recovery, the JM and TM resources (CPU and RAM) stay below half of their limits. The last checkpoint of each job (the one it restores from) is no larger than 20 MB, and downloading it from S3 takes no more than 5 seconds. So why do we see an exponential increase in job recovery time? Do you need any additional information?

P.S. We use Flink 1.18.1.