Hi,

We use a native Kubernetes session cluster with Kubernetes HA enabled to run Flink streaming jobs. Over time the number of running jobs approached 100, and we noticed that the recovery time (when the JM crashes) grows exponentially with the number of jobs. Below are graphs of when the *"Job ... switched from state CREATED to RUNNING"* messages appear in the JM logs while recovering 80 jobs in different namespaces:

[two graphs attached]
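For context, the HA- and checkpoint-related settings we use look roughly like this (a simplified sketch; the cluster id and bucket paths are placeholders, not our exact values):

    kubernetes.cluster-id: <cluster-id>                       # used to name the HA ConfigMaps
    high-availability.type: kubernetes                        # Kubernetes-based HA services
    high-availability.storageDir: s3://<bucket>/flink/ha      # serialized HA metadata (job graphs etc.)
    state.checkpoints.dir: s3://<bucket>/flink/checkpoints    # checkpoint data

With this setup, the HA pointers (job graphs, latest checkpoint handles) go through ConfigMaps plus the storageDir above, while the checkpoint data itself lives in S3.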
When manually submitting jobs (without HA recovery), we see linear growth out to about 200 jobs: starting 50 jobs takes about 15 minutes on average, 100 jobs about 30 minutes, and 200 jobs about an hour, i.e. roughly 18 seconds per job in each case. During recovery, the JM and TM resources (CPU and RAM) stay below half of their limits. The last checkpoint of each job (the one it restores from) is no larger than 20 MB, and downloading it from S3 takes no more than 5 seconds. So why do we see an exponential increase in job recovery time? Do you need any additional information?

P.S. We use Flink 1.18.1.