Hi,

We have identified the cause of the problem: ConfigMap updates on the Kubernetes cluster were taking increasingly longer as the number of jobs grew (because the ConfigMaps themselves kept growing). The default *high-availability.kubernetes.leader-election.retry-period=5s* turned out to be too frequent for our case: work arrived at *Executors.newCachedThreadPool(new ExecutorThreadFactory("config-map-watch-handler"))* (created by *KubernetesLeaderElectionHaServices*) faster than it could be processed, and since a cached thread pool has no upper bound on its size, the number of threads grew without limit and JM memory usage rose with it. Increasing *high-availability.kubernetes.leader-election.retry-period* to 50s resolved the problem.
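For anyone who wants to reproduce the thread-growth effect in isolation, below is a minimal standalone sketch (plain java.util.concurrent, not Flink code; the class name, durations, and retry period are made up for illustration). It models the situation we hit: the "update" handled by the pool keeps getting slower while the submission interval stays fixed, and because a cached thread pool has no size limit, poolSize keeps climbing instead of stabilizing.

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolGrowthDemo {
    public static void main(String[] args) throws InterruptedException {
        // newCachedThreadPool() has no upper bound on pool size: whenever a task is
        // submitted and no idle thread is available, a new thread is created.
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newCachedThreadPool();

        long retryPeriodMs = 200; // stands in for a (too short) leader-election retry period
        for (int i = 1; i <= 50; i++) {
            long updateMs = 100L * i; // the simulated "ConfigMap update" gets slower over time
            pool.submit(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(updateMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Pool size grows as more and more updates overlap.
            System.out.println("submitted=" + i + " poolSize=" + pool.getPoolSize());
            TimeUnit.MILLISECONDS.sleep(retryPeriodMs);
        }
        pool.shutdown();
    }
}

In our cluster the same dynamic stopped once the retry period (50s) comfortably exceeded the time needed to process a ConfigMap update.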
During the analysis of this case, several additional questions arose regarding task recovery:

1. Based on the JobManager logs, tasks on the cluster appear to recover sequentially (if our analysis is correct). Is there a way to parallelize the recovery process, or are there configuration options that improve recovery performance?

2. During JobManager restarts, we occasionally observe uncontrolled creation of TaskManagers beyond the required number. These TaskManagers eventually turn out to be unnecessary (all their slots are free) and are removed after a timeout. Why does this happen? We tried the *resourcemanager.previous-worker.recovery.timeout* option, but it had no effect.

Tue, 18 Mar 2025 at 07:59, Vladislav Keda <vladislavk...@gmail.com>:

> Hi,
>
> We use a Native Kubernetes Session cluster with Kubernetes HA enabled for
> running Flink streaming jobs. Over time, the number of running jobs
> approached 100, and we noticed that the recovery time (when the JM crashes)
> grows exponentially with the number of jobs. Below are graphs of when
> *"Job ... switched from state CREATED to RUNNING"* appears in the JM logs
> while running 80 jobs in different namespaces:
> [attached: two graphs of these log timestamps]
>
> When submitting jobs manually (without HA recovery), we see linear growth
> up to about 200 jobs: starting 50 jobs takes about 15 minutes on average,
> 100 jobs about 30 minutes, and 200 jobs about an hour.
>
> While jobs are recovering, the JM and TM resources (CPU and RAM) stay
> below half of their limits. The last checkpoint of each job (from which it
> is restored) is no larger than 20 MB, and downloading it from S3 takes no
> more than 5 seconds.
>
> So why do we see an exponential increase in job recovery time? Do you need
> any additional information?
>
> P.S. We use Flink 1.18.1
>