Hi, could you please help with the previous questions?
Thanks in advance!

Wed, Mar 26, 2025, 12:06 Vladislav Keda <vladislavk...@gmail.com>:

> Hi,
>
> We have identified the cause of the problem: ConfigMaps on the Kubernetes
> cluster were taking increasingly longer to update as the number of jobs
> grew (because the ConfigMaps kept getting larger). The default value
> *high-availability.kubernetes.leader-election.retry-period=5s* was too
> frequent for our case, leading to an unbounded increase in threads within
> *Executors.newCachedThreadPool(new ExecutorThreadFactory("config-map-watch-handler"))*,
> which is initialized by *KubernetesLeaderElectionHaServices*. This caused
> uncontrolled thread growth and rising JM memory usage. Increasing
> *high-availability.kubernetes.leader-election.retry-period* to 50s
> resolved the problem.
>
> While analyzing this case, several additional questions arose regarding
> task recovery:
> 1. Based on the JobManager logs, tasks on the cluster appear to recover
> sequentially (if our analysis is correct). Is there a way to parallelize
> the recovery process, or are there configuration options to improve
> recovery performance?
> 2. During JobManager restarts, we occasionally observe uncontrolled
> creation of TaskManagers exceeding the required number. These TaskManagers
> eventually become unnecessary (all of their slots are free) and are
> removed after a timeout. Why does this happen? We tried the
> *resourcemanager.previous-worker.recovery.timeout* option, but it had no
> effect.
>
> Tue, Mar 18, 2025 at 07:59, Vladislav Keda <vladislavk...@gmail.com>:
>
>> Hi,
>>
>> We use a Native Kubernetes Session cluster with Kubernetes HA enabled to
>> run Flink streaming jobs. Over time, the number of running jobs
>> approached 100, and we noticed that the recovery time (when the JM
>> crashes) grows exponentially with the number of jobs. Below are graphs of
>> the timestamps at which *"Job ... switched from state CREATED to
>> RUNNING"* appears in the JM logs when running 80 jobs in different
>> namespaces:
>> [two image attachments: graphs of these log timestamps]
>>
>> When submitting jobs manually (without HA recovery), growth is linear up
>> to at least 200 jobs: starting 50 jobs takes about 15 minutes on average,
>> 100 jobs about 30 minutes, and 200 jobs about an hour.
>>
>> While jobs are recovering, the JM and TM resources (CPU and RAM) are
>> occupied by no more than half of their limits. The last checkpoint of
>> each job (from which it is restored) is no larger than 20 MB, and
>> downloading it from S3 takes no more than 5 seconds.
>>
>> So why do we see an exponential increase in job recovery time? Do you
>> need any additional information?
>>
>> P.S. We use Flink 1.18.1
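
For anyone who runs into the same behaviour: the thread growth described in
the Mar 26 message above can be reproduced outside of Flink. The sketch
below is not Flink's actual code (the class name and all timings are
invented for illustration); it only demonstrates that
Executors.newCachedThreadPool() has no upper bound on its thread count and
starts a new thread whenever a task is submitted while none is idle. When
the submission interval (the analogue of the leader-election retry period)
is shorter than the time each task takes (the analogue of a slow ConfigMap
update), the pool keeps growing; with a longer interval a single thread is
reused.

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolGrowthDemo {

    // Submits slow tasks at a fixed interval and returns the peak pool size.
    static int simulate(long retryIntervalMs, long taskDurationMs, int retries)
            throws InterruptedException {
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newCachedThreadPool();
        int peak = 0;
        for (int i = 0; i < retries; i++) {
            pool.submit(() -> {
                try {
                    Thread.sleep(taskDurationMs); // stand-in for a slow ConfigMap update
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread.sleep(retryIntervalMs); // stand-in for the leader-election retry period
            peak = Math.max(peak, pool.getPoolSize());
        }
        pool.shutdown();
        pool.awaitTermination(30, TimeUnit.SECONDS);
        return peak;
    }

    public static void main(String[] args) throws InterruptedException {
        // Interval shorter than the task duration -> threads pile up (~10 here).
        System.out.println("short retry period, peak threads = "
                + simulate(100, 1_000, 50));
        // Interval longer than the task duration -> one thread is reused.
        System.out.println("long retry period, peak threads = "
                + simulate(1_500, 1_000, 10));
    }
}

In the real cluster the "task duration" is the ConfigMap update time, which,
per the analysis above, grew with the number of jobs, so the pool (and JM
memory) kept growing until the retry period was raised above it.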
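To reproduce the recovery-time graphs from the Mar 18 message, a small
sketch like the following can pull the relevant timestamps out of a
JobManager log. The class name is made up, and the timestamp layout and log
line structure are assumptions based on Flink's default log4j pattern and
the quoted message text; adjust the regex to your logging configuration.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public class RecoveryTimeline {

    // Assumed timestamp layout "yyyy-MM-dd HH:mm:ss,SSS"; whatever appears
    // between "Job" and "switched ..." is captured as-is.
    private static final Pattern CREATED_TO_RUNNING = Pattern.compile(
            "^(\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2},\\d{3}).*"
                    + "Job (.+?) switched from state CREATED to RUNNING");

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the JobManager log file.
        try (Stream<String> lines = Files.lines(Path.of(args[0]))) {
            lines.map(CREATED_TO_RUNNING::matcher)
                    .filter(Matcher::find)
                    .forEach(m -> System.out.println(
                            m.group(1) + "\t" + m.group(2)));
        }
    }
}

Plotting the cumulative count of these lines over time gives the kind of
graph referenced in that message.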