Hi, following up on the questions below.
Mon, Mar 31, 2025, 12:34 Vladislav Keda <vladislavk...@gmail.com>:

> Hi,
>
> Can you please help with the previous questions?
>
> Thanks in advance!
>
> Wed, Mar 26, 2025, 12:06 Vladislav Keda <vladislavk...@gmail.com>:
>
>> Hi,
>>
>> We have identified the cause of the problem: ConfigMaps on the Kubernetes
>> cluster were taking increasingly longer to update as the number of jobs
>> grew (due to the expanding size of the ConfigMaps). The default value
>> *high-availability.kubernetes.leader-election.retry-period=5s* was too
>> frequent for our case, leading to an unbounded increase in threads within
>> *Executors.newCachedThreadPool(new ExecutorThreadFactory("config-map-watch-handler"))*,
>> which is initialized by *KubernetesLeaderElectionHaServices*. This caused
>> uncontrolled thread growth and a rise in JM memory usage. Increasing
>> *high-availability.kubernetes.leader-election.retry-period* to 50s
>> resolved the problem (see the sketches after the quoted thread below).
>>
>> During the analysis of this case, several additional questions arose
>> regarding task recovery:
>> 1. Based on the JobManager logs, tasks on the cluster appear to recover
>> sequentially (if our analysis is correct). Is there a way to parallelize
>> the recovery process, or are there configuration options to improve
>> recovery performance?
>> 2. During JobManager restarts, we occasionally observe uncontrolled
>> creation of TaskManagers exceeding the required number. These TaskManagers
>> eventually become unnecessary (all their slots are free) and are removed
>> after a timeout. Why does this happen? We tried the
>> *resourcemanager.previous-worker.recovery.timeout* option, but it had no
>> effect.
>>
>> Tue, Mar 18, 2025 at 07:59, Vladislav Keda <vladislavk...@gmail.com>:
>>
>>> Hi,
>>>
>>> We use a Native Kubernetes Session cluster with Kubernetes HA enabled for
>>> running Flink streaming jobs. Over time, the number of running jobs
>>> approached 100, and we noticed that the recovery time (when the JM crashes)
>>> grows exponentially with the number of jobs. Below are graphs of when
>>> *"Job ... switched from state CREATED to RUNNING"* appears in the JM logs
>>> when running 80 jobs in different namespaces:
>>>
>>> [images: two graphs of these log-entry timings during recovery of 80 jobs]
>>>
>>> When submitting jobs manually (without HA recovery), we see linear growth
>>> up to around 200 jobs: starting 50 jobs takes about 15 minutes on average,
>>> 100 jobs about 30 minutes, and 200 jobs about an hour.
>>>
>>> While the jobs are recovering, the JM and TM resources (CPU and RAM) stay
>>> below half of their limits. The last checkpoint of each job (from which it
>>> is restored) is no larger than 20 MB, and downloading it from S3 takes no
>>> more than 5 seconds.
>>>
>>> So why do we see an exponential increase in job recovery time? Do you need
>>> any additional information?
>>>
>>> P.S. We use Flink 1.18.1
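
---

For anyone hitting the same symptom, here is a minimal, JDK-only sketch of the growth mechanism described above. It is not Flink's leader-election code: the class name, the "watch-handler-N" thread names, and the scaled-down timings are all made up for illustration. It only shows the general contract of Executors.newCachedThreadPool: when work is submitted on a fixed period but each item takes longer than the last (like ConfigMap updates that slow down as the ConfigMaps grow), idle workers are rarely available for reuse, so the pool keeps creating new threads.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedPoolGrowthDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();

        // Unbounded pool, analogous to the "config-map-watch-handler" executor:
        // a new thread is created whenever no idle worker is available.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "watch-handler-" + created.incrementAndGet());
            t.setDaemon(true);
            return t;
        });

        long submitPeriodMs = 50; // stands in for the 5 s retry period (scaled down)
        for (int i = 1; i <= 60; i++) {
            long taskDurationMs = 100L * i; // each "ConfigMap update" is slower than the last
            pool.submit(() -> {
                try {
                    Thread.sleep(taskDurationMs); // simulated slow ConfigMap write
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread.sleep(submitPeriodMs);
            if (i % 10 == 0) {
                System.out.println("after " + i + " submissions, threads created: " + created.get());
            }
        }

        pool.shutdownNow();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}

Running it prints a thread count that keeps climbing with the number of submissions, which matches the JM thread and memory growth reported above; with a longer submit period (the larger retry-period) the same loop reuses idle workers and the count stays flat.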
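For reference, the two options discussed in this thread, shown as a small programmatic sketch with Flink's Configuration class. In a native Kubernetes session deployment these keys would normally go into flink-conf.yaml or be passed as -D options instead; the 50 s value is the one reported above, and the 5 min value is purely illustrative, not a recommendation.

import org.apache.flink.configuration.Configuration;

public class RecoveryTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Less frequent leader-election retries, so that slow updates of large
        // HA ConfigMaps can keep up (the 50 s value is the one from this thread).
        conf.setString("high-availability.kubernetes.leader-election.retry-period", "50 s");

        // Option mentioned above in the TaskManager over-provisioning question;
        // the 5 min value here is only a placeholder.
        conf.setString("resourcemanager.previous-worker.recovery.timeout", "5 min");

        System.out.println(conf);
    }
}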