Hi,

Could you please help with the previous questions?

Thanks in advance!

Wed, 26 Mar 2025, 12:06 Vladislav Keda <vladislavk...@gmail.com>:

> Hi,
>
> We have identified the cause of the problem: ConfigMap updates on the
> Kubernetes cluster were taking increasingly longer as the number of jobs
> grew (because the ConfigMaps themselves kept growing). The default value
> *high-availability.kubernetes.leader-election.retry-period=5s* was too
> frequent for our case, leading to unbounded thread growth inside
> *Executors.newCachedThreadPool(new
> ExecutorThreadFactory("config-map-watch-handler"))*, which is initialized
> by *KubernetesLeaderElectionHaServices*. That uncontrolled thread growth
> in turn drove up JM memory usage. Increasing
> *high-availability.kubernetes.leader-election.retry-period* to 50s
> resolved the problem.
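>
> For reference, a minimal sketch of what we changed in flink-conf.yaml
> (the 50s value is simply what worked for our cluster size, not a general
> recommendation, and the storage path is a placeholder):
>
>     # Kubernetes HA with a less aggressive leader-election retry interval
>     high-availability.type: kubernetes
>     high-availability.storageDir: s3://<bucket>/flink-ha
>     high-availability.kubernetes.leader-election.retry-period: 50 s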
>
> During the analysis of this case, several additional questions arose
> regarding task recovery:
> 1. Based on the JobManager logs, tasks on the cluster appear to recover
> sequentially (if our analysis is correct). Is there a way to parallelize
> the recovery process, or are there configuration options to improve
> recovery performance?
> 2. During JobManager restarts, we occasionally observe TaskManagers being
> created in excess of the required number. These extra TaskManagers
> eventually end up with all their slots free and are removed after a
> timeout. Why does this happen? We tried the
> *resourcemanager.previous-worker.recovery.timeout* option, but it had no
> effect.
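>
> In case it helps, this is roughly how we set it (the 5 min value was an
> arbitrary choice for the test, not a tuned one):
>
>     resourcemanager.previous-worker.recovery.timeout: 5 min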
>
> Tue, 18 Mar 2025 at 07:59, Vladislav Keda <vladislavk...@gmail.com>:
>
>> Hi,
>>
>> We use a Native Kubernetes Session cluster with Kubernetes HA enabled to
>> run Flink streaming jobs. Over time, the number of running jobs
>> approached 100, and we noticed that the recovery time (when the JM
>> crashes) grows exponentially with the number of jobs. Below are graphs
>> showing when *"Job ... switched from state CREATED to RUNNING"* appears
>> in the JM logs for 80 jobs running in different namespaces:
>> [image: telegram-cloud-photo-size-2-5388867639654347804-y.jpg][image:
>> telegram-cloud-photo-size-2-5388867639654347805-y.jpg]
>>
>> When manually submitting jobs (without HA recovery), we see linear
>> growth up to about 200 jobs: starting 50 jobs takes roughly 15 minutes
>> on average, 100 jobs about 30 minutes, and 200 jobs about an hour.
>>
>> While jobs are recovering, the JM and TM resources (CPU and RAM) are
>> occupied by no more than half of their limits. The last checkpoint of
>> each job (from which it is restored) is no larger than 20 MB, and
>> downloading a checkpoint from s3 takes no more than 5 seconds.
>>
>> So why do we see an exponential increase in job recovery time? Do you
>> need any additional information?
>>
>> P.S. We use Flink 1.18.1
>>
>
