Hi, following up on the questions below.
Mon, Mar 31, 2025, 12:34 Vladislav Keda <vladislavk...@gmail.com>:

> Hi,
>
> Can you please help with the previous questions?
>
> Thanks in advance!
>
> Wed, Mar 26, 2025, 12:06 Vladislav Keda <vladislavk...@gmail.com>:
>
>> Hi,
>>
>> We have identified the cause of the problem: ConfigMaps on the Kubernetes
>> cluster were taking increasingly longer to update as the number of jobs
>> grew (due to the expanding size of the ConfigMaps). The default value
>> *high-availability.kubernetes.leader-election.retry-period=5s* was too
>> frequent for our case, leading to an unbounded increase in threads within
>> *Executors.newCachedThreadPool(new ExecutorThreadFactory("config-map-watch-handler"))*,
>> which is initialized by *KubernetesLeaderElectionHaServices*. This caused
>> uncontrolled thread growth and a rise in JM memory usage. Increasing
>> *high-availability.kubernetes.leader-election.retry-period* to 50s
>> resolved the problem (see the sketches after the quoted thread below).
>>
>> During the analysis of this case, several additional questions arose
>> regarding task recovery:
>> 1. Based on the JobManager logs, tasks on the cluster appear to recover
>> sequentially (if our analysis is correct). Is there a way to parallelize
>> the recovery process, or are there configuration options to improve
>> recovery performance?
>> 2. During JobManager restarts, we occasionally observe uncontrolled
>> creation of TaskManagers exceeding the required number. These TaskManagers
>> eventually become unnecessary (all their slots are free) and are removed
>> after a timeout. Why does this happen? We tried the
>> *resourcemanager.previous-worker.recovery.timeout* option, but it had no
>> effect.
>>
>> Tue, Mar 18, 2025 at 07:59, Vladislav Keda <vladislavk...@gmail.com>:
>>
>>> Hi,
>>>
>>> We use a Native Kubernetes Session cluster with Kubernetes HA enabled for
>>> running Flink streaming jobs. Over time, the number of running jobs
>>> approached 100, and we noticed that the recovery time (when the JM crashes)
>>> grows exponentially with the number of jobs. Below are graphs of when
>>> *"Job ... switched from state CREATED to RUNNING"* appears in the JM logs
>>> when running 80 jobs in different namespaces:
>>>
>>> [images: two graphs of these log-entry timings during recovery of 80 jobs]
>>>
>>> When submitting jobs manually (without HA recovery), we see linear growth
>>> up to around 200 jobs: starting 50 jobs takes about 15 minutes on average,
>>> 100 jobs about 30 minutes, and 200 jobs about an hour.
>>>
>>> While the jobs are recovering, the JM and TM resources (CPU and RAM) stay
>>> below half of their limits. The last checkpoint of each job (from which it
>>> is restored) is no larger than 20 MB, and downloading it from S3 takes no
>>> more than 5 seconds.
>>>
>>> So why do we see an exponential increase in job recovery time? Do you need
>>> any additional information?
>>>
>>> P.S. We use Flink 1.18.1
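
---

For anyone hitting the same symptom, here is a minimal, JDK-only sketch of the growth mechanism described above. It is not Flink's leader-election code: the class name, the "watch-handler-N" thread names, and the scaled-down timings are all made up for illustration. It only shows the general contract of Executors.newCachedThreadPool: when work is submitted on a fixed period but each item takes longer than the last (like ConfigMap updates that slow down as the ConfigMaps grow), idle workers are rarely available for reuse, so the pool keeps creating new threads.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class CachedPoolGrowthDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicInteger created = new AtomicInteger();

        // Unbounded pool, analogous to the "config-map-watch-handler" executor:
        // a new thread is created whenever no idle worker is available.
        ExecutorService pool = Executors.newCachedThreadPool(r -> {
            Thread t = new Thread(r, "watch-handler-" + created.incrementAndGet());
            t.setDaemon(true);
            return t;
        });

        long submitPeriodMs = 50; // stands in for the 5 s retry period (scaled down)
        for (int i = 1; i <= 60; i++) {
            long taskDurationMs = 100L * i; // each "ConfigMap update" is slower than the last
            pool.submit(() -> {
                try {
                    Thread.sleep(taskDurationMs); // simulated slow ConfigMap write
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            Thread.sleep(submitPeriodMs);
            if (i % 10 == 0) {
                System.out.println("after " + i + " submissions, threads created: " + created.get());
            }
        }

        pool.shutdownNow();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}

Running it prints a thread count that keeps climbing with the number of submissions, which matches the JM thread and memory growth reported above; with a longer submit period (the larger retry-period) the same loop reuses idle workers and the count stays flat.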
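For reference, the two options discussed in this thread, shown as a small programmatic sketch with Flink's Configuration class. In a native Kubernetes session deployment these keys would normally go into flink-conf.yaml or be passed as -D options instead; the 50 s value is the one reported above, and the 5 min value is purely illustrative, not a recommendation.

import org.apache.flink.configuration.Configuration;

public class RecoveryTuningSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();

        // Less frequent leader-election retries, so that slow updates of large
        // HA ConfigMaps can keep up (the 50 s value is the one from this thread).
        conf.setString("high-availability.kubernetes.leader-election.retry-period", "50 s");

        // Option mentioned above in the TaskManager over-provisioning question;
        // the 5 min value here is only a placeholder.
        conf.setString("resourcemanager.previous-worker.recovery.timeout", "5 min");

        System.out.println(conf);
    }
}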