Hi,

We have identified the cause of the problem: ConfigMap updates on the Kubernetes cluster were taking increasingly longer as the number of jobs grew (because the ConfigMaps themselves kept growing). The default *high-availability.kubernetes.leader-election.retry-period=5s* turned out to be too frequent for our case: work arrived at *Executors.newCachedThreadPool(new ExecutorThreadFactory("config-map-watch-handler"))* (created by *KubernetesLeaderElectionHaServices*) faster than it could be processed, and since a cached thread pool has no upper bound on its size, the number of threads grew without limit and JM memory usage rose with it. Increasing *high-availability.kubernetes.leader-election.retry-period* to 50s resolved the problem.
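For anyone who wants to reproduce the thread-growth effect in isolation, below is a minimal standalone sketch (plain java.util.concurrent, not Flink code; the class name, durations, and retry period are made up for illustration). It models the situation we hit: the "update" handled by the pool keeps getting slower while the submission interval stays fixed, and because a cached thread pool has no size limit, poolSize keeps climbing instead of stabilizing.

import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class CachedPoolGrowthDemo {
    public static void main(String[] args) throws InterruptedException {
        // newCachedThreadPool() has no upper bound on pool size: whenever a task is
        // submitted and no idle thread is available, a new thread is created.
        ThreadPoolExecutor pool =
                (ThreadPoolExecutor) Executors.newCachedThreadPool();

        long retryPeriodMs = 200; // stands in for a (too short) leader-election retry period
        for (int i = 1; i <= 50; i++) {
            long updateMs = 100L * i; // the simulated "ConfigMap update" gets slower over time
            pool.submit(() -> {
                try {
                    TimeUnit.MILLISECONDS.sleep(updateMs);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            // Pool size grows as more and more updates overlap.
            System.out.println("submitted=" + i + " poolSize=" + pool.getPoolSize());
            TimeUnit.MILLISECONDS.sleep(retryPeriodMs);
        }
        pool.shutdown();
    }
}

In our cluster the same dynamic stopped once the retry period (50s) comfortably exceeded the time needed to process a ConfigMap update.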
During the analysis of this case, several additional questions arose regarding task recovery:

1. Based on the JobManager logs, tasks on the cluster appear to recover sequentially (if our analysis is correct). Is there a way to parallelize the recovery process, or are there configuration options that improve recovery performance?

2. During JobManager restarts, we occasionally observe uncontrolled creation of TaskManagers beyond the required number. These TaskManagers eventually turn out to be unnecessary (all their slots are free) and are removed after a timeout. Why does this happen? We tried the *resourcemanager.previous-worker.recovery.timeout* option, but it had no effect.

Tue, 18 Mar 2025 at 07:59, Vladislav Keda <vladislavk...@gmail.com>:

> Hi,
>
> We use a Native Kubernetes Session cluster with Kubernetes HA enabled for
> running Flink streaming jobs. Over time, the number of running jobs
> approached 100, and we noticed that the recovery time (when the JM crashes)
> grows exponentially with the number of jobs. Below are graphs of when
> *"Job ... switched from state CREATED to RUNNING"* appears in the JM logs
> while running 80 jobs in different namespaces:
> [attached: two graphs of these log timestamps]
>
> When submitting jobs manually (without HA recovery), we see linear growth
> up to about 200 jobs: starting 50 jobs takes about 15 minutes on average,
> 100 jobs about 30 minutes, and 200 jobs about an hour.
>
> While jobs are recovering, the JM and TM resources (CPU and RAM) stay
> below half of their limits. The last checkpoint of each job (from which it
> is restored) is no larger than 20 MB, and downloading it from S3 takes no
> more than 5 seconds.
>
> So why do we see an exponential increase in job recovery time? Do you need
> any additional information?
>
> P.S. We use Flink 1.18.1
>