[ 
https://issues.apache.org/jira/browse/FLINK-33880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yangze Guo reassigned FLINK-33880:
----------------------------------

    Assignee: Yuan Huang 

> Introducing Retry Mechanism for Listing TaskManager Pods to Prevent API 
> Server Connection Failures
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33880
>                 URL: https://issues.apache.org/jira/browse/FLINK-33880
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.2
>            Reporter: Yuan Huang 
>            Assignee: Yuan Huang 
>            Priority: Major
>         Attachments: image-2023-12-19-18-41-41-308.png, 
> image-2023-12-19-18-44-13-623.png, image-2023-12-21-10-12-37-667.png
>
>
> In Kubernetes mode, a restarted JobManager connects to the API server and 
> lists all TaskManager Pods so that it can recover the previous 
> TaskManagers.
> In a large Kubernetes cluster running thousands of jobs concurrently, all 
> JobManagers may restart at the same time (e.g., during disaster recovery) 
> and connect to the API server at once. This burst of requests can exceed 
> the API server's capacity, causing it to refuse some of the JobManagers' 
> requests. The affected JobManagers may then fail and try to reconnect to 
> the API server.
> !image-2023-12-21-10-12-37-667.png|width=609,height=305!
> !image-2023-12-19-18-44-13-623.png|width=505,height=206!
> To improve this, we propose a retry mechanism: when a connection attempt 
> to the API server fails, Flink waits for a period before trying again. 
> This reduces the risk of overwhelming the server and improves the overall 
> resilience of the system.
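
The retry-with-waiting-period idea proposed above could be sketched roughly as follows. This is a minimal illustration only, assuming exponential backoff with random jitter (the issue itself asks only for a waiting period between attempts). The names ListPodsRetrySketch and callWithRetry, and all parameters, are hypothetical and are not Flink or Kubernetes client APIs; the action passed in stands in for whatever call lists the TaskManager Pods.

```java
import java.time.Duration;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper for illustration; not part of Flink. */
final class ListPodsRetrySketch {

    /**
     * Runs the action until it succeeds or maxAttempts is reached,
     * sleeping between attempts. The delay doubles after each failure
     * (capped at maxDelay) and is randomized so that many JobManagers
     * restarting together do not retry at the same instant.
     */
    static <T> T callWithRetry(
            Callable<T> action,
            int maxAttempts,
            Duration initialDelay,
            Duration maxDelay) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelay.toMillis();
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (InterruptedException e) {
                // Do not retry on interruption; restore the flag and give up.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                if (attempt == maxAttempts) {
                    break;
                }
                // Sleep for a random time in [delayMs / 2, delayMs].
                long jitteredMs = delayMs / 2
                        + ThreadLocalRandom.current().nextLong(delayMs / 2 + 1);
                Thread.sleep(jitteredMs);
                delayMs = Math.min(delayMs * 2, maxDelay.toMillis());
            }
        }
        // All attempts failed: surface the last failure to the caller.
        throw last;
    }
}
```

A caller would wrap the pod-listing request in the action, for example callWithRetry(() -> listTaskManagerPods(), 5, Duration.ofSeconds(1), Duration.ofSeconds(30)), where listTaskManagerPods is again a placeholder. The jitter is a design choice rather than something the issue specifies: without it, JobManagers that failed together would also retry together.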



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
