[ https://issues.apache.org/jira/browse/FLINK-33880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801648#comment-17801648 ]
Yangze Guo commented on FLINK-33880:
------------------------------------

I think it's a valid issue, go ahead.

> Introducing Retry Mechanism for Listing TaskManager Pods to Prevent API
> Server Connection Failures
> --------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-33880
>                 URL: https://issues.apache.org/jira/browse/FLINK-33880
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.2
>            Reporter: Yuan Huang
>            Assignee: Yuan Huang
>            Priority: Major
>         Attachments: image-2023-12-19-18-41-41-308.png,
> image-2023-12-19-18-44-13-623.png, image-2023-12-21-10-12-37-667.png
>
>
> When running in Kubernetes mode, a JobManager that restarts connects to the
> API server to retrieve the complete list of TaskManager Pods so that it can
> recover the previous TaskManagers.
> In a large Kubernetes cluster with potentially thousands of concurrently
> running jobs, all JobManagers may restart and connect to the API server at
> the same time (e.g., during disaster recovery). This influx of requests may
> push the API server to its maximum capacity, causing it to refuse some
> JobManager requests. The affected JobManagers then fail and try to reconnect
> to the API server.
> !image-2023-12-21-10-12-37-667.png|width=609,height=305!
> !image-2023-12-19-18-44-13-623.png|width=505,height=206!
> To improve this, we propose introducing a retry mechanism. When a connection
> attempt to the API server fails, Flink would wait for a period before making
> the next attempt, which reduces the risk of overwhelming the server and
> improves the overall resilience of the system.
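
For illustration only, the retry idea described in the issue could be sketched roughly as below. This is not the change made in Flink. The class name RetryingPodLister, the method callWithRetry, and all of its parameters are hypothetical, and the call that lists the pods is passed in as a plain Callable rather than a real Kubernetes client call. The doubling delay and the random jitter are assumptions added here, since the issue only asks for a waiting period between attempts.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of Flink: retries an action with a growing, jittered delay.
final class RetryingPodLister {

    /**
     * Runs the action until it succeeds or maxAttempts is reached.
     * After each failed attempt except the last, sleeps for a random time between
     * half of the current delay and the full delay, then doubles the delay up to maxDelayMs.
     */
    static <T> T callWithRetry(
            Callable<T> action, int maxAttempts, long initialDelayMs, long maxDelayMs)
            throws Exception {
        if (maxAttempts < 1 || initialDelayMs < 0 || maxDelayMs < initialDelayMs) {
            throw new IllegalArgumentException("invalid retry settings");
        }
        long delayMs = initialDelayMs;
        Exception lastFailure = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                lastFailure = e;
                if (attempt == maxAttempts) {
                    break; // no attempts left, give up
                }
                // Jitter spreads out retries from JobManagers that failed at the same moment.
                long half = delayMs / 2;
                long sleepMs = half + ThreadLocalRandom.current().nextLong(delayMs - half + 1);
                Thread.sleep(sleepMs);
                delayMs = Math.min(delayMs * 2, maxDelayMs);
            }
        }
        throw lastFailure;
    }
}
```

A caller would wrap whatever code lists the TaskManager Pods in the Callable, for example callWithRetry(() -> listPods(), 5, 1000L, 30000L), where listPods() stands in for the real listing call. Whether the number of attempts and the delays should be configurable is left open here.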