[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638 ]
xiaogang zhou edited comment on FLINK-33728 at 1/9/24 8:52 AM:
---------------------------------------------------------------

[~xtsong] In the default Flink setting, when the KubernetesClient disconnects from the Kube API server it retries the connection indefinitely, because kubernetes.watch.reconnectLimit is -1. However, the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes the normal reconnect loop. This then causes Flink's FlinkKubeClient to retry creating the watch kubernetes.transactional-operation.max-retries times, and these retries have no interval between them. If the watcher does not recover, the JM kills itself.

So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs re-creating watches at the same time, but also to allow Flink to keep running even when the Kube API server is in a degraded state, since most of the time Flink TMs do not need to be affected by a bad API server.

If you think it is not acceptable to recover the watcher only when resources are requested, another possible way is to retry re-watching the pods periodically. WDYT? :)

was (Author: zhoujira86):
[~xtsong] In the default Flink setting, when the KubernetesClient disconnects from the Kube API server it retries the connection indefinitely, because kubernetes.watch.reconnectLimit is -1. However, the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes the normal reconnect loop. This then causes Flink's FlinkKubeClient to retry creating the watch kubernetes.transactional-operation.max-retries times. If the watcher does not recover, the JM kills itself.

So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs re-creating watches at the same time, but also to allow Flink to keep running even when the Kube API server is in a degraded state, since most of the time Flink TMs do not need to be affected by a bad API server.

If you think it is not acceptable to recover the watcher only when resources are requested, another possible way is to retry re-watching the pods periodically. WDYT? :)

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I hit a massive production problem when Kubernetes etcd was responding slowly. After Kube recovered an hour later, thousands of Flink jobs using KubernetesResourceManagerDriver re-watched on receiving ResourceVersionTooOld, which put great pressure on the API server and made it fail again...
>
> I am not sure whether it is necessary to call getResourceEventHandler().onError(throwable) in the PodCallbackHandlerImpl#handleError method.
>
> We could simply ignore the disconnection of the watching process and try to rewatch once a new requestResource is called. We can rely on the Akka heartbeat timeout to discover TM failures, just as YARN mode does.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
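As a rough illustration of the "retry to rewatch pods periodically" idea discussed in the comment above, a minimal sketch might look like the following. The PeriodicPodRewatcher class and the PodWatchTarget interface are hypothetical placeholders, not existing Flink or fabric8 APIs; the sketch only assumes that some component can attempt to re-create the pod watch and report whether it succeeded.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch: instead of failing the JobManager when the pod watch
 * cannot be re-established, keep retrying on a fixed interval until the
 * API server is healthy again. All names here are hypothetical placeholders.
 */
public final class PeriodicPodRewatcher {

    /** Hypothetical hook into "create a pod watch"; returns true on success. */
    public interface PodWatchTarget {
        boolean tryCreatePodWatch();
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private final PodWatchTarget target;
    private final long retryIntervalSeconds;

    public PeriodicPodRewatcher(PodWatchTarget target, long retryIntervalSeconds) {
        this.target = target;
        this.retryIntervalSeconds = retryIntervalSeconds;
    }

    /** Called when the watcher is closed, e.g. with ResourceVersionTooOld. */
    public void onWatchLost() {
        scheduler.schedule(this::attemptRewatch, retryIntervalSeconds, TimeUnit.SECONDS);
    }

    private void attemptRewatch() {
        // Keep the JobManager alive and simply try again later while the
        // API server is unhealthy, instead of escalating to a fatal error.
        if (!target.tryCreatePodWatch()) {
            scheduler.schedule(this::attemptRewatch, retryIntervalSeconds, TimeUnit.SECONDS);
        }
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
{code}

With a spread-out retry interval (or some per-job jitter), such a scheme would also avoid thousands of jobs re-creating their watches at exactly the same moment, which is the API-server pressure described in the ticket.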