[ https://issues.apache.org/jira/browse/FLINK-33728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804638#comment-17804638 ]
xiaogang zhou edited comment on FLINK-33728 at 1/9/24 8:52 AM:
---------------------------------------------------------------

[~xtsong] In the default Flink setting, when the KubernetesClient disconnects from the Kube API server it retries the connection indefinitely, because kubernetes.watch.reconnectLimit is -1. However, the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes the normal reconnect loop. This then causes Flink's FlinkKubeClient to retry creating the watch kubernetes.transactional-operation.max-retries times, and these retries have no interval between them. If the watcher does not recover, the JM kills itself.

So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs re-creating watches at the same time, but also to allow Flink to keep running even when the Kube API server is in a degraded state, since most of the time Flink TMs do not need to be affected by a bad API server.

If you think it is not acceptable to recover the watcher only when resources are requested, another possible way is to retry re-watching the pods periodically. WDYT? :)

was (Author: zhoujira86):
[~xtsong] In the default Flink setting, when the KubernetesClient disconnects from the Kube API server it retries the connection indefinitely, because kubernetes.watch.reconnectLimit is -1. However, the KubernetesClient treats ResourceVersionTooOld as a special exception that escapes the normal reconnect loop. This then causes Flink's FlinkKubeClient to retry creating the watch kubernetes.transactional-operation.max-retries times. If the watcher does not recover, the JM kills itself.

So I think the problem we are trying to solve is not only to avoid massive numbers of Flink jobs re-creating watches at the same time, but also to allow Flink to keep running even when the Kube API server is in a degraded state, since most of the time Flink TMs do not need to be affected by a bad API server.

If you think it is not acceptable to recover the watcher only when resources are requested, another possible way is to retry re-watching the pods periodically. WDYT? :)

> do not rewatch when KubernetesResourceManagerDriver watch fail
> --------------------------------------------------------------
>
>                 Key: FLINK-33728
>                 URL: https://issues.apache.org/jira/browse/FLINK-33728
>             Project: Flink
>          Issue Type: New Feature
>          Components: Deployment / Kubernetes
>            Reporter: xiaogang zhou
>            Priority: Major
>              Labels: pull-request-available
>
> I hit a massive production problem when Kubernetes etcd was responding slowly. After Kube recovered an hour later, thousands of Flink jobs using KubernetesResourceManagerDriver re-watched on receiving ResourceVersionTooOld, which put great pressure on the API server and made it fail again...
>
> I am not sure whether it is necessary to call getResourceEventHandler().onError(throwable) in the PodCallbackHandlerImpl#handleError method.
>
> We could simply ignore the disconnection of the watching process and try to rewatch once a new requestResource is called. We can rely on the Akka heartbeat timeout to discover TM failures, just as YARN mode does.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
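As a rough illustration of the "retry to rewatch pods periodically" idea discussed in the comment above, a minimal sketch might look like the following. The PeriodicPodRewatcher class and the PodWatchTarget interface are hypothetical placeholders, not existing Flink or fabric8 APIs; the sketch only assumes that some component can attempt to re-create the pod watch and report whether it succeeded.

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/**
 * Illustrative sketch: instead of failing the JobManager when the pod watch
 * cannot be re-established, keep retrying on a fixed interval until the
 * API server is healthy again. All names here are hypothetical placeholders.
 */
public final class PeriodicPodRewatcher {

    /** Hypothetical hook into "create a pod watch"; returns true on success. */
    public interface PodWatchTarget {
        boolean tryCreatePodWatch();
    }

    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    private final PodWatchTarget target;
    private final long retryIntervalSeconds;

    public PeriodicPodRewatcher(PodWatchTarget target, long retryIntervalSeconds) {
        this.target = target;
        this.retryIntervalSeconds = retryIntervalSeconds;
    }

    /** Called when the watcher is closed, e.g. with ResourceVersionTooOld. */
    public void onWatchLost() {
        scheduler.schedule(this::attemptRewatch, retryIntervalSeconds, TimeUnit.SECONDS);
    }

    private void attemptRewatch() {
        // Keep the JobManager alive and simply try again later while the
        // API server is unhealthy, instead of escalating to a fatal error.
        if (!target.tryCreatePodWatch()) {
            scheduler.schedule(this::attemptRewatch, retryIntervalSeconds, TimeUnit.SECONDS);
        }
    }

    public void shutdown() {
        scheduler.shutdownNow();
    }
}
{code}

With a spread-out retry interval (or some per-job jitter), such a scheme would also avoid thousands of jobs re-creating their watches at exactly the same moment, which is the API-server pressure described in the ticket.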