[ 
https://issues.apache.org/jira/browse/FLINK-20417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17276004#comment-17276004
 ] 

Emilien Kenler commented on FLINK-20417:
----------------------------------------

This issue happens even when the APIServer is not restarted.
We are running Kubernetes on Amazon EKS, and we see this exception about once 
per hour, causing our JobManagers to restart.

```
2021-01-31 21:57:54,523 ERROR org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Fatal error occurred in ResourceManager.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 292832312 (294269953)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
2021-01-31 21:57:54,528 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - Fatal error occurred in the cluster entrypoint.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 292832312 (294269953)
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
```

 

 

> Handle "Too old resource version" exception in Kubernetes watch more 
> gracefully
> -------------------------------------------------------------------------------
>
>                 Key: FLINK-20417
>                 URL: https://issues.apache.org/jira/browse/FLINK-20417
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.11.2, 1.12.0
>            Reporter: Yang Wang
>            Priority: Major
>             Fix For: 1.13.0
>
>
> Currently, when a watcher (pods watcher, configmap watcher) is closed with 
> an exception, we call {{WatchCallbackHandler#handleFatalError}}. This can 
> cause the JobManager to terminate and then fail over.
> In most cases this is correct, but not for the "too old resource version" 
> exception. See more information here[1]. This exception usually happens 
> when the APIServer is restarted. In that case we just need to create a new 
> watch and continue watching the pods/configmaps. This would reduce the 
> impact of a K8s cluster restart on the Flink cluster.
>  
> This issue was inspired by this technical article[2]. Thanks to the folks 
> from Tencent for the debugging. Note that the article is written in Chinese.
>  
> [1]. 
> [https://stackoverflow.com/questions/61409596/kubernetes-too-old-resource-version]
> [2]. [https://cloud.tencent.com/developer/article/1731416]
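
To illustrate the behaviour proposed in the description above, here is a minimal, self-contained sketch. It is not Flink's or fabric8's actual code. The names WatchRetrySketch, WatchClosedException, Handler and RecordingHandler are hypothetical, and the sketch assumes that "too old resource version" reaches the close callback with HTTP status 410 (Gone).

```java
import java.net.HttpURLConnection;
import java.util.ArrayList;
import java.util.List;

public class WatchRetrySketch {

    /** Hypothetical stand-in for the client exception; carries the HTTP status code. */
    public static class WatchClosedException extends RuntimeException {
        public final int code;

        public WatchClosedException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /** Hypothetical callback target: either re-create the watch or fail fatally. */
    public interface Handler {
        void recreateWatch();

        void handleFatalError(Throwable cause);
    }

    /** Assumption: an expired resource version is reported as 410 Gone. */
    public static boolean isTooOldResourceVersion(WatchClosedException e) {
        return e.code == HttpURLConnection.HTTP_GONE;
    }

    /** Decide what to do when a watch is closed. */
    public static void onClose(WatchClosedException cause, Handler handler) {
        if (cause == null) {
            return; // closed normally, nothing to do
        }
        if (isTooOldResourceVersion(cause)) {
            handler.recreateWatch(); // recoverable: start a fresh watch
        } else {
            handler.handleFatalError(cause); // everything else stays fatal
        }
    }

    /** Records the decisions so they can be inspected. */
    public static class RecordingHandler implements Handler {
        public final List<String> events = new ArrayList<>();

        @Override
        public void recreateWatch() {
            events.add("recreate");
        }

        @Override
        public void handleFatalError(Throwable cause) {
            events.add("fatal: " + cause.getMessage());
        }
    }

    public static void main(String[] args) {
        RecordingHandler handler = new RecordingHandler();
        onClose(new WatchClosedException(410,
                "too old resource version: 292832312 (294269953)"), handler);
        onClose(new WatchClosedException(401, "Unauthorized"), handler);
        System.out.println(handler.events);
    }
}
```

In a real deployment the re-created watch would also need to start from a current resource version, for example by listing the resources again first. That step is left out of the sketch.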



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
