Hi All,

On long-lived session clusters we are seeing a Kubernetes error, `Error while watching the ConfigMap`, which brings down the JobManager. The good news is that the `too old resource version` issue looks to be fixed :).
Logs are attached below. Any tips?

Best,
Kevin

2021-02-11 07:55:15,249 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job 58ec7a029cd31ad057e25479a9979cb4 (202852094 bytes in 49274 ms).
2021-02-11 08:00:15,732 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=CHECKPOINT) @ 1613030415249 for job 58ec7a029cd31ad057e25479a9979cb4.
2021-02-11 08:00:25,446 ERROR org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,456 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
    at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,487 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124