Hi Kevin,

Unfortunately, the root cause of the error is missing from the logs. I can only guess, but it could indeed be FLINK-20417 [1]. If so, the problem should be fixed in the upcoming Flink 1.12.2 release, which will hopefully be out next week. If it turns out to be a different problem, we will know more after the upgrade, because Flink 1.12.2 also fixes the swallowing of the root cause. So I would highly recommend upgrading once the next bug fix release is out.

[1] https://issues.apache.org/jira/browse/FLINK-20417

Cheers,
Till

On Thu, Feb 11, 2021 at 9:21 AM Bohinski, Kevin <kevin_bohin...@comcast.com> wrote:

> Hi All,
>
> On long lived session clusters we are seeing a k8s error `Error while watching the ConfigMap`.
> Good news is it looks like `too old resource version` issue is fixed :).
>
> Logs are attached below. Any tips?
>
> best
> Kevin
>
>
> 2021-02-11 07:55:15,249 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job 58ec7a029cd31ad057e25479a9979cb4 (202852094 bytes in 49274 ms).
> 2021-02-11 08:00:15,732 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=CHECKPOINT) @ 1613030415249 for job 58ec7a029cd31ad057e25479a9979cb4.
> 2021-02-11 08:00:25,446 ERROR org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Fatal error occurred in ResourceManager.
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
>     at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-02-11 08:00:25,456 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
>     at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-02-11 08:00:25,487 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124