Hi Kevin,

Thanks for sharing the information. I will dig into it and create a ticket if necessary.
Best,
Yang

On Thu, Oct 29, 2020 at 2:35 AM Bohinski, Kevin <kevin_bohin...@comcast.com> wrote:

> Hi Yang,
>
> Thanks again for all the help!
>
> We are still seeing this with 1.11.2 and ZK.
> Looks like others are seeing this as well, and they found a solution:
> https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search
>
> Should this solution be added to 1.12?
>
> Best
> kevin
>
> On 2020/08/14 02:48:50, Yang Wang <d...@gmail.com> wrote:
>
> Hi Kevin,
>
> Thanks for sharing more information. You are right. Actually, "too old resource version" is caused by a bug in the fabric8 kubernetes-client[1]. It has been fixed in v4.6.1, and we have bumped the kubernetes-client version to v4.9.2 in Flink release-1.11. The fix has also been backported to release-1.10 and will be included in the next minor release (1.10.2).
>
> BTW, if you really want all your jobs to be recovered when the jobmanager crashes, you still need to configure ZooKeeper high availability.
>
> [1]. https://github.com/fabric8io/kubernetes-client/pull/1800
>
> Best,
> Yang
>
> On Fri, Aug 14, 2020 at 6:32 AM Bohinski, Kevin <ke...@comcast.com> wrote:
>
> Might be useful: https://stackoverflow.com/a/61437982
>
> Best,
> kevin
>
> From: "Bohinski, Kevin" <ke...@comcast.com>
> Date: Thursday, August 13, 2020 at 6:13 PM
> To: Yang Wang <da...@gmail.com>
> Cc: "user@flink.apache.org" <us...@flink.apache.org>
> Subject: Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
>
> Hi,
>
> Got the logs on the crash, hopefully they help.
>
> 2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
>
> 2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
>
> 2020-08-13 22:00:40,416 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
>
> Best,
> kevin
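For anyone else hitting this fatal error: as far as I understand, the workaround in the article linked above amounts to re-creating the Kubernetes watch when it is closed with a "too old resource version" error, instead of propagating it as a fatal error. Below is a minimal sketch of that re-watch pattern against the fabric8 4.x Watcher API; the class name, label selector and overall structure are illustrative only, not Flink's actual implementation.

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientException;
    import io.fabric8.kubernetes.client.Watch;
    import io.fabric8.kubernetes.client.Watcher;

    /** Pod watcher that re-subscribes instead of dying on "too old resource version". */
    public class ResilientPodWatcher implements Watcher<Pod> {

        private final KubernetesClient client;
        private final String labelKey;
        private final String labelValue;
        // Kept so the watch can be closed on shutdown.
        private volatile Watch watch;

        public ResilientPodWatcher(KubernetesClient client, String labelKey, String labelValue) {
            this.client = client;
            this.labelKey = labelKey;
            this.labelValue = labelValue;
        }

        public void start() {
            // No explicit resourceVersion: the API server starts the watch from its current state.
            this.watch = client.pods().withLabel(labelKey, labelValue).watch(this);
        }

        @Override
        public void eventReceived(Action action, Pod pod) {
            // React to ADDED / MODIFIED / DELETED TaskManager pod events here.
        }

        @Override
        public void onClose(KubernetesClientException cause) {
            // A non-null cause means an abnormal close, e.g. HTTP 410 "too old resource version".
            // Re-create the watch instead of escalating it to a fatal ResourceManager error.
            if (cause != null) {
                start();
            }
        }

        public static void main(String[] args) {
            KubernetesClient client = new DefaultKubernetesClient();
            new ResilientPodWatcher(client, "type", "flink-native-kubernetes").start();
        }
    }

Flink's own fix may look different internally, so treat this only as an illustration of the idea; upgrading to a kubernetes-client version that contains the fix referenced above is still the cleaner route.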
> From: Yang Wang <da...@gmail.com>
> Date: Sunday, August 9, 2020 at 10:29 PM
> To: "Bohinski, Kevin" <ke...@comcast.com>
> Cc: "user@flink.apache.org" <us...@flink.apache.org>
> Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
>
> Hi Kevin,
>
> I think you may not have set the high availability configuration in your native K8s session. Currently, we only support ZooKeeper HA, so you need to add the following configuration. Once HA is configured, the running jobs, checkpoints and other metadata are stored, and when the jobmanager fails over, all the jobs can be recovered. I have tested that it works properly.
>
> high-availability: zookeeper
> high-availability.zookeeper.quorum: zk-client:2181
> high-availability.storageDir: hdfs:///flink/recovery
>
> I know you may not have a ZooKeeper cluster. You could use a ZooKeeper K8s operator[1] to deploy a new one.
>
> Moreover, it is not very convenient to use ZooKeeper for HA, so native K8s HA support[2] is planned and we are trying to finish it in the next major release cycle (1.12).
>
> [1]. https://github.com/pravega/zookeeper-operator
> [2]. https://issues.apache.org/jira/browse/FLINK-12884
>
> Best,
> Yang
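A side note for anyone setting this up on the native Kubernetes integration: the same keys can also be passed as dynamic properties when the session is started, instead of baking them into flink-conf.yaml. A rough example, where the cluster id, ZooKeeper quorum and storage directory are placeholders for your own values:

    ./bin/kubernetes-session.sh \
        -Dkubernetes.cluster-id=my-flink-session \
        -Dhigh-availability=zookeeper \
        -Dhigh-availability.zookeeper.quorum=zk-client:2181 \
        -Dhigh-availability.storageDir=hdfs:///flink/recovery

The storage directory must be on a filesystem the jobmanager can reach from inside the cluster (HDFS, S3, etc.), because that is where the job graphs and checkpoint metadata needed for recovery are written.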
> On Fri, Aug 7, 2020 at 11:40 PM Bohinski, Kevin <ke...@comcast.com> wrote:
>
> Hi all,
>
> In our 1.11.1 native K8s session, after we submit a job it will run successfully for a few hours and then fail when the jobmanager pod restarts.
>
> Relevant logs after the restart are attached below. Any suggestions?
>
> Best
> kevin
>
> 2020-08-06 21:50:24,425 INFO org.apache.flink.kubernetes.KubernetesResourceManager [] - Recovered 32 pods from previous attempts, current attempt id is 2.
> 2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})
> 2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16
> 2020-08-06 21:50:24,615 INFO org.apache.flink.kubernetes.KubernetesResourceManager [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0
> 2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]
>     at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]
>     at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]
>     at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11
>
> [message truncated...]