Hi Kevin,

Thanks for sharing the information. I will dig into it and create a ticket if necessary.
Best,
Yang

On Thu, Oct 29, 2020 at 2:35 AM Bohinski, Kevin <kevin_bohin...@comcast.com> wrote:

> Hi Yang,
>
> Thanks again for all the help!
>
> We are still seeing this with 1.11.2 and ZK.
> Looks like others are seeing this as well, and they found a solution:
> https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search
>
> Should this solution be added to 1.12?
>
> Best
> kevin
>
> On 2020/08/14 02:48:50, Yang Wang <d...@gmail.com> wrote:
>
> Hi Kevin,
>
> Thanks for sharing more information. You are right. Actually, "too old resource version" is caused by a bug in the fabric8 kubernetes-client[1]. It has been fixed in v4.6.1, and we have bumped the kubernetes-client version to v4.9.2 in Flink release-1.11. The fix has also been backported to release-1.10 and will be included in the next minor release (1.10.2).
>
> BTW, if you really want all your jobs to be recovered when the jobmanager crashes, you still need to configure ZooKeeper high availability.
>
> [1]. https://github.com/fabric8io/kubernetes-client/pull/1800
>
> Best,
> Yang
>
> On Fri, Aug 14, 2020 at 6:32 AM Bohinski, Kevin <ke...@comcast.com> wrote:
>
> Might be useful: https://stackoverflow.com/a/61437982
>
> Best,
> kevin
>
> From: "Bohinski, Kevin" <ke...@comcast.com>
> Date: Thursday, August 13, 2020 at 6:13 PM
> To: Yang Wang <da...@gmail.com>
> Cc: "user@flink.apache.org" <us...@flink.apache.org>
> Subject: Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
>
> Hi,
>
> Got the logs on the crash, hopefully they help.
>
> 2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
>
> 2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
>     at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
>
> 2020-08-13 22:00:40,416 INFO org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124
>
> Best,
> kevin
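For anyone else hitting this fatal error: as far as I understand, the workaround in the article linked above amounts to re-creating the Kubernetes watch when it is closed with a "too old resource version" error, instead of propagating it as a fatal error. Below is a minimal sketch of that re-watch pattern against the fabric8 4.x Watcher API; the class name, label selector and overall structure are illustrative only, not Flink's actual implementation.

    import io.fabric8.kubernetes.api.model.Pod;
    import io.fabric8.kubernetes.client.DefaultKubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClient;
    import io.fabric8.kubernetes.client.KubernetesClientException;
    import io.fabric8.kubernetes.client.Watch;
    import io.fabric8.kubernetes.client.Watcher;

    /** Pod watcher that re-subscribes instead of dying on "too old resource version". */
    public class ResilientPodWatcher implements Watcher<Pod> {

        private final KubernetesClient client;
        private final String labelKey;
        private final String labelValue;
        // Kept so the watch can be closed on shutdown.
        private volatile Watch watch;

        public ResilientPodWatcher(KubernetesClient client, String labelKey, String labelValue) {
            this.client = client;
            this.labelKey = labelKey;
            this.labelValue = labelValue;
        }

        public void start() {
            // No explicit resourceVersion: the API server starts the watch from its current state.
            this.watch = client.pods().withLabel(labelKey, labelValue).watch(this);
        }

        @Override
        public void eventReceived(Action action, Pod pod) {
            // React to ADDED / MODIFIED / DELETED TaskManager pod events here.
        }

        @Override
        public void onClose(KubernetesClientException cause) {
            // A non-null cause means an abnormal close, e.g. HTTP 410 "too old resource version".
            // Re-create the watch instead of escalating it to a fatal ResourceManager error.
            if (cause != null) {
                start();
            }
        }

        public static void main(String[] args) {
            KubernetesClient client = new DefaultKubernetesClient();
            new ResilientPodWatcher(client, "type", "flink-native-kubernetes").start();
        }
    }

Flink's own fix may look different internally, so treat this only as an illustration of the idea; upgrading to a kubernetes-client version that contains the fix referenced above is still the cleaner route.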
> From: Yang Wang <da...@gmail.com>
> Date: Sunday, August 9, 2020 at 10:29 PM
> To: "Bohinski, Kevin" <ke...@comcast.com>
> Cc: "user@flink.apache.org" <us...@flink.apache.org>
> Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers
>
> Hi Kevin,
>
> I think you may not have set the high availability configuration in your native K8s session. Currently, we only support ZooKeeper HA, so you need to add the following configuration. Once HA is configured, the running jobs, checkpoints and other metadata are stored, and when the jobmanager fails over, all the jobs can be recovered. I have tested that it works properly.
>
> high-availability: zookeeper
> high-availability.zookeeper.quorum: zk-client:2181
> high-availability.storageDir: hdfs:///flink/recovery
>
> I know you may not have a ZooKeeper cluster. You could use a ZooKeeper K8s operator[1] to deploy a new one.
>
> Moreover, it is not very convenient to use ZooKeeper for HA, so native K8s HA support[2] is planned and we are trying to finish it in the next major release cycle (1.12).
>
> [1]. https://github.com/pravega/zookeeper-operator
> [2]. https://issues.apache.org/jira/browse/FLINK-12884
>
> Best,
> Yang
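A side note for anyone setting this up on the native Kubernetes integration: the same keys can also be passed as dynamic properties when the session is started, instead of baking them into flink-conf.yaml. A rough example, where the cluster id, ZooKeeper quorum and storage directory are placeholders for your own values:

    ./bin/kubernetes-session.sh \
        -Dkubernetes.cluster-id=my-flink-session \
        -Dhigh-availability=zookeeper \
        -Dhigh-availability.zookeeper.quorum=zk-client:2181 \
        -Dhigh-availability.storageDir=hdfs:///flink/recovery

The storage directory must be on a filesystem the jobmanager can reach from inside the cluster (HDFS, S3, etc.), because that is where the job graphs and checkpoint metadata needed for recovery are written.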
> On Fri, Aug 7, 2020 at 11:40 PM Bohinski, Kevin <ke...@comcast.com> wrote:
>
> Hi all,
>
> In our 1.11.1 native K8s session, after we submit a job it will run successfully for a few hours and then fail when the jobmanager pod restarts.
>
> Relevant logs after the restart are attached below. Any suggestions?
>
> Best
> kevin
>
> 2020-08-06 21:50:24,425 INFO org.apache.flink.kubernetes.KubernetesResourceManager [] - Recovered 32 pods from previous attempts, current attempt id is 2.
> 2020-08-06 21:50:24,610 DEBUG org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: PodStatus(conditions=[PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=Initialized, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=Ready, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, message=null, reason=null, status=True, type=ContainersReady, additionalProperties={}), PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, status=True, type=PodScheduled, additionalProperties={})], containerStatuses=[ContainerStatus(containerID=docker://REDACTED, image=REDACTED/flink:1.11.1-scala_2.11-s3-0, imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, lastState=ContainerState(running=null, terminated=null, waiting=null, additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, started=true, state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z, additionalProperties={}), terminated=null, waiting=null, additionalProperties={}), additionalProperties={})], ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})
> 2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16
> 2020-08-06 21:50:24,615 INFO org.apache.flink.kubernetes.KubernetesResourceManager [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0
> 2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.
> java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_262]
>     at java.util.concurrent.CompletableFuture.uniExceptionallyStage(CompletableFuture.java:898) ~[?:1.8.0_262]
>     at java.util.concurrent.CompletableFuture.exceptionally(CompletableFuture.java:2209) ~[?:1.8.0_262]
>     at org.apache.flink.runtime.dispatcher.Dispatcher.requestJob(Dispatcher.java:552) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_262]
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_262]
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_262]
>     at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_262]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:284) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:199) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170) [flink-dist_2.11-1.11.1.jar:1.11.1]
>     at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) [flink-dist_2.11
>
> [message truncated...]