Hi Yang,

I upgraded to the latest v1.13.1, and the error logs look different, but the 
JobManager still crashes when I restart one of my three master nodes.

The logs from the JobManager are:

2021-07-12 19:50:04,854 [ERROR] 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector - Exception 
occurred while acquiring lock 'ConfigMapLock: default - 
streaker-resourcemanager-leader (48aa75e9-cfa1-4a1c-a3d2-a1f143af84b8)'
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException:
 Unable to update ConfigMapLock
                at 
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.update(ConfigMapLock.java:108)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:156)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$0(LeaderElector.java:82)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$3(LeaderElector.java:198)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) [?:?]
                at 
java.util.concurrent.FutureTask.runAndReset(FutureTask.java:305) [?:?]
                at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:305)
 [?:?]
                at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
[?:?]
                at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
[?:?]
                at java.lang.Thread.run(Thread.java:834) [?:?]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation: 
[replace]  for kind: [ConfigMap]  with name: [streaker-resourcemanager-leader]  
in namespace: [default]  failed.
                at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:88)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.api.model.DoneableConfigMap.done(DoneableConfigMap.java:26)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.api.model.DoneableConfigMap.done(DoneableConfigMap.java:5)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:92)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:36)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.update(ConfigMapLock.java:106)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                ... 9 more
Caused by: java.io.InterruptedIOException
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http2.Http2Stream.waitForIo(Http2Stream.java:642)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http2.Http2Stream.takeHeaders(Http2Stream.java:150)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http2.Http2Codec.readResponseHeaders(Http2Codec.java:131)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.CallServerInterceptor.intercept(CallServerInterceptor.java:88)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:45)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:126)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:134)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createHttpClient$3(HttpClientUtils.java:109)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:147)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:121)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:254)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
org.apache.flink.kubernetes.shaded.okhttp3.RealCall.execute(RealCall.java:92) 
~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:469)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:430)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleReplace(OperationSupport.java:289)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleReplace(OperationSupport.java:269)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleReplace(BaseOperation.java:820)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.lambda$replace$1(HasMetadataOperation.java:86)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.api.model.DoneableConfigMap.done(DoneableConfigMap.java:26)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.api.model.DoneableConfigMap.done(DoneableConfigMap.java:5)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:92)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.dsl.base.HasMetadataOperation.replace(HasMetadataOperation.java:36)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                at 
io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.update(ConfigMapLock.java:106)
 ~[flink-dist_2.12-1.13.1.jar:1.13.1]
                ... 9 more

It seems that, after the master node restarts, the running JobManager and its 
service account lose permission to read and write the 
streaker-resourcemanager-leader ConfigMap, and the Flink cluster has to restart 
the existing JobManager to re-acquire it.
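
For reference, here is a minimal Python sketch (not part of Flink; the function names are made up for illustration) that pulls the nested "Caused by:" entries out of a wrapped trace like the one above. On the excerpt below the innermost cause is java.io.InterruptedIOException, and the trace itself does not show an explicit "Forbidden" response, which may be worth keeping in mind when weighing the permission theory.

```python
import re
from typing import List, Optional

# Matches the exception class name that follows "Caused by:".
CAUSED_BY = re.compile(r"^Caused by: ([\w.$]+)")


def caused_by_chain(trace: str) -> List[str]:
    """Return the exception class of every 'Caused by:' line, outermost first."""
    chain = []
    for line in trace.splitlines():
        match = CAUSED_BY.match(line.strip())
        if match:
            chain.append(match.group(1))
    return chain


def root_cause(trace: str) -> Optional[str]:
    """Return the innermost cause, or None if the trace has no 'Caused by:' line."""
    chain = caused_by_chain(trace)
    return chain[-1] if chain else None


# Excerpt copied from the JobManager log above (line wrapping left as mailed).
sample = """\
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Operation:
[replace]  for kind: [ConfigMap]  with name: [streaker-resourcemanager-leader]
in namespace: [default]  failed.
                at
io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
Caused by: java.io.InterruptedIOException
"""

print(root_cause(sample))  # java.io.InterruptedIOException
```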

The cluster role binding is:

Name:         flinkapp-clusterrolebinding
Labels:       app.kubernetes.io/managed-by=skaffold
              skaffold.dev/run-id=ce4ed24a-9c40-4877-8e77-8e8b5561c5d2
Annotations:  meta.helm.sh/release-name: flinkapp
              meta.helm.sh/release-namespace: default
Role:
  Kind:  ClusterRole
  Name:  edit
Subjects:
  Kind            Name      Namespace
  ----            ----      ---------
  ServiceAccount  flinkapp  default

The service account is:

Name:                flinkapp
Namespace:           default
Labels:              app.kubernetes.io/managed-by=skaffold
                     skaffold.dev/run-id=ce4ed24a-9c40-4877-8e77-8e8b5561c5d2
Annotations:         meta.helm.sh/release-name: flinkapp
                     meta.helm.sh/release-namespace: default
Image pull secrets:  <none>
Mountable secrets:   flinkapp-token-mpbd5
Tokens:              flinkapp-token-mpbd5
Events:              <none>

I am not sure whether I am missing any configuration.
Any help would be appreciated!

Best,
Jerome

From: Yang Wang <danrtsey...@gmail.com>
Date: Wednesday, May 26, 2021 at 11:25 PM
To: Jerome Li <l...@vmware.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Jobmanager Crashes with Kubernetes HA When Restart Kubernetes 
Master Node
I think the exception you attached has been fixed via FLINK-22597 [1]. Could you 
please try the latest version?

Moreover, it is not the intended Flink behavior for a TaskManager to fail to 
retrieve the new JobManager address and re-register. Please share the logs of 
the stale TaskManagers so that we can continue debugging.


[1]. https://issues.apache.org/jira/browse/FLINK-22597

Best,
Yang

On Thu, May 27, 2021 at 4:54 AM, Jerome Li <l...@vmware.com> wrote:
Hi Yang,

Thanks for getting back to me.

By “restart master node”, I mean running “kubectl get nodes” to find the nodes 
with the master role, using “ssh” to log in to one of them as the ubuntu user, 
and then running “sudo /sbin/reboot -f” to restart that master node.

It looks like the JobManager then cancels the running job and logs the 
following:

2021-05-26 18:28:37,997 [INFO] 
org.apache.flink.runtime.executiongraph.ExecutionGraph       - Discarding the 
results produced by task execution 34eb9f5009dc7cf07117e720e7d393de.

2021-05-26 18:28:37,999 [INFO] 
org.apache.flink.runtime.checkpoint.DefaultCompletedCheckpointStore - Suspending

2021-05-26 18:28:37,999 [INFO] 
org.apache.flink.kubernetes.highavailability.KubernetesCheckpointIDCounter - 
Shutting down.

2021-05-26 18:28:38,000 [INFO] 
org.apache.flink.runtime.executiongraph.ExecutionGraph       - Job 
74fc5c858e50f5efc91db9ee16c17a8c has been suspended.

2021-05-26 18:28:38,007 [INFO] 
org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl     - Suspending 
SlotPool.

2021-05-26 18:28:38,007 [INFO] org.apache.flink.runtime.jobmaster.JobMaster     
            - Close ResourceManager connection 
5bac86fb0b5c984ef429225b8de82cc0: JobManager is no longer the leader..

2021-05-26 18:28:38,019 [INFO] 
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl      - JobManager 
runner for job hogger (74fc5c858e50f5efc91db9ee16c17a8c) was granted leadership 
with session id 14b9004a-3807-42e8-ac03-c0d77efe5611 at 
akka.tcp://flink@hoggerflink-jobmanager:6123/user/rpc/jobmanager_2.

2021-05-26 18:28:38,292 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,292 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,292 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,293 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,293 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,293 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,293 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,293 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.LocalFencedMessage 
until processing is started.

2021-05-26 18:28:38,293 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.LocalFencedMessage 
until processing is started.

2021-05-26 18:28:38,295 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.LocalFencedMessage 
until processing is started.

2021-05-26 18:28:38,295 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.LocalFencedMessage 
until processing is started.

2021-05-26 18:28:38,295 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,296 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.RemoteFencedMessage 
until processing is started.

2021-05-26 18:28:38,296 [INFO] 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         - The rpc endpoint 
org.apache.flink.runtime.jobmaster.JobMaster has not been started yet. 
Discarding message org.apache.flink.runtime.rpc.messages.LocalFencedMessage 
until processing is started.

2021-05-26 18:28:38,299 [ERROR] 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint        - Fatal error 
occurred in the cluster entrypoint.

org.apache.flink.util.FlinkException: JobMaster for job 
74fc5c858e50f5efc91db9ee16c17a8c failed.

       at 
org.apache.flink.runtime.dispatcher.Dispatcher.jobMasterFailed(Dispatcher.java:887)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.dispatcher.Dispatcher.dispatcherJobFailed(Dispatcher.java:465)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.dispatcher.Dispatcher.lambda$runJob$3(Dispatcher.java:426)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
java.util.concurrent.CompletableFuture.uniHandle(CompletableFuture.java:930) 
~[?:?]

       at 
java.util.concurrent.CompletableFuture$UniHandle.tryFire(CompletableFuture.java:907)
 ~[?:?]

       at 
java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:478)
 ~[?:?]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:440)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:208)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:77)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:158)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.Actor.aroundReceive(Actor.scala:517) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.Actor.aroundReceive$(Actor.scala:515) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.ActorCell.invoke(ActorCell.scala:561) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.Mailbox.run(Mailbox.scala:225) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.Mailbox.exec(Mailbox.scala:235) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) 
[flink-dist_2.12-1.12.2.jar:1.12.2]

Caused by: org.apache.flink.util.FlinkException: Could not start the job 
manager.

       at 
org.apache.flink.runtime.jobmaster.JobManagerRunnerImpl.lambda$handleException$7(JobManagerRunnerImpl.java:456)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859)
 ~[?:?]

       at 
java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837)
 ~[?:?]

       at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]

       at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]

       at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1044)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.OnComplete.internal(Future.scala:263) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.OnComplete.internal(Future.scala:261) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.ActorRef.tell(ActorRef.scala:126) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleCallAsync(AkkaRpcActor.java:423)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:210)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:100)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       ... 21 more

Caused by: java.util.concurrent.CompletionException: 
java.lang.IllegalStateException: DefaultLeaderRetrievalService can only be 
started once.

       at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:331)
 ~[?:?]

       at 
java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:346)
 ~[?:?]

       at 
java.util.concurrent.CompletableFuture$UniAccept.tryFire(CompletableFuture.java:704)
 ~[?:?]

       at 
java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) 
~[?:?]

       at 
java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:2088)
 ~[?:?]

       at 
org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:1044)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.OnComplete.internal(Future.scala:263) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.OnComplete.internal(Future.scala:261) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:191) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:73)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:68) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:284)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:284)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:284) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:573) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at akka.actor.ActorRef.tell(ActorRef.scala:126) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleCallAsync(AkkaRpcActor.java:423)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:210)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:100)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       ... 21 more

Caused by: java.lang.IllegalStateException: DefaultLeaderRetrievalService can 
only be started once.

       at 
org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService.start(DefaultLeaderRetrievalService.java:89)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobMasterServices(JobMaster.java:891)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.jobmaster.JobMaster.startJobExecution(JobMaster.java:864)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.jobmaster.JobMaster.lambda$start$1(JobMaster.java:381) 
~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleCallAsync(AkkaRpcActor.java:419)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:210)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       at 
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:100)
 ~[flink-dist_2.12-1.12.2.jar:1.12.2]

       ... 21 more

2021-05-26 18:28:38,310 [INFO] org.apache.flink.runtime.blob.BlobServer         
            - Stopped BLOB server at 0.0.0.0:6124
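
The innermost exception in the trace above comes from a start-once check (Preconditions.checkState, called from DefaultLeaderRetrievalService.start via JobMaster.startJobMasterServices). Below is a minimal Python sketch of that guard pattern. The class is a hypothetical stand-in, not Flink's code, and it assumes the same service instance is started again when leadership is granted a second time, which is how I read the log sequence.

```python
class StartOnceService:
    """Hypothetical stand-in for a service that may only be started once."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._started = False

    def start(self) -> None:
        # A second start() is rejected rather than silently ignored,
        # similar in spirit to the checkState precondition in the trace.
        if self._started:
            raise RuntimeError(f"{self.name} can only be started once.")
        self._started = True


def on_leadership_granted(service: StartOnceService) -> str:
    """Try to start the service and report what happened."""
    try:
        service.start()
        return "started"
    except RuntimeError as exc:
        return f"failed: {exc}"


svc = StartOnceService("DefaultLeaderRetrievalService")
print(on_leadership_granted(svc))  # first grant
print(on_leadership_granted(svc))  # leadership lost and granted again
```

With a guard like this, losing and regaining leadership on the same instance fails on the second start, which would match the "JobManager is no longer the leader" / "was granted leadership" sequence followed by the fatal error.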

Usually the cluster eventually recovers, but sometimes it does not: some of 
the TaskManagers cannot find the JobManager address, and I have to restart the 
stale TaskManagers manually.

Is this the intended Flink behavior, is it a bug, or am I missing something?

Best,
Jerome


From: Yang Wang <danrtsey...@gmail.com>
Date: Tuesday, May 25, 2021 at 1:03 AM
To: Jerome Li <l...@vmware.com>
Cc: user@flink.apache.org <user@flink.apache.org>
Subject: Re: Jobmanager Crashes with Kubernetes HA When Restart Kubernetes 
Master Node
By "restart master node", do you mean restarting the K8s master components 
(e.g. APIServer, etcd)?

Even if the master components are restarted, the Flink JobManager and 
TaskManagers should eventually recover.
Could you please share the JobManager logs so that we can debug why it 
crashed?


Best,
Yang

On Tue, May 25, 2021 at 3:43 AM, Jerome Li <l...@vmware.com> wrote:
Hi,

I am running Flink v1.12.2 in standalone mode on Kubernetes, with the 
Kubernetes HA services enabled for high availability.

HA works well when either a JobManager or a TaskManager pod is lost or crashes.

However, when I restart a master node, the JobManager pod always crashes and 
restarts. This causes the entire Flink cluster to restart, and most of the 
TaskManager pods restart as well.

I did not see this issue when using ZooKeeper for HA. I am not sure whether 
this is a bug that should be fixed or whether there is a workaround.


Below are my Flink settings.
Job-Manager

flink-conf.yaml:
----
jobmanager.rpc.address: streakerflink-jobmanager

high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id: /streaker
high-availability.jobmanager.port: 6123
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
kubernetes.cluster-id: streaker

rest.address: streakerflink-jobmanager
rest.bind-port: 8081
rest.port: 8081

state.checkpoints.dir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink/streaker

blob.server.port: 6124
metrics.internal.query-service.port: 6125
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9999

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 5 s

jobmanager.memory.process.size: 1768m

parallelism.default: 1

task.cancellation.timeout: 2000

web.log.path: /opt/flink/log/output.log
jobmanager.web.log.path: /opt/flink/log/output.log

web.submit.enable: false

Task-Manager

flink-conf.yaml:
----
jobmanager.rpc.address: streakerflink-jobmanager

high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id: /streaker
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
kubernetes.cluster-id: streaker

taskmanager.network.bind-policy: ip

taskmanager.data.port: 6121
taskmanager.rpc.port: 6122

restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 2147483647
restart-strategy.fixed-delay.delay: 5 s

taskmanager.memory.task.heap.size: 9728m
taskmanager.memory.framework.off-heap.size: 512m
taskmanager.memory.managed.size: 512m
taskmanager.memory.jvm-metaspace.size: 256m
taskmanager.memory.jvm-overhead.max: 3g
taskmanager.memory.jvm-overhead.fraction: 0.035
taskmanager.memory.network.fraction: 0.03
taskmanager.memory.network.max: 3g
taskmanager.numberOfTaskSlots: 1

taskmanager.jvm-exit-on-oom: true

metrics.internal.query-service.port: 6125
metrics.reporters: prom
metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 9999

web.log.path: /opt/flink/log/output.log
taskmanager.log.path: /opt/flink/log/output.log

task.cancellation.timeout: 2000
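
Both files above carry the same HA-related keys, so one quick sanity check is that the JobManager and TaskManager values agree. Below is a minimal Python sketch of such a check. The helper names and the list of keys are my own assumptions based on the settings shown here, not an official Flink list, and the parser only handles flat "key: value" lines where each pair sits on a single line.

```python
from typing import Dict, Optional, Tuple


def parse_flat_conf(text: str) -> Dict[str, str]:
    """Parse flat 'key: value' lines; headers, separators and comments are skipped."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ": " not in line:
            continue
        key, value = line.split(": ", 1)
        conf[key.strip()] = value.strip()
    return conf


# Assumed set of keys that should match between the two files.
SHARED_KEYS = (
    "high-availability",
    "high-availability.cluster-id",
    "high-availability.storageDir",
    "kubernetes.cluster-id",
)


def ha_mismatches(jm_text: str, tm_text: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {key: (jobmanager_value, taskmanager_value)} for every differing key."""
    jm, tm = parse_flat_conf(jm_text), parse_flat_conf(tm_text)
    return {k: (jm.get(k), tm.get(k)) for k in SHARED_KEYS if jm.get(k) != tm.get(k)}


jm_conf = """
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.cluster-id: /streaker
high-availability.storageDir: hdfs://hdfs-namenode-0.hdfs-namenode:8020/flink
kubernetes.cluster-id: streaker
"""
tm_conf = jm_conf  # the TaskManager file above uses the same four values

print(ha_mismatches(jm_conf, tm_conf))  # {}
```

For the two files in this mail the check comes back empty, so a mismatch in these keys does not look like the cause.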

Any help will be appreciated!

Thanks,
Jerome
