Hi Hu,

I am not sure why you need to start multiple JobManagers on Kubernetes. As described in the manual [1], we use a Deployment with a replica count of 1 so that Kubernetes detects a JobManager crash and starts a new one. What you need to do is add the high-availability configuration [2] to flink-conf.yaml. You can store your flink-conf.yaml in a ConfigMap [3] and mount it into the JobManager pod. Alternatively, you can update the flink-conf.yaml in your Flink image.
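As a rough sketch of what this could look like, here is a ConfigMap carrying a flink-conf.yaml with ZooKeeper-based HA enabled. The ConfigMap name, the ZooKeeper quorum addresses, the cluster id, and the storage directory are placeholders I made up for illustration; please replace them with the values from your own environment and check the keys against [2] for your Flink version.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config            # hypothetical name, pick your own
data:
  flink-conf.yaml: |
    # Enable ZooKeeper-based high availability (see [2]).
    high-availability: zookeeper
    # Placeholder quorum: point this at your own ZooKeeper ensemble.
    high-availability.zookeeper.quorum: zk-0.zk:2181,zk-1.zk:2181,zk-2.zk:2181
    high-availability.zookeeper.path.root: /flink
    # Placeholder cluster id, used to separate clusters sharing one ZooKeeper.
    high-availability.cluster-id: /my-flink-cluster
    # Placeholder path: must be durable storage reachable from every pod.
    high-availability.storageDir: hdfs:///flink/ha/
```

The JobManager Deployment would then keep its replica count at 1 and mount this ConfigMap as a volume over the Flink conf directory, so that a restarted JobManager pod can recover its state from ZooKeeper and the storage directory.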
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/jobmanager_high_availability.html
[3] https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/

胡逸才 <huyc...@163.com> wrote on Fri, Jun 28, 2019 at 11:09 AM:

> Hi Tan,
>
> I have the same problem as you when running Flink 1.7.2 on Kubernetes in
> HA mode. May I ask whether you have solved it, and if so, how? After I
> started the two JobManagers normally, I tried to kill one of them, and it
> could not restart normally. Both JobManagers reported this error. The
> specific log is as follows:
>
> 2019-06-28 09:57:57.253 [flink-akka.actor.default-dispatcher-4] WARN akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote connection to [null] failed with java.net.ConnectException: Connection refused: tdh2/192.168.208.55:56529
> 2019-06-28 09:57:57.253 [flink-akka.actor.default-dispatcher-4] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-14 - Association with remote system [akka.tcp://flink@tdh2:56529] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@tdh2:56529]] Caused by: [Connection refused: tdh2/192.168.208.55:56529]
> 2019-06-28 09:57:57.253 [flink-akka.actor.default-dispatcher-4] WARN akka.remote.ReliableDeliverySupervisor flink-akka.remote.default-remote-dispatcher-14 - Association with remote system [akka.tcp://flink@tdh2:56529] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@tdh2:56529]] Caused by: [Connection refused: tdh2/192.168.208.55:56529]
> 2019-06-28 09:57:57.260 [flink-rest-server-netty-worker-thread-7] ERROR o.a.f.r.rest.handler.legacy.files.StaticFileServerHandler - Could not retrieve the redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@tdh2:56529/user/dispatcher#299521377]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>     at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>     at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>     at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>     at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>     at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>     at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>     at org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
>     at akka.dispatch.OnComplete.internal(Future.scala:258)
>     at akka.dispatch.OnComplete.internal(Future.scala:256)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
>     at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
>     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
>     at org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
>     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
>     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
>     at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
>     at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
>     at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
>     at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
>     at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
>     at akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@tdh2:56529/user/dispatcher#299521377]] after [10000 ms]. Sender[null] sent message of type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>     at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>     ... 9 common frames omitted