Re: About "Flink 1.7.0 HA based on zookeepers "

Yang Wang Thu, 27 Jun 2019 23:12:07 -0700

Hi, hu

I am not sure why you need to start multiple jobmanagers on Kubernetes.
As described in the manual [1], we use a deployment with 1 replica so that
Kubernetes detects a jobmanager crash and starts a new one. What you need
to do is add the high availability configuration [2] to flink-conf.yaml.
You could use a configMap [3] to store your flink-conf.yaml and then mount
it into the jobmanager pod. Alternatively, you could update the
flink-conf.yaml in your Flink image.
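
For illustration, here is a minimal sketch of such a configMap and of how the
jobmanager pod could mount it. It is not a tested setup: the configMap name,
the volume name, the ZooKeeper addresses, the storage path, the cluster id and
the fixed port are all placeholders I made up, so replace them with your own
values. The configuration keys themselves are the ones described in [2].

```yaml
# Sketch only: every name, address and path below is a placeholder.
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config                      # hypothetical configMap name
data:
  flink-conf.yaml: |
    # Service name of the jobmanager, assumed to match the setup in [1]
    jobmanager.rpc.address: flink-jobmanager
    # Enable ZooKeeper-based HA
    high-availability: zookeeper
    # Placeholder quorum; list your own ZooKeeper hosts here
    high-availability.zookeeper.quorum: zk-0.zk:2181,zk-1.zk:2181,zk-2.zk:2181
    # Placeholder path; must be durable storage reachable by every jobmanager
    high-availability.storageDir: hdfs:///flink/ha/
    # Hypothetical id that separates this cluster's data in ZooKeeper
    high-availability.cluster-id: /my-flink-cluster
    # Assumption: pin the HA RPC port so it is not chosen at random
    high-availability.jobmanager.port: 6123
---
# Fragment of the jobmanager Deployment's pod spec (not a complete manifest)
spec:
  containers:
  - name: jobmanager
    volumeMounts:
    - name: flink-config-volume           # hypothetical volume name
      mountPath: /opt/flink/conf          # conf directory assumed for the official image
  volumes:
  - name: flink-config-volume
    configMap:
      name: flink-config
```

Note that mounting a volume over the whole conf directory hides the other
files the image ships there (for example the log4j configuration), so you may
need to add those files to the configMap as well.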


[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/deployment/kubernetes.html
[2]
https://ci.apache.org/projects/flink/flink-docs-release-1.8/ops/jobmanager_high_availability.html
[3]
https://kubernetes.io/docs/tasks/configure-pod-container/configure-pod-configmap/

胡逸才 <huyc...@163.com> wrote on Fri, Jun 28, 2019 at 11:09 AM:

> Hi Tan:
> I have the same problem as you when running Flink 1.7.2 on Kubernetes in
> HA mode. May I ask whether you have solved it, and how? After I started
> the two jobmanagers normally, I tried to kill one of them, and it could
> not restart normally. Both jobmanagers reported this error. The specific
> log is as follows:
>
>
>
>
> 2019-06-28 09:57:57.253 [flink-akka.actor.default-dispatcher-4] WARN
>  akka.remote.transport.netty.NettyTransport New I/O boss #3 - Remote
> connection to [null] failed with java.net.ConnectException: Connection
> refused: tdh2/192.168.208.55:56529
> 2019-06-28 09:57:57.253 [flink-akka.actor.default-dispatcher-4] WARN
>  akka.remote.ReliableDeliverySupervisor
> flink-akka.remote.default-remote-dispatcher-14 - Association with remote
> system [akka.tcp://flink@tdh2:56529] has failed, address is now gated for
> [50] ms. Reason: [Association failed with [akka.tcp://flink@tdh2:56529]]
> Caused by: [Connection refused: tdh2/192.168.208.55:56529]
> 2019-06-28 09:57:57.253 [flink-akka.actor.default-dispatcher-4] WARN
>  akka.remote.ReliableDeliverySupervisor
> flink-akka.remote.default-remote-dispatcher-14 - Association with remote
> system [akka.tcp://flink@tdh2:56529] has failed, address is now gated for
> [50] ms. Reason: [Association failed with [akka.tcp://flink@tdh2:56529]]
> Caused by: [Connection refused: tdh2/192.168.208.55:56529]
> 2019-06-28 09:57:57.260 [flink-rest-server-netty-worker-thread-7] ERROR
> o.a.f.r.rest.handler.legacy.files.StaticFileServerHandler  - Could not
> retrieve the redirect address.
> java.util.concurrent.CompletionException:
> akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://flink@tdh2:56529/user/dispatcher#299521377]] after
> [10000 ms]. Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
> at
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
> at
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
> at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
> at
> java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
> at
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
> at
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
> at
> org.apache.flink.runtime.concurrent.FutureUtils$1.onComplete(FutureUtils.java:772)
> at akka.dispatch.OnComplete.internal(Future.scala:258)
> at akka.dispatch.OnComplete.internal(Future.scala:256)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:186)
> at akka.dispatch.japi$CallbackBridge.apply(Future.scala:183)
> at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:36)
> at
> org.apache.flink.runtime.concurrent.Executors$DirectExecutionContext.execute(Executors.java:83)
> at
> scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:44)
> at
> scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:252)
> at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:603)
> at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:126)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:601)
> at
> scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
> at
> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:599)
> at
> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(LightArrayRevolverScheduler.scala:329)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.executeBucket$1(LightArrayRevolverScheduler.scala:280)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.nextTick(LightArrayRevolverScheduler.scala:284)
> at
> akka.actor.LightArrayRevolverScheduler$$anon$4.run(LightArrayRevolverScheduler.scala:236)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://flink@tdh2:56529/user/dispatcher#299521377]] after
> [10000 ms]. Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
> at
> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
> ... 9 common frames omitted
>
>
>
>
