Re: Flink Jobmanager Failover in HA mode

Dominik Wosiński Fri, 17 Aug 2018 11:04:27 -0700

I have faced this issue, but in 1.4.0 IIRC. This seems to be related to
https://issues.apache.org/jira/browse/FLINK-10011. What was the status of
the jobs when the main Job Manager has been stopped ?


2018-08-17 17:08 GMT+02:00 Helmut Zechmann <[email protected]>:

> Hi all,
>
> we have a problem with flink 1.5.2 high availability in standalone mode.
>
> We have two jobmanagers running. When I shut down the main job manager,
> the failover job manager encounters an error during failover.
>
> Logs:
>
>
> 2018-08-17 14:38:16,478 WARN  akka.remote.ReliableDeliverySupervisor
>                   - Association with remote system [akka.tcp://
> [email protected]:29095] has failed, address is now gated for [50]
> ms. Reason: [Disassociated]
> 2018-08-17 14:38:31,449 WARN  akka.remote.transport.netty.NettyTransport
>                   - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused:
> seg-1.adjust.com/178.162.219.66:29095
> 2018-08-17 14:38:31,451 WARN  akka.remote.ReliableDeliverySupervisor
>                   - Association with remote system [akka.tcp://
> [email protected]:29095] has failed, address is now gated for [50]
> ms. Reason: [Association failed with [akka.tcp://flink@seg-1.
> adjust.com:29095]] Caused by: [Connection refused:
> seg-1.adjust.com/178.162.219.66:29095]
> 2018-08-17 14:38:41,379 ERROR org.apache.flink.runtime.rest.
> handler.legacy.files.StaticFileServerHandler  - Could not retrieve the
> redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException:
> Ask timed out on [Actor[akka.tcp://[email protected]:29095/user/
> dispatcher#-1599908403]] after [10000 ms]. Sender[null] sent message of
> type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>         at java.util.concurrent.CompletableFuture.encodeThrowable(
> CompletableFuture.java:292)
>         [... shortened ...]
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://[email protected]:29095/user/dispatcher#-1599908403]]
> after [10000 ms]. Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>         at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(
> AskSupport.scala:604)
>         ... 9 more
> 2018-08-17 14:38:48,005 INFO  
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint
>   - http://seg-2.adjust.com:8083 was granted leadership with
> leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e
> 2018-08-17 14:38:48,005 INFO  
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
> - ResourceManager akka.tcp://[email protected].
> com:30169/user/resourcemanager was granted leadership with fencing token
> 8de829de14876a367a80d37194b944ee
> 2018-08-17 14:38:48,006 INFO  org.apache.flink.runtime.
> resourcemanager.slotmanager.SlotManager  - Starting the SlotManager.
> 2018-08-17 14:38:48,007 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher
>     - Dispatcher akka.tcp://[email protected]:30169/user/dispatcher
> was granted leadership with fencing token 684f50f8-327c-47e1-a53c-
> 931c4f4ea3e5
> 2018-08-17 14:38:48,007 INFO  
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher
>     - Recovering all persisted jobs.
> 2018-08-17 14:38:48,021 INFO  org.apache.flink.runtime.jobmanager.
> ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(
> b951bbf518bcf6cc031be6d2ccc441bb, null).
> 2018-08-17 14:38:48,028 INFO  org.apache.flink.runtime.jobmanager.
> ZooKeeperSubmittedJobGraphStore  - Recovered SubmittedJobGraph(
> 06ed64f48fa0a7cffde53b99cbaa073f, null).
> 2018-08-17 14:38:48,035 ERROR 
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
>        - Fatal error occurred in the cluster entrypoint.
> java.lang.RuntimeException: 
> org.apache.flink.runtime.client.JobExecutionException:
> Could not set up JobManager
>         at org.apache.flink.util.ExceptionUtils.rethrow(
> ExceptionUtils.java:199)
>         [... shortened ...]
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could
> not set up JobManager
>         at org.apache.flink.runtime.jobmaster.JobManagerRunner.<
> init>(JobManagerRunner.java:176)
>         at org.apache.flink.runtime.dispatcher.Dispatcher$
> DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
>         at org.apache.flink.runtime.dispatcher.Dispatcher.
> createJobManagerRunner(Dispatcher.java:291)
>         at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(
> Dispatcher.java:281)
>         at org.apache.flink.util.function.ConsumerWithException.accept(
> ConsumerWithException.java:38)
>         ... 21 more
> Caused by: java.lang.Exception: Cannot set up the user code libraries:
> /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_
> b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f2
> 02b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory)
>         at org.apache.flink.runtime.jobmaster.JobManagerRunner.<
> init>(JobManagerRunner.java:134)
>         ... 25 more
> Caused by: java.io.FileNotFoundException: /var/lib/flink/ceph/prod/1.5-
> batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-
> a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb
> (No such file or directory)
>         at java.io.FileInputStream.open0(Native Method)
>         [... shortened ...]
>         ... 25 more
> 2018-08-17 14:38:48,036 INFO  org.apache.flink.runtime.blob.TransientBlobCache
>             - Shutting down BLOB cache
> 2018-08-17 14:38:48,038 INFO  org.apache.flink.runtime.blob.BlobServer
>                   - Stopped BLOB server at 0.0.0.0:27073
>
>
> Our HA config is:
>
>
> high-availability: zookeeper
> high-availability.cluster-id: 1.5-batch
> high-availability.storageDir: file:///var/lib/flink/ceph//
> prod/1.5-batch/ha_state
> high-availability.zookeeper.path.root: /1.5-batch
> high-availability.zookeeper.quorum: kafka-4:2181,kafka-5:2181,kafka-6:2181
>
>
> Any ideas what might be the probleme here?
>
>
> Best,
>
> Helmut

Re: Flink Jobmanager Failover in HA mode

Reply via email to