I have faced this issue too, but in 1.4.0 IIRC. This seems to be related to https://issues.apache.org/jira/browse/FLINK-10011. What was the status of the jobs when the main Job Manager was stopped?
2018-08-17 17:08 GMT+02:00 Helmut Zechmann <hel...@adeven.com>:
> Hi all,
>
> we have a problem with flink 1.5.2 high availability in standalone mode.
>
> We have two jobmanagers running. When I shut down the main job manager,
> the failover job manager encounters an error during failover.
>
> Logs:
>
> 2018-08-17 14:38:16,478 WARN akka.remote.ReliableDeliverySupervisor
> - Association with remote system [akka.tcp://
> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50]
> ms. Reason: [Disassociated]
> 2018-08-17 14:38:31,449 WARN akka.remote.transport.netty.NettyTransport
> - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused:
> seg-1.adjust.com/178.162.219.66:29095
> 2018-08-17 14:38:31,451 WARN akka.remote.ReliableDeliverySupervisor
> - Association with remote system [akka.tcp://
> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50]
> ms. Reason: [Association failed with [akka.tcp://flink@seg-1.
> adjust.com:29095]] Caused by: [Connection refused:
> seg-1.adjust.com/178.162.219.66:29095]
> 2018-08-17 14:38:41,379 ERROR org.apache.flink.runtime.rest.
> handler.legacy.files.StaticFileServerHandler - Could not retrieve the
> redirect address.
> java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException:
> Ask timed out on [Actor[akka.tcp://fl...@seg-1.adjust.com:29095/user/
> dispatcher#-1599908403]] after [10000 ms]. Sender[null] sent message of
> type "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
> at java.util.concurrent.CompletableFuture.encodeThrowable(
> CompletableFuture.java:292)
> [... shortened ...]
> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
> [Actor[akka.tcp://fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]]
> after [10000 ms]. Sender[null] sent message of type
> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
> at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(
> AskSupport.scala:604)
> ... 9 more
> 2018-08-17 14:38:48,005 INFO
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint
> - http://seg-2.adjust.com:8083 was granted leadership with
> leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e
> 2018-08-17 14:38:48,005 INFO
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
> - ResourceManager akka.tcp://flink@seg-2.adjust.
> com:30169/user/resourcemanager was granted leadership with fencing token
> 8de829de14876a367a80d37194b944ee
> 2018-08-17 14:38:48,006 INFO org.apache.flink.runtime.
> resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
> 2018-08-17 14:38:48,007 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> - Dispatcher akka.tcp://fl...@seg-2.adjust.com:30169/user/dispatcher
> was granted leadership with fencing token 684f50f8-327c-47e1-a53c-
> 931c4f4ea3e5
> 2018-08-17 14:38:48,007 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher
> - Recovering all persisted jobs.
> 2018-08-17 14:38:48,021 INFO org.apache.flink.runtime.jobmanager.
> ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(
> b951bbf518bcf6cc031be6d2ccc441bb, null).
> 2018-08-17 14:38:48,028 INFO org.apache.flink.runtime.jobmanager.
> ZooKeeperSubmittedJobGraphStore - Recovered SubmittedJobGraph(
> 06ed64f48fa0a7cffde53b99cbaa073f, null).
> 2018-08-17 14:38:48,035 ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint
> - Fatal error occurred in the cluster entrypoint.
> java.lang.RuntimeException:
> org.apache.flink.runtime.client.JobExecutionException:
> Could not set up JobManager
> at org.apache.flink.util.ExceptionUtils.rethrow(
> ExceptionUtils.java:199)
> [... shortened ...]
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could
> not set up JobManager
> at org.apache.flink.runtime.jobmaster.JobManagerRunner.<
> init>(JobManagerRunner.java:176)
> at org.apache.flink.runtime.dispatcher.Dispatcher$
> DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
> at org.apache.flink.runtime.dispatcher.Dispatcher.
> createJobManagerRunner(Dispatcher.java:291)
> at org.apache.flink.runtime.dispatcher.Dispatcher.runJob(
> Dispatcher.java:281)
> at org.apache.flink.util.function.ConsumerWithException.accept(
> ConsumerWithException.java:38)
> ... 21 more
> Caused by: java.lang.Exception: Cannot set up the user code libraries:
> /var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_
> b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f2
> 02b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb (No such file or directory)
> at org.apache.flink.runtime.jobmaster.JobManagerRunner.<
> init>(JobManagerRunner.java:134)
> ... 25 more
> Caused by: java.io.FileNotFoundException: /var/lib/flink/ceph/prod/1.5-
> batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-
> a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb
> (No such file or directory)
> at java.io.FileInputStream.open0(Native Method)
> [... shortened ...]
> ... 25 more
> 2018-08-17 14:38:48,036 INFO org.apache.flink.runtime.blob.TransientBlobCache
> - Shutting down BLOB cache
> 2018-08-17 14:38:48,038 INFO org.apache.flink.runtime.blob.BlobServer
> - Stopped BLOB server at 0.0.0.0:27073
>
> Our HA config is:
>
> high-availability: zookeeper
> high-availability.cluster-id: 1.5-batch
> high-availability.storageDir: file:///var/lib/flink/ceph//
> prod/1.5-batch/ha_state
> high-availability.zookeeper.path.root: /1.5-batch
> high-availability.zookeeper.quorum: kafka-4:2181,kafka-5:2181,kafka-6:2181
>
> Any ideas what might be the probleme here?
>
> Best,
>
> Helmut
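
To narrow this down, it may help to check on each JobManager host whether the blob directories of the recovered jobs are actually present under the HA storage directory. Below is a minimal Python sketch of such a check. The directory layout (<storageDir>/<cluster-id>/blob/job_<job-id>) is only inferred from the path in the FileNotFoundException above, and the helper names ha_blob_dir and missing_blob_dirs are hypothetical, not part of Flink.

```python
import os
from urllib.parse import urlparse


def ha_blob_dir(storage_dir_uri, cluster_id, job_id):
    """Build the directory where the log above looked for a job's blobs.

    Assumption: the layout <storageDir>/<cluster-id>/blob/job_<job-id> is
    inferred from the exception message only, not from Flink documentation.
    """
    parsed = urlparse(storage_dir_uri)
    if parsed.scheme != "file":
        raise ValueError("this sketch only handles file:// storage directories")
    # normpath collapses the doubled slash present in the configured storageDir
    root = os.path.normpath(parsed.path)
    return os.path.join(root, cluster_id, "blob", "job_" + job_id)


def missing_blob_dirs(storage_dir_uri, cluster_id, job_ids):
    """Return the expected blob directories that do not exist on this host."""
    expected = [ha_blob_dir(storage_dir_uri, cluster_id, j) for j in job_ids]
    return [d for d in expected if not os.path.isdir(d)]
```

Running missing_blob_dirs on both seg-1 and seg-2 with the storageDir and cluster-id from the config and the two job IDs from the "Recovered SubmittedJobGraph" lines would show whether the failover host sees the same files as the main one. A non-empty result on only one host would suggest the storage directory is not shared between them, but that is a guess until it has been checked.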