Attaching the job manager log for reference.

2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
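Every warning in the log is a "Connection refused" for the same address, host1:28681, which usually means nothing is listening on that port at that moment. A plain TCP probe can confirm whether that is still the case. This is only a minimal sketch, assuming Python 3 is available on a machine that can resolve host1; the helper name port_is_open is hypothetical and not part of Flink.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


# Example call, using the host and port that appear in the log:
#   port_is_open("host1", 28681)
```

A False result only says the port is unreachable from where the probe runs; it does not by itself explain why the old dispatcher address is still being retried.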
Thanks,
Dinesh

On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj...@gmail.com> wrote:
> Hi all,
> We have a single-job YARN Flink cluster set up with High Availability.
> Sometimes, after a job manager failure, the next attempt restarts
> successfully from the current checkpoint.
> But occasionally we get the error below.
>
> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>
> Hadoop version: Hadoop 2.7.1.2.4.0.0-169
> Flink version: flink-1.7.2
> Zookeeper version: 3.4.6-169--1
>
> *Below is the Flink configuration*
>
> high-availability: zookeeper
> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
> high-availability.storageDir: hdfs:///flink/ha
> high-availability.zookeeper.path.root: /flink
> yarn.application-attempts: 10
> state.backend: rocksdb
> state.checkpoints.dir: hdfs:///flink/checkpoint
> state.savepoints.dir: hdfs:///flink/savepoint
> jobmanager.execution.failover-strategy: region
> restart-strategy: failure-rate
> restart-strategy.failure-rate.max-failures-per-interval: 3
> restart-strategy.failure-rate.failure-rate-interval: 5 min
> restart-strategy.failure-rate.delay: 10 s
>
> Can someone let me know if I am missing something, or whether this is a known issue?
> Is it related to a hostname-to-IP mapping issue or a ZooKeeper version issue?
>
> Thanks,
> Dinesh