Hi Dinesh,

If the current leader crashes (e.g. due to network failures), then getting these messages during the leader re-election does not look like a problem in itself. To me they look like warnings related to the failure that caused the failover.
Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or a job is not restarted after the current leader failure?

Best,
Andrey

On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <dineshj...@gmail.com> wrote:

> Attaching the job manager log for reference.
>
> 2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
> 2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
> 2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>
> Thanks,
> Dinesh
>
> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj...@gmail.com> wrote:
>
>> Hi all,
>> We have single job yarn flink cluster setup with High Availability.
>> Sometimes job manager failure successfully restarts next attempt from current checkpoint.
>> But occasionally we are getting below error.
>>
>> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>>
>> Hadoop version using : Hadoop 2.7.1.2.4.0.0-169
>> Flink version: flink-1.7.2
>> Zookeeper version: 3.4.6-169--1
>>
>> *Below is the flink configuration*
>>
>> high-availability: zookeeper
>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>> high-availability.storageDir: hdfs:///flink/ha
>> high-availability.zookeeper.path.root: /flink
>> yarn.application-attempts: 10
>> state.backend: rocksdb
>> state.checkpoints.dir: hdfs:///flink/checkpoint
>> state.savepoints.dir: hdfs:///flink/savepoint
>> jobmanager.execution.failover-strategy: region
>> restart-strategy: failure-rate
>> restart-strategy.failure-rate.max-failures-per-interval: 3
>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>> restart-strategy.failure-rate.delay: 10 s
>>
>> Can someone let know if I am missing something or is it a known issue?
>>
>> Is it something related to hostname ip mapping issue or zookeeper version issue?
>>
>> Thanks,
>> Dinesh
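
To tell a transient leader election from one that never completes, a client could poll the endpoint and retry while the response still carries the error quoted in the thread. The following is only a minimal sketch under stated assumptions: the function names, the retry parameters, and the `fetch` callable (whatever performs the actual HTTP request) are hypothetical and not part of Flink; only the error text is taken from the message above.

```python
import json
import time

# Substring of the error body quoted in the thread.
LEADER_ELECTION_MARKER = "ongoing leader election"


def is_leader_election_error(body: str) -> bool:
    """Return True if the response body is the leader-election error JSON."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    if not isinstance(payload, dict):
        return False
    errors = payload.get("errors", [])
    if not isinstance(errors, list):
        return False
    return any(LEADER_ELECTION_MARKER in str(err) for err in errors)


def wait_for_leader(fetch, attempts=10, delay_s=1.0, sleep=time.sleep):
    """Call fetch() until it stops returning the leader-election error.

    fetch is a hypothetical zero-argument callable returning the response
    body as a string. Raises TimeoutError if the error is still returned
    after `attempts` tries, which would suggest the election is stuck.
    """
    for attempt in range(1, attempts + 1):
        body = fetch()
        if not is_leader_election_error(body):
            return body
        if attempt < attempts:
            sleep(delay_s)
    raise TimeoutError(
        f"leader election still in progress after {attempts} attempts"
    )
```

If the error disappears after a few retries, the failover most likely worked and the warnings were only a side effect of it; if `wait_for_leader` times out, that points to a real problem with the election.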