Hi Dinesh,

If the current leader crashes (e.g. due to network failures), then seeing
these messages during the leader re-election does not look like a problem.
They look to me like plain warnings that come with the failover.

Do you observe any problem with your application? Does the failover not
work, e.g. is no leader elected, or is the job not restarted after the
current leader fails?
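
One way to check whether the failover actually completes is to poll the
REST endpoint and see whether the "ongoing leader election" error from your
mail goes away. Below is a minimal sketch in Python, not a definitive tool.
Assumptions: `fetch` is a hypothetical callable you supply (for example an
HTTP GET against your JobManager's REST address) that returns the response
body as a string; the function names, attempt count and delay are
illustrative and not part of any Flink API.

```python
import json
import time
from typing import Callable, Optional

# Substring of the error message quoted in this thread.
ELECTION_MARKER = "ongoing leader election"


def is_leader_election_error(body: str) -> bool:
    """Return True if body is a JSON object whose "errors" list
    contains the leader-election message."""
    try:
        payload = json.loads(body)
    except ValueError:
        return False
    if not isinstance(payload, dict):
        return False
    errors = payload.get("errors", [])
    if not isinstance(errors, list):
        return False
    return any(ELECTION_MARKER in str(e) for e in errors)


def wait_for_leader(
    fetch: Callable[[], str],
    attempts: int = 10,
    delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call fetch() until the response is no longer the leader-election
    error. Returns that response, or None if every attempt still
    reported an ongoing election."""
    for attempt in range(attempts):
        body = fetch()
        if not is_leader_election_error(body):
            return body
        if attempt < attempts - 1:
            sleep(delay_s)
    return None
```

If `wait_for_leader` keeps returning None well beyond the usual failover
time, that would suggest the election itself is stuck rather than just
slow, and the ZooKeeper side would be worth a closer look.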

Best,
Andrey

On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <dineshj...@gmail.com> wrote:

> Attaching the job manager log for reference.
>
> 2020-03-22 11:39:02,693 WARN
>  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  -
> Error while retrieving the leader gateway. Retrying to connect to
> akka.tcp://flink@host1:28681/user/dispatcher.
> 2020-03-22 11:39:02,724 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,724 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:02,791 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,792 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:02,861 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,861 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:02,931 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:02,931 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,001 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,002 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,071 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,071 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,141 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,141 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,211 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,211 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,281 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,282 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,351 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,351 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
> 2020-03-22 11:39:03,421 WARN  akka.remote.transport.netty.NettyTransport
>                  - Remote connection to [null] failed with
> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
> 2020-03-22 11:39:03,421 WARN  akka.remote.ReliableDeliverySupervisor
>                  - Association with remote system 
> [akka.tcp://flink@host1:28681]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
> host1/ipaddress1:28681]
>
> Thanks,
> Dinesh
>
> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj...@gmail.com> wrote:
>
>> Hi all,
>> We have a single-job Flink cluster on YARN set up with High Availability.
>> Sometimes, after a job manager failure, the next attempt restarts
>> successfully from the current checkpoint.
>> But occasionally we get the error below.
>>
>> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>>
>> Hadoop version: Hadoop 2.7.1.2.4.0.0-169
>>
>> Flink version: flink-1.7.2
>>
>> Zookeeper version: 3.4.6-169--1
>>
>>
>> *Below is the flink configuration*
>>
>> high-availability: zookeeper
>>
>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>
>> high-availability.storageDir: hdfs:///flink/ha
>>
>> high-availability.zookeeper.path.root: /flink
>>
>> yarn.application-attempts: 10
>>
>> state.backend: rocksdb
>>
>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>
>> state.savepoints.dir: hdfs:///flink/savepoint
>>
>> jobmanager.execution.failover-strategy: region
>>
>> restart-strategy: failure-rate
>>
>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>
>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>
>> restart-strategy.failure-rate.delay: 10 s
>>
>>
>>
>> Can someone let me know if I am missing something, or whether this is a
>> known issue?
>>
>> Is it related to a hostname/IP mapping issue or a ZooKeeper version issue?
>>
>> Thanks,
>>
>> Dinesh
