Re: Issue with single job yarn flink cluster HA

Dinesh J Wed, 25 Mar 2020 01:21:35 -0700

Hi Andrey,
Yes. Sometimes the job does not restart after the current leader fails.
Below is the message displayed when I try to reach the application master
URL via the YARN UI; the message stays the same even after the YARN job has
been running for 2 days.
During this time, the current YARN application attempt is not marked as
failed, and no containers are launched for the jobmanager or taskmanager.


*{"errors":["Service temporarily unavailable due to an ongoing leader
election. Please refresh."]}*
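
To tell a brief re-election apart from the stuck state described above, the response body can be checked mechanically. This is a minimal Python sketch, not anything shipped with Flink: the function names, the (timestamp, body) observation format, and the threshold are assumptions made for illustration, and only the error text itself is taken from the message above.

```python
import json

# Error text copied from the response shown above (trailing
# "Please refresh." omitted so minor wording changes still match).
ELECTION_MESSAGE = "Service temporarily unavailable due to an ongoing leader election"


def is_leader_election_response(body):
    """Return True if body is a JSON error payload reporting an ongoing election."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Not JSON at all (e.g. an HTML page), so not the error in question.
        return False
    if not isinstance(payload, dict):
        return False
    errors = payload.get("errors", [])
    if not isinstance(errors, list):
        return False
    return any(isinstance(e, str) and ELECTION_MESSAGE in e for e in errors)


def election_stuck(observations, threshold_seconds):
    """Check whether the most recent run of election responses is too long.

    observations is a hypothetical list of (timestamp_seconds, body) pairs,
    ordered by time, e.g. collected by polling the application master URL.
    """
    run_start = None
    last_ts = None
    for ts, body in observations:
        if is_leader_election_response(body):
            if run_start is None:
                run_start = ts
            last_ts = ts
        else:
            # Any other response means a leader answered; reset the run.
            run_start = None
            last_ts = None
    return run_start is not None and (last_ts - run_start) >= threshold_seconds
```

Polling itself is left out because the application master URL is specific to each deployment. A threshold well above the usual failover time (minutes rather than seconds) would avoid flagging a normal re-election.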

Thanks,
Dinesh

On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[email protected]>
wrote:

> Hi Dinesh,
>
> If the current leader crashes (e.g. due to network failures), then getting
> these messages does not look like a problem during the leader re-election.
> They look to me just as warnings that caused failover.
>
> Do you observe any problem with your application? Does the failover not
> work, e.g. no leader is elected or a job is not restarted after the current
> leader failure?
>
> Best,
> Andrey
>
> On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[email protected]> wrote:
>
>> Attaching the job manager log for reference.
>>
>> 2020-03-22 11:39:02,693 WARN
>>  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever  -
>> Error while retrieving the leader gateway. Retrying to connect to
>> akka.tcp://flink@host1:28681/user/dispatcher.
>> 2020-03-22 11:39:02,724 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:02,724 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:02,791 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:02,792 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:02,861 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:02,861 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:02,931 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:02,931 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,001 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,002 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,071 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,071 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,141 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,141 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,211 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,211 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,281 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,282 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,351 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,351 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>> 2020-03-22 11:39:03,421 WARN  akka.remote.transport.netty.NettyTransport
>>                    - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>> 2020-03-22 11:39:03,421 WARN  akka.remote.ReliableDeliverySupervisor
>>                    - Association with remote system 
>> [akka.tcp://flink@host1:28681]
>> has failed, address is now gated for [50] ms. Reason: [Association failed
>> with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused:
>> host1/ipaddress1:28681]
>>
>> Thanks,
>> Dinesh
>>
>> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[email protected]> wrote:
>>
>>> Hi all,
>>> We have a single-job Flink cluster on YARN, set up with high availability.
>>> Sometimes, after a job manager failure, the next attempt restarts
>>> successfully from the current checkpoint.
>>> But occasionally we get the error below.
>>>
>>> {"errors":["Service temporarily unavailable due to an ongoing leader 
>>> election. Please refresh."]}
>>>
>>> Hadoop version: Hadoop 2.7.1.2.4.0.0-169
>>>
>>> Flink version: flink-1.7.2
>>>
>>> Zookeeper version: 3.4.6-169--1
>>>
>>>
>>> *Below is the flink configuration*
>>>
>>> high-availability: zookeeper
>>>
>>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>>
>>> high-availability.storageDir: hdfs:///flink/ha
>>>
>>> high-availability.zookeeper.path.root: /flink
>>>
>>> yarn.application-attempts: 10
>>>
>>> state.backend: rocksdb
>>>
>>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>>
>>> state.savepoints.dir: hdfs:///flink/savepoint
>>>
>>> jobmanager.execution.failover-strategy: region
>>>
>>> restart-strategy: failure-rate
>>>
>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>
>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>
>>> restart-strategy.failure-rate.delay: 10 s
>>>
>>>
>>>
>>> Can someone let me know if I am missing something, or whether this is a
>>> known issue?
>>>
>>> Is it related to a hostname-to-IP mapping issue or a ZooKeeper version
>>> issue?
>>>
>>> Thanks,
>>>
>>> Dinesh
