Re: Issue with single job yarn flink cluster HA

Andrey Zagrebin Thu, 02 Apr 2020 02:53:30 -0700

Hi Dinesh,

Thanks for sharing the logs. There have been a couple of HA fixes since 1.7,
e.g. [1] and [2].
I would suggest trying Flink 1.10.
If the problem persists, could you also find the logs of the failed
JobManager from before the failover?


Best,
Andrey

[1] https://jira.apache.org/jira/browse/FLINK-14316
[2] https://jira.apache.org/jira/browse/FLINK-11843

On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <[email protected]> wrote:

> Hi Yang,
> I am attaching one full jobmanager log for a job which I reran today. This
> is a job that tries to read from a savepoint.
> The same error message, "leader election ongoing", is displayed, and it
> stays the same even after 30 minutes. If I leave the job without a yarn
> kill, it stays that way forever.
> Based on your suggestions so far, I guess it might be some zookeeper
> problem. If that is the case, what can I look out for in zookeeper to
> figure out the issue?
>
> Thanks,
> Dinesh
>
>
> On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <[email protected]> wrote:
>
>> I don't think your problem is related to the akka timeout. Increasing the
>> timeout could help in a heavily loaded cluster, especially when the
>> network is not very good. However, that is not your case.
>>
>> I am not sure what you mean by "never recovers". Do you mean the
>> "Connection refused" logs keep repeating and no other logs appear? How
>> long does it stay in "leader election ongoing"?
>> Usually it takes at most 60s: if the old jobmanager crashed, it loses the
>> leadership once its zookeeper session times out. So if the new jobmanager
>> is never granted the leadership, it may be because of some problem with
>> zookeeper.
>>
>> Maybe you could share the complete jobmanager logs so that we can see
>> what is happening in the jobmanager.
>>
>>
>> Best,
>> Yang
>>
>>
>> On Tue, Mar 31, 2020 at 3:46 AM Dinesh J <[email protected]> wrote:
>>
>>> Hi Yang,
>>> Thanks for the clarification and suggestion. But my problem is that
>>> recovery never happens and the message "leader election ongoing" is
>>> displayed forever.
>>> Do you think increasing akka.ask.timeout and akka.tcp.timeout will help
>>> in a heavily loaded cluster, as this issue happens mainly during heavy
>>> load in the cluster?
>>>
>>> Best,
>>> Dinesh
>>>
>>> On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[email protected]> wrote:
>>>
>>>> Hi Dinesh,
>>>>
>>>> First, I think the error message you provided is not a problem. It
>>>> just indicates that the leader election is still ongoing. When it
>>>> finishes, the new leader will start a new dispatcher to provide the
>>>> web UI and REST service.
>>>>
>>>> From your jobmanager logs "Connection refused: host1/ipaddress1:28681",
>>>> we can tell that the old jobmanager has failed. When a new jobmanager
>>>> starts, the old jobmanager still holds the lock of the leader latch,
>>>> so Flink tries to connect to it. After a few attempts, since the old
>>>> jobmanager's zookeeper client no longer updates the leader latch, the
>>>> new jobmanager wins the election and becomes the active leader. This
>>>> is just how the leader election works.
>>>>
>>>> In a nutshell, the root cause is that the old jobmanager crashed and
>>>> did not lose the leadership immediately. This behavior is by design.
>>>>
>>>> If you really want to make the recovery faster, I think you could
>>>> decrease "high-availability.zookeeper.client.connection-timeout"
>>>> and "high-availability.zookeeper.client.session-timeout". Please keep
>>>> in mind that too small a value will also cause unexpected failovers
>>>> because of network problems.
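>>>>
>>>> As a minimal sketch, the two settings would go into flink-conf.yaml
>>>> roughly as shown here. The values are illustrative assumptions only,
>>>> not recommendations; as far as I know the defaults are 60000 ms for
>>>> the session timeout and 15000 ms for the connection timeout.
>>>>
>>>> ```yaml
>>>> # Illustrative values only (assumed, not tuned for any real cluster).
>>>> # ZooKeeper session timeout in ms: a crashed leader keeps its latch
>>>> # until its session expires, so a lower value shortens that wait.
>>>> high-availability.zookeeper.client.session-timeout: 30000
>>>> # Timeout in ms for the client to establish a ZooKeeper connection.
>>>> high-availability.zookeeper.client.connection-timeout: 10000
>>>> ```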
>>>>
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> On Wed, Mar 25, 2020 at 4:20 PM Dinesh J <[email protected]> wrote:
>>>>
>>>>> Hi Andrey,
>>>>> Yes. The job sometimes does not restart after the current leader
>>>>> fails.
>>>>> Below is the message displayed when trying to reach the application
>>>>> master URL via the YARN UI, and the message remains the same even if
>>>>> the YARN job has been running for 2 days.
>>>>> During this time, the current YARN application attempt does not
>>>>> fail, and no containers are launched for the jobmanager or
>>>>> taskmanager.
>>>>>
>>>>> *{"errors":["Service temporarily unavailable due to an ongoing leader
>>>>> election. Please refresh."]}*
>>>>>
>>>>> Thanks,
>>>>> Dinesh
>>>>>
>>>>> On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Dinesh,
>>>>>>
>>>>>> If the current leader crashes (e.g. due to network failures), then
>>>>>> getting these messages does not look like a problem during the
>>>>>> leader re-election.
>>>>>> They look to me just like warnings related to the failover.
>>>>>>
>>>>>> Do you observe any problem with your application? Does the failover
>>>>>> not work, e.g. no leader is elected or a job is not restarted after the
>>>>>> current leader failure?
>>>>>>
>>>>>> Best,
>>>>>> Andrey
>>>>>>
>>>>>> On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Attaching the job manager log for reference.
>>>>>>>
>>>>>>> 2020-03-22 11:39:02,693 WARN  org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
>>>>>>> 2020-03-22 11:39:02,724 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,724 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:02,791 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,792 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:02,861 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,861 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:02,931 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,931 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,001 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,002 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,071 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,071 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,141 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,141 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,211 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,211 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,281 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,282 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,351 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,351 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,421 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,421 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dinesh
>>>>>>>
>>>>>>> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> We have a single-job Flink cluster on YARN set up with High
>>>>>>>> Availability. Sometimes, after a job manager failure, the next
>>>>>>>> attempt restarts successfully from the current checkpoint.
>>>>>>>> But occasionally we get the error below.
>>>>>>>>
>>>>>>>> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>>>>>>>>
>>>>>>>> Hadoop version: Hadoop 2.7.1.2.4.0.0-169
>>>>>>>>
>>>>>>>> Flink version: flink-1.7.2
>>>>>>>>
>>>>>>>> Zookeeper version: 3.4.6-169--1
>>>>>>>>
>>>>>>>>
>>>>>>>> *Below is the flink configuration*
>>>>>>>>
>>>>>>>> high-availability: zookeeper
>>>>>>>>
>>>>>>>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>>>>>>>
>>>>>>>> high-availability.storageDir: hdfs:///flink/ha
>>>>>>>>
>>>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>>>>
>>>>>>>> yarn.application-attempts: 10
>>>>>>>>
>>>>>>>> state.backend: rocksdb
>>>>>>>>
>>>>>>>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>>>>>>>
>>>>>>>> state.savepoints.dir: hdfs:///flink/savepoint
>>>>>>>>
>>>>>>>> jobmanager.execution.failover-strategy: region
>>>>>>>>
>>>>>>>> restart-strategy: failure-rate
>>>>>>>>
>>>>>>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>>>>>>
>>>>>>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>>>>>>
>>>>>>>> restart-strategy.failure-rate.delay: 10 s
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Can someone let me know if I am missing something, or is it a known issue?
>>>>>>>>
>>>>>>>> Is it related to a hostname/IP mapping issue or a zookeeper version issue?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dinesh