Hi Dinesh,

Thanks for sharing the logs. There have been a couple of HA fixes since 1.7, e.g. [1] and [2]. I would suggest trying Flink 1.10. If the problem persists, could you also find the logs of the failed Job Manager from before the failover?

Best,
Andrey

[1] https://jira.apache.org/jira/browse/FLINK-14316
[2] https://jira.apache.org/jira/browse/FLINK-11843

On Tue, Mar 31, 2020 at 6:49 AM Dinesh J <[email protected]> wrote:

> Hi Yang,
> I am attaching one full jobmanager log for a job which I reran today. This
> is a job that tries to read from a savepoint.
> The same error message "leader election ongoing" is displayed, and it stays
> the same even after 30 minutes. If I leave the job without a yarn kill, it
> stays the same forever.
> Based on your suggestions so far, I guess it might be some zookeeper
> problem. If that is the case, what can I look out for in zookeeper to figure
> out the issue?
>
> Thanks,
> Dinesh
>
>
> On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <[email protected]> wrote:
>
>> I think your problem is not about the akka timeout. Increasing the timeout
>> could help in a heavily loaded cluster, especially when the network is not
>> very good. However, that is not your case now.
>>
>> I am not sure about the "never recovery". Do you mean the
>> "Connection refused" logs keep going and there are no other logs?
>> How long does it stay in "leader election ongoing"?
>> Usually, it takes at most 60s, since if the old jobmanager crashed, it
>> will lose the leadership after the zookeeper session timeout. So if the
>> new jobmanager is never granted the leadership, it may be because of some
>> problem with zookeeper.
>>
>> Maybe you need to share the complete jobmanager logs so that we could
>> know what is happening in the jobmanager.
>>
>>
>> Best,
>> Yang
>>
>>
>> Dinesh J <[email protected]> wrote on Tue, Mar 31, 2020 at 3:46 AM:
>>
>>> Hi Yang,
>>> Thanks for the clarification and suggestion. But my problem was that
>>> recovery never happens and the message "leader election ongoing" is
>>> displayed forever.
>>> Do you think increasing akka.ask.timeout and akka.tcp.timeout will help
>>> in case of a heavy/highload cluster as this issue happens mainly during
>>> heavy load in cluster?
>>>
>>> Best,
>>> Dinesh
>>>
>>> On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <[email protected]> wrote:
>>>
>>>> Hi Dinesh,
>>>>
>>>> First, I think the error message you provided is not a problem. It
>>>> just indicates that the leader election is still ongoing. When it
>>>> finishes, the new leader will start a new dispatcher to provide
>>>> the webui and rest service.
>>>>
>>>> From your jobmanager logs "Connection refused: host1/ipaddress1:28681",
>>>> we can tell that the old jobmanager has failed. When a new jobmanager
>>>> starts, the old jobmanager still holds the lock of the leader latch,
>>>> so Flink tries to connect to it. After a few attempts, since the old
>>>> jobmanager's zookeeper client does not update the leader latch, the
>>>> new jobmanager will be elected successfully and become the active
>>>> leader. That is just how the leader election works.
>>>>
>>>> In a nutshell, the root cause is that the old jobmanager crashed and
>>>> does not lose the leadership immediately.
>>>> This is the by-design behavior.
>>>>
>>>> If you really want to make the recovery faster, I think you could
>>>> decrease "high-availability.zookeeper.client.connection-timeout"
>>>> and "high-availability.zookeeper.client.session-timeout". Please keep
>>>> in mind that too small a value will also cause unexpected failovers
>>>> because of network problems.
>>>>
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Dinesh J <[email protected]> wrote on Wed, Mar 25, 2020 at 4:20 PM:
>>>>
>>>>> Hi Andrey,
>>>>> Yes. The job is sometimes not restarting after the current leader
>>>>> failure.
>>>>> Below is the message displayed when trying to reach the application
>>>>> master url via the yarn ui, and the message remains the same even if
>>>>> the yarn job is running for 2 days.
>>>>> During this time, even the current yarn application attempt does not
>>>>> fail, and no containers are launched for jobmanager and
>>>>> taskmanager.
>>>>>
>>>>> *{"errors":["Service temporarily unavailable due to an ongoing leader
>>>>> election. Please refresh."]}*
>>>>>
>>>>> Thanks,
>>>>> Dinesh
>>>>>
>>>>> On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi Dinesh,
>>>>>>
>>>>>> If the current leader crashes (e.g. due to network failures), then
>>>>>> getting these messages does not look like a problem during the leader
>>>>>> re-election.
>>>>>> They look to me just as warnings that caused failover.
>>>>>>
>>>>>> Do you observe any problem with your application? Does the failover
>>>>>> not work, e.g. no leader is elected or a job is not restarted after the
>>>>>> current leader failure?
>>>>>>
>>>>>> Best,
>>>>>> Andrey
>>>>>>
>>>>>> On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Attaching the job manager log for reference.
>>>>>>>
>>>>>>> 2020-03-22 11:39:02,693 WARN
>>>>>>> org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever
>>>>>>> -
>>>>>>> Error while retrieving the leader gateway. Retrying to connect to
>>>>>>> akka.tcp://flink@host1:28681/user/dispatcher.
>>>>>>> 2020-03-22 11:39:02,724 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms.
>>>>>>> Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:02,791 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:02,861 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:02,931 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms.
>>>>>>> Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,001 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,071 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,141 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms.
>>>>>>> Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,211 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,281 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,351 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms.
>>>>>>> Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>> 2020-03-22 11:39:03,421 WARN
>>>>>>> akka.remote.transport.netty.NettyTransport - Remote
>>>>>>> connection to [null] failed with java.net.ConnectException: Connection
>>>>>>> refused: host1/ipaddress1:28681
>>>>>>> 2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor
>>>>>>> - Association with remote system
>>>>>>> [akka.tcp://flink@host1:28681] has failed, address is now gated for
>>>>>>> [50] ms. Reason: [Association failed with
>>>>>>> [akka.tcp://flink@host1:28681]]
>>>>>>> Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dinesh
>>>>>>>
>>>>>>> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> We have a single-job yarn flink cluster setup with High Availability.
>>>>>>>> Sometimes, after a job manager failure, the next attempt successfully
>>>>>>>> restarts from the current checkpoint.
>>>>>>>> But occasionally we get the error below.
>>>>>>>>
>>>>>>>> {"errors":["Service temporarily unavailable due to an ongoing leader
>>>>>>>> election.
>>>>>>>> Please refresh."]}
>>>>>>>>
>>>>>>>> Hadoop version used: Hadoop 2.7.1.2.4.0.0-169
>>>>>>>>
>>>>>>>> Flink version: flink-1.7.2
>>>>>>>>
>>>>>>>> Zookeeper version: 3.4.6-169--1
>>>>>>>>
>>>>>>>>
>>>>>>>> *Below is the flink configuration*
>>>>>>>>
>>>>>>>> high-availability: zookeeper
>>>>>>>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>>>>>>> high-availability.storageDir: hdfs:///flink/ha
>>>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>>>> yarn.application-attempts: 10
>>>>>>>> state.backend: rocksdb
>>>>>>>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>>>>>>> state.savepoints.dir: hdfs:///flink/savepoint
>>>>>>>> jobmanager.execution.failover-strategy: region
>>>>>>>> restart-strategy: failure-rate
>>>>>>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>>>>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>>>>>> restart-strategy.failure-rate.delay: 10 s
>>>>>>>>
>>>>>>>>
>>>>>>>> Can someone let me know if I am missing something, or is it a known
>>>>>>>> issue?
>>>>>>>>
>>>>>>>> Is it something related to a hostname/IP mapping issue or a zookeeper
>>>>>>>> version issue?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Dinesh
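
For reference, the two ZooKeeper client timeouts that Yang suggests decreasing earlier in this thread could be added to the flink-conf.yaml quoted above roughly as follows. This is only a sketch: the values are illustrative assumptions, not numbers taken from the thread. As far as I know, the defaults are 60000 ms for the session timeout and 15000 ms for the connection timeout, but please check the documentation for your Flink version.

```yaml
# Illustrative values only (assumptions, not recommendations from the thread).
# Both settings are in milliseconds.
high-availability.zookeeper.client.session-timeout: 30000
high-availability.zookeeper.client.connection-timeout: 10000
```

As Yang points out, values that are too small can themselves trigger unexpected failovers when the network is unreliable, so it is probably safer to lower them gradually.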
