Hi Yang, I am attaching one full jobmanager log for a job which I reran today. This is a job that tries to read from a savepoint. The same "leader election ongoing" error message is displayed, and it stays that way even after 30 minutes. If I leave the job without a yarn kill, it stays that way forever. Based on your suggestions so far, I guess it might be some ZooKeeper problem. If that is the case, what can I look out for in ZooKeeper to figure out the issue?
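
In the meantime, below is roughly what I plan to run against the quorum to see which znodes are still present under the HA root, and in particular whether the crashed jobmanager's ephemeral leader-latch entry is still hanging around. This is only a minimal sketch on my side, assuming plain Apache Curator is added as a dependency and using the quorum and root path from my configuration further down in this thread; the class and helper names are just illustrative, and the same check could also be done interactively with the ZooKeeper CLI.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class InspectFlinkHaZnodes {

    public static void main(String[] args) throws Exception {
        // Values taken from flink-conf.yaml: high-availability.zookeeper.quorum
        // and high-availability.zookeeper.path.root.
        String quorum = "host1:2181,host2:2181,host3:2181";
        try (CuratorFramework client = CuratorFrameworkFactory.newClient(
                quorum, new ExponentialBackoffRetry(1000, 3))) {
            client.start();
            client.blockUntilConnected();
            printTree(client, "/flink");
        }
    }

    // Recursively print every znode under the given path so that stale
    // leader-latch entries from a dead jobmanager become visible.
    private static void printTree(CuratorFramework client, String path) throws Exception {
        System.out.println(path);
        for (String child : client.getChildren().forPath(path)) {
            printTree(client, path.endsWith("/") ? path + child : path + "/" + child);
        }
    }
}

If I understand the mechanism correctly, an ephemeral leader-latch znode from the old jobmanager that is still listed well past the session timeout would point at a ZooKeeper session/cleanup problem rather than at Flink itself.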
Thanks,
Dinesh

On Tue, Mar 31, 2020 at 7:42 AM Yang Wang <danrtsey...@gmail.com> wrote:

> I think your problem is not about the akka timeout. Increasing the timeout could help in a heavily loaded cluster, especially when the network is not very good. However, that is not your case here.
>
> I am not sure about the "never recovers" part. Do you mean the "Connection refused" logs keep repeating and there are no other logs? How long does it stay in "leader election ongoing"? Usually it takes at most 60s, because once the old jobmanager has crashed it loses the leadership after the zookeeper session timeout. So if the new jobmanager can never acquire the leadership, it may be because of some problem with zookeeper.
>
> Maybe you need to share the complete jobmanager logs so that we can see what is happening in the jobmanager.
>
> Best,
> Yang
>
> On Tue, Mar 31, 2020 at 3:46 AM, Dinesh J <dineshj...@gmail.com> wrote:
>
>> Hi Yang,
>> Thanks for the clarification and suggestion. But my problem is that recovery never happens, and "leader election ongoing" is the message displayed forever.
>> Do you think increasing akka.ask.timeout and akka.tcp.timeout will help in the case of a heavily loaded cluster, since this issue happens mainly during heavy load on the cluster?
>>
>> Best,
>> Dinesh
>>
>> On Mon, Mar 30, 2020 at 2:29 PM Yang Wang <danrtsey...@gmail.com> wrote:
>>
>>> Hi Dinesh,
>>>
>>> First, I think the error message you provided is not a problem in itself. It just indicates that the leader election is still ongoing. When it finishes, the new leader will start a new dispatcher to provide the webui and rest service.
>>>
>>> From your jobmanager logs ("Connection refused: host1/ipaddress1:28681") we can tell that the old jobmanager has failed. When a new jobmanager starts, the old jobmanager still holds the lock on the leader latch, so Flink tries to connect to it. After a few attempts, since the old jobmanager's zookeeper client no longer updates the leader latch, the new jobmanager wins the election and becomes the active leader. That is simply how the leader election works.
>>>
>>> In a nutshell, the root cause is that the old jobmanager crashed and does not lose the leadership immediately. This is the by-design behavior.
>>>
>>> If you really want to make the recovery faster, I think you could decrease "high-availability.zookeeper.client.connection-timeout" and "high-availability.zookeeper.client.session-timeout". Please keep in mind that values that are too small can also cause unexpected failovers due to network problems.
>>>
>>> Best,
>>> Yang
>>>
>>> On Wed, Mar 25, 2020 at 4:20 PM, Dinesh J <dineshj...@gmail.com> wrote:
>>>
>>>> Hi Andrey,
>>>> Yes. The job sometimes does not restart after the current leader fails.
>>>> Below is the message displayed when trying to reach the application master url via the yarn ui, and it remains the same even when the yarn job has been running for 2 days.
>>>> During this time even the current yarn application attempt does not fail, and no containers are launched for the jobmanager and taskmanager.
>>>>
>>>> *{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}*
>>>>
>>>> Thanks,
>>>> Dinesh
>>>>
>>>> On Tue, Mar 24, 2020 at 6:45 PM Andrey Zagrebin <azagre...@apache.org> wrote:
>>>>
>>>>> Hi Dinesh,
>>>>>
>>>>> If the current leader crashes (e.g. due to network failures), then getting these messages does not look like a problem during the leader re-election. They look to me just like warnings caused by the failover.
>>>>>
>>>>> Do you observe any problem with your application? Does the failover not work, e.g. no leader is elected or the job is not restarted after the current leader failure?
>>>>>
>>>>> Best,
>>>>> Andrey
>>>>>
>>>>> On Sun, Mar 22, 2020 at 11:14 AM Dinesh J <dineshj...@gmail.com> wrote:
>>>>>
>>>>>> Attaching the job manager log for reference.
>>>>>>
>>>>>> 2020-03-22 11:39:02,693 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@host1:28681/user/dispatcher.
>>>>>> 2020-03-22 11:39:02,724 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,724 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:02,791 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,792 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:02,861 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,861 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:02,931 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:02,931 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,001 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,002 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,071 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,071 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,141 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,141 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,211 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,211 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,281 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,282 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,351 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,351 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>> 2020-03-22 11:39:03,421 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: host1/ipaddress1:28681
>>>>>> 2020-03-22 11:39:03,421 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@host1:28681] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@host1:28681]] Caused by: [Connection refused: host1/ipaddress1:28681]
>>>>>>
>>>>>> Thanks,
>>>>>> Dinesh
>>>>>>
>>>>>> On Sun, Mar 22, 2020 at 1:25 PM Dinesh J <dineshj...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> We have a single-job yarn flink cluster setup with high availability.
>>>>>>> Sometimes, after a job manager failure, the next attempt successfully restarts from the current checkpoint.
>>>>>>> But occasionally we get the error below.
>>>>>>>
>>>>>>> {"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
>>>>>>>
>>>>>>> Hadoop version in use: Hadoop 2.7.1.2.4.0.0-169
>>>>>>> Flink version: flink-1.7.2
>>>>>>> Zookeeper version: 3.4.6-169--1
>>>>>>>
>>>>>>> *Below is the flink configuration*
>>>>>>>
>>>>>>> high-availability: zookeeper
>>>>>>> high-availability.zookeeper.quorum: host1:2181,host2:2181,host3:2181
>>>>>>> high-availability.storageDir: hdfs:///flink/ha
>>>>>>> high-availability.zookeeper.path.root: /flink
>>>>>>> yarn.application-attempts: 10
>>>>>>> state.backend: rocksdb
>>>>>>> state.checkpoints.dir: hdfs:///flink/checkpoint
>>>>>>> state.savepoints.dir: hdfs:///flink/savepoint
>>>>>>> jobmanager.execution.failover-strategy: region
>>>>>>> restart-strategy: failure-rate
>>>>>>> restart-strategy.failure-rate.max-failures-per-interval: 3
>>>>>>> restart-strategy.failure-rate.failure-rate-interval: 5 min
>>>>>>> restart-strategy.failure-rate.delay: 10 s
>>>>>>>
>>>>>>> Can someone let me know if I am missing something, or is this a known issue?
>>>>>>> Is it related to a hostname/IP mapping issue or a zookeeper version issue?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dinesh
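
P.S. Following Yang's earlier suggestion in this thread, these are the two ZooKeeper client timeouts I am planning to lower in flink-conf.yaml so that a crashed jobmanager loses the leader latch sooner. The values below are only my own rough guess (in milliseconds), not something recommended in the thread, and as Yang warned, setting them too low can trigger spurious failovers on a flaky network:

high-availability.zookeeper.client.session-timeout: 30000
high-availability.zookeeper.client.connection-timeout: 10000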
Attachment: full_log_failed_container.log