Hi Abhinav,

The problem is this Curator error:

    Background operation retry gave up

That is, the ZK ensemble was too unstable to recover in time, so Curator gave up retrying and threw a fatal error.
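If you cannot make the ensemble recover faster, you can widen the window in which Curator keeps retrying. A sketch for flink-conf.yaml; the option names are from the Flink docs, but the values are only illustrative and should be tuned to the length of outage you expect:

    # Curator stops retrying after max-retry-attempts; together with
    # retry-wait this bounds how long a ZK outage can be ridden out.
    high-availability.zookeeper.client.session-timeout: 60000     # ms (the default)
    high-availability.zookeeper.client.connection-timeout: 15000  # ms (the default)
    high-availability.zookeeper.client.retry-wait: 5000           # ms between retries (the default)
    high-availability.zookeeper.client.max-retry-attempts: 10     # the default is 3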
Best,
tison.

On Wed, Mar 18, 2020 at 10:22 AM, Xintong Song <tonysong...@gmail.com> wrote:

> I'm not familiar with ZK either.
>
> I've copied Yang Wang, who might be able to provide some suggestions.
>
> Alternatively, you can try to post your question to the Apache ZooKeeper
> community and see if they have any clue.
>
> Thank you~
> Xintong Song
>
> On Wed, Mar 18, 2020 at 8:12 AM Bajaj, Abhinav <abhinav.ba...@here.com> wrote:
>
>> Hi Xintong,
>>
>> I did check the ZK logs and didn't notice anything interesting. I have
>> limited expertise in ZooKeeper; can you share an example of what I should
>> be looking for?
>>
>> I was able to reproduce this issue again with Flink 1.7 by killing the
>> ZooKeeper leader, which disrupted the quorum. The sequence of logs in this
>> case looks quite similar to the one we have been discussing.
>>
>> If the code in this area hasn't changed up to 1.10, then the latest
>> version may have the same potential issue.
>>
>> It's not straightforward to bump up the Flink version in the
>> infrastructure available to me, but I will think about whether there is a
>> way around it.
>>
>> ~ Abhinav Bajaj
>>
>> From: Xintong Song <tonysong...@gmail.com>
>> Date: Monday, March 16, 2020 at 8:00 PM
>> To: "Bajaj, Abhinav" <abhinav.ba...@here.com>
>> Cc: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: JobMaster does not register with ResourceManager in high availability setup
>>
>> Hi Abhinav,
>>
>> I think you are right. The log confirms that the JobMaster has not tried
>> to connect to the ResourceManager. Most likely the JobMaster requested the
>> RM address but never received it.
>>
>> I would suggest you check the ZK logs to see whether the request from the
>> JM for the RM address was received and properly responded to.
>>
>> If you can easily reproduce this problem, and you are able to build Flink
>> from source, you can also try to insert more logs in Flink to further
>> confirm whether the RM address is received. I don't think that's necessary
>> though, since those code paths have not changed from Flink 1.7 through the
>> latest 1.10, and I'm not aware of any reported issue where the JM does not
>> try to connect to the RM once the address is received.
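>> If you do go down that route, a minimal sketch of such logging is below.
>> The wrapper class and its wiring are hypothetical (it would be registered
>> wherever the JobMaster hooks its listener into the
>> ZooKeeperLeaderRetrievalService); only the LeaderRetrievalListener
>> interface is taken from flink-runtime as of 1.7:
>>
>>     import java.util.UUID;
>>     import org.apache.flink.runtime.leaderretrieval.LeaderRetrievalListener;
>>     import org.slf4j.Logger;
>>     import org.slf4j.LoggerFactory;
>>
>>     /** Hypothetical wrapper that logs every leader-address notification. */
>>     public class LoggingLeaderRetrievalListener implements LeaderRetrievalListener {
>>
>>         private static final Logger LOG =
>>                 LoggerFactory.getLogger(LoggingLeaderRetrievalListener.class);
>>
>>         private final LeaderRetrievalListener delegate;
>>
>>         public LoggingLeaderRetrievalListener(LeaderRetrievalListener delegate) {
>>             this.delegate = delegate;
>>         }
>>
>>         @Override
>>         public void notifyLeaderAddress(String leaderAddress, UUID leaderSessionID) {
>>             // If this line never appears, the JM never received the RM address from ZK.
>>             LOG.info("Leader address notification: {} (session {})",
>>                     leaderAddress, leaderSessionID);
>>             delegate.notifyLeaderAddress(leaderAddress, leaderSessionID);
>>         }
>>
>>         @Override
>>         public void handleError(Exception exception) {
>>             LOG.error("Leader retrieval error", exception);
>>             delegate.handleError(exception);
>>         }
>>     }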
>> Thank you~
>> Xintong Song
>>
>> On Tue, Mar 17, 2020 at 7:45 AM Bajaj, Abhinav <abhinav.ba...@here.com> wrote:
>>
>> Hi Xintong,
>>
>> Apologies for the delayed response; I was away for a week. I am attaching
>> more JobManager logs.
>>
>> To your point on the TaskManagers: the job is deployed with a parallelism
>> of 20, but it has 22 TMs so that 2 of them serve as spares to assist in a
>> quick failover. I did check the logs, and all 22 task executors from those
>> TMs are registered by 2020-02-27 06:35:47.050.
>>
>> You will notice that even after this time, the job fails at
>> 2020-02-27 06:40:36.778 with the error
>> "org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Could not allocate all requires slots within timeout of 300000 ms. Slots
>> required: 201, slots allocated: 0".
>>
>> Thanks a ton for your help.
>>
>> ~ Abhinav Bajaj
>>
>> From: Xintong Song <tonysong...@gmail.com>
>> Date: Thursday, March 5, 2020 at 6:30 PM
>> To: "Bajaj, Abhinav" <abhinav.ba...@here.com>
>> Cc: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: JobMaster does not register with ResourceManager in high availability setup
>>
>> Hi Abhinav,
>>
>> Thanks for the log. However, the attached log seems to be incomplete; the
>> NoResourceAvailableException cannot be found in it.
>>
>> Regarding connecting to the ResourceManager, the log suggests that:
>>
>> - ZK was back to life and connected at 06:29:56.
>>   2020-02-27 06:29:56.539 [main-EventThread] level=INFO o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager - State change: CONNECTED
>> - The RM registered to ZK and was granted leadership at 06:30:01.
>>   2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-5] level=INFO o.a.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@JOBMANAGER:6126/user/resourcemanager was granted leadership with fencing token a2c453481ea4e0c7722cab1e4dd741db
>> - The JM requested the RM leader address from ZK at 06:30:06.
>>   2020-02-27 06:30:06.272 [flink-akka.actor.default-dispatcher-17] level=INFO o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>> - The RM leader address is delivered asynchronously, and only after that
>>   will the JM try to connect to the RM (printing the "Connecting to
>>   ResourceManager" log). The attached log ends about 100 ms after the JM
>>   requested the RM leader address, which is too soon to tell whether the RM
>>   connected properly.
>>
>> Another finding is about the TM registration. According to the log:
>>
>> - The parallelism of your job is 20, which means it needs 20 slots to be
>>   executed.
>> - There are only 5 TMs registered. (Search for "Registering TaskManager
>>   with ResourceID"; a quick way to count these is sketched below.)
>> - Assuming you have the same configuration for the JM and the TMs (this
>>   might not always be true), you have one slot per TM:
>>   599 2020-02-27 06:28:56.495 [main] level=INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 1
>> - That suggests it is possible that not all the TaskExecutors were
>>   recovered/reconnected, leading to the NoResourceAvailableException. We
>>   would need the rest of the log (from where the current one ends to the
>>   NoResourceAvailableException) to tell what happened during scheduling.
>>   Also, could you confirm how many TMs you use?
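>> A quick cross-check of those counts, assuming the standard single log file
>> (the file name is illustrative):
>>
>>     # How many TaskManagers actually registered with the ResourceManager?
>>     grep -c "Registering TaskManager with ResourceID" jobmanager.log
>>
>>     # Which slots-per-TM setting was actually loaded?
>>     grep "taskmanager.numberOfTaskSlots" jobmanager.log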
>> Thank you~
>> Xintong Song
>>
>> On Fri, Mar 6, 2020 at 5:55 AM Bajaj, Abhinav <abhinav.ba...@here.com> wrote:
>>
>> Hi Xintong,
>>
>> I highly appreciate your assistance here. I am attaching the JobManager
>> log for reference. Let me share my quick responses to what you mentioned.
>>
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected.
>>
>> XS: Sometimes you see this log because the ResourceManager is not yet
>> connected when the slot request arrives at the SlotPool. If the
>> ResourceManager is connected later, the SlotPool will still send the
>> pending slot requests; in that case you should find logs of the SlotPool
>> requesting slots from the ResourceManager.
>>
>> AB: Yes, I have noticed that behavior in scenarios where the
>> ResourceManager and JobManager connect successfully: the requests fail
>> initially and are served later once the two are connected. I don't think
>> that happened in this case, but you have access to the JobManager logs to
>> check my understanding.
>>
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms...
>>
>> XS: This error message simply means that the slot requests were not
>> satisfied within 5 minutes. Various reasons might cause this problem.
>>
>> - The ResourceManager is not connected at all.
>>   AB: I think the ResourceManager is not connected to the JobMaster, or
>>   vice versa. My basis is the absence of the logs below:
>>   - org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager....
>>   - o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager....
>> - The ResourceManager is connected, but some TaskExecutors are not
>>   registered due to the ZK problem.
>>   AB: I think the TaskExecutors were able to register, or were in the
>>   process of registering, with the ResourceManager.
>> - ZK recovery takes too much time, so that even though the JM, RM, and TMs
>>   are all able to connect to ZK, there might not be enough time to satisfy
>>   the slot request before the timeout.
>>   AB: To help check that, maybe you can use these log timestamps:
>>   - 2020-02-27 06:29:53,732 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@64] - FOLLOWING - LEADER ELECTION TOOK - 25069
>>   - 2020-02-27 06:29:53,766 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:Learner@332] - Getting a diff from the leader 0x200002bf6
>>
>> Thanks a lot for looking into this.
>>
>> ~ Abhinav Bajaj
>>
>> From: Xintong Song <tonysong...@gmail.com>
>> Date: Wednesday, March 4, 2020 at 7:17 PM
>> To: "Bajaj, Abhinav" <abhinav.ba...@here.com>
>> Cc: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: JobMaster does not register with ResourceManager in high availability setup
>>
>> Hi Abhinav,
>>
>> Do you mind sharing the complete 'jobmanager.log'?
>>
>> org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected.
>>
>> Sometimes you see this log because the ResourceManager is not yet
>> connected when the slot request arrives at the SlotPool. If the
>> ResourceManager is connected later, the SlotPool will still send the
>> pending slot requests; in that case you should find logs of the SlotPool
>> requesting slots from the ResourceManager.
>>
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms...
>>
>> This error message simply means that the slot requests were not satisfied
>> within 5 minutes. Various reasons might cause this problem.
>>
>> - The ResourceManager is not connected at all.
>> - The ResourceManager is connected, but some TaskExecutors are not
>>   registered due to the ZK problem.
>> - ZK recovery takes too much time, so that even though the JM, RM, and TMs
>>   are all able to connect to ZK, there might not be enough time to satisfy
>>   the slot request before the timeout.
>>
>> We would need the complete 'jobmanager.log' (at least from the job restart
>> to the NoResourceAvailableException) to find out which is the case.
>>
>> Thank you~
>> Xintong Song
>>
>> On Thu, Mar 5, 2020 at 7:30 AM Bajaj, Abhinav <abhinav.ba...@here.com> wrote:
>>
>> While I set up to reproduce the issue with debug logs, I would like to
>> share more information I noticed in the INFO logs.
>>
>> Below is the sequence of events/exceptions I noticed during the time
>> ZooKeeper was disrupted. I apologize in advance, as they are a bit verbose.
>>
>> - ZooKeeper seems to be down, and leader election is disrupted:
>>
>> 2020-02-27 06:28:23.572 [Curator-ConnectionStateManager-0] level=WARN o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@FOO_BAR:6126/user/resourcemanager no longer participates in the leader election.
>> 2020-02-27 06:28:23.573 [Curator-ConnectionStateManager-0] level=INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager for job FOO_BAR (5a910928a71b469a091be168b0e74722) was revoked leadership at akka.tcp://flink@FOO_BAR:6126/user/jobmanager_1.
>> 2020-02-27 06:28:23.573 [flink-akka.actor.default-dispatcher-9897] level=INFO o.a.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@FOO_BAR:6126/user/resourcemanager was revoked leadership. Clearing fencing token.
>> 2020-02-27 06:28:23.574 [flink-akka.actor.default-dispatcher-9897] level=INFO o.a.flink.runtime.resourcemanager.slotmanager.SlotManager - Suspending the SlotManager.
>> 2020-02-27 06:28:53.577 [Curator-Framework-0] level=ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error occurred in the cluster entrypoint.
>> org.apache.flink.runtime.dispatcher.DispatcherException: Received an error from the LeaderElectionService.
>>     at org.apache.flink.runtime.dispatcher.Dispatcher.handleError(Dispatcher.java:941)
>>     at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:416)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:576)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:572)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>>     at org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:571)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:740)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>     at java.lang.Thread.run(Thread.java:748)
>> Caused by: org.apache.flink.util.FlinkException: Unhandled error in ZooKeeperLeaderElectionService: Background operation retry gave up
>>     ... 18 common frames omitted
>> Caused by: org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
>>     at org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
>>     ... 10 common frames omitted
>>
>> - The ClusterEntrypoint restarts and tries to connect to ZooKeeper. It
>>   seems it fails for some time but is able to connect later:
>>
>> 2020-02-27 06:28:56.467 [main] level=INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting StandaloneSessionClusterEntrypoint (Version: 1.7.1, Rev:89eafb4, Date:14.12.2018 @ 15:48:34 GMT)
>> 2020-02-27 06:29:16.477 [main] level=ERROR o.a.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed out for connection string (ZOO_BAR_0:2181, ZOO_BAR_1:2181, ZOO_BAR_2:2181) and timeout (15000) / elapsed (15969)
>> org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
>>     at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
>>     at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
>>     at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
>>     at org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
>>     at org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
>>     at org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.start(ZooKeeperLeaderElectionService.java:134)
>>     at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.startInternal(WebMonitorEndpoint.java:712)
>>     at org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:218)
>>     at org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:145)
>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:163)
>>     at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:162)
>>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:517)
>>     at org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
>> 2020-02-27 06:30:01.643 [main] level=INFO o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
>> 2020-02-27 06:30:01.655 [main] level=INFO o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
>> 2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-16] level=INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@FOO_BAR:6126/user/dispatcher was granted leadership with fencing token 113d78b5-6c33-401b-9f47-2f7a1d6dfefc
>> 2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-5] level=INFO o.a.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@FOO_BAR:6126/user/resourcemanager was granted leadership with fencing token a2c453481ea4e0c7722cab1e4dd741db
>> 2020-02-27 06:30:06.251 [main-EventThread] level=INFO org.apache.flink.runtime.jobmaster.JobManagerRunner - JobManager runner for job HHW_WEATHER_PROCESSOR (5a910928a71b469a091be168b0e74722) was granted leadership with session id 32637c78-cfd4-4e44-8c10-c551bac40742 at akka.tcp://flink@FOO_BAR:6126/user/jobmanager_0.
>>
>> My ZooKeeper knowledge is a bit limited, but I do notice from the logs
>> below that the ZooKeeper instance came back up and joined the quorum as a
>> follower before the granted-leadership logs above on the JobManager side:
>>
>> 2020-02-27 06:29:53,732 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@64] - FOLLOWING - LEADER ELECTION TOOK - 25069
>> 2020-02-27 06:29:53,766 [myid:1] - INFO [QuorumPeer[myid=1]/0.0.0.0:2181:Learner@332] - Getting a diff from the leader 0x200002bf6
>>
>> I will set up to reproduce this issue and get debug logs as well.
>>
>> But in the meantime, do the logs above confirm that ZooKeeper became
>> available around that time? I don't see any logs from the JobMaster
>> complaining about not being able to connect to ZooKeeper after that.
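>> One way to confirm quorum health from the ZK side directly, rather than
>> inferring it from the Flink logs, is ZooKeeper's four-letter-word commands
>> against each node. The hostnames are illustrative, and the commands are
>> enabled by default in ZK 3.4:
>>
>>     echo ruok | nc ZOO_BAR_0 2181               # "imok" means the server is up
>>     echo stat | nc ZOO_BAR_0 2181 | grep Mode   # leader/follower, i.e. serving in the quorum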
>> ~ Abhinav Bajaj
>>
>> From: "Bajaj, Abhinav" <abhinav.ba...@here.com>
>> Date: Wednesday, March 4, 2020 at 12:01 PM
>> To: Xintong Song <tonysong...@gmail.com>
>> Cc: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: JobMaster does not register with ResourceManager in high availability setup
>>
>> Thanks, Xintong, for pointing that out.
>>
>> I will dig deeper and get back with my findings.
>>
>> ~ Abhinav Bajaj
>>
>> From: Xintong Song <tonysong...@gmail.com>
>> Date: Tuesday, March 3, 2020 at 7:36 PM
>> To: "Bajaj, Abhinav" <abhinav.ba...@here.com>
>> Cc: "user@flink.apache.org" <user@flink.apache.org>
>> Subject: Re: JobMaster does not register with ResourceManager in high availability setup
>>
>> Hi Abhinav,
>>
>> The JobMaster log "Connecting to ResourceManager ..." is printed after the
>> JobMaster retrieves the ResourceManager address from ZooKeeper. In your
>> case, I assume there is some ZK problem such that the JM cannot resolve
>> the RM address.
>>
>> Have you confirmed whether the ZK pods recovered after the second
>> disruption? And did the address change?
>>
>> You can also try to enable debug logs for the following components, to see
>> if there is any useful information:
>>
>> org.apache.flink.runtime.jobmaster
>> org.apache.flink.runtime.resourcemanager
>> org.apache.flink.runtime.highavailability
>> org.apache.flink.runtime.leaderretrieval
>> org.apache.zookeeper
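>> With the log4j 1.x configuration that ships with Flink 1.7, that would
>> look roughly like this in conf/log4j.properties (adjust accordingly if
>> you use logback instead):
>>
>>     log4j.logger.org.apache.flink.runtime.jobmaster=DEBUG
>>     log4j.logger.org.apache.flink.runtime.resourcemanager=DEBUG
>>     log4j.logger.org.apache.flink.runtime.highavailability=DEBUG
>>     log4j.logger.org.apache.flink.runtime.leaderretrieval=DEBUG
>>     log4j.logger.org.apache.zookeeper=DEBUG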
>> Thank you~
>> Xintong Song
>>
>> On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav <abhinav.ba...@here.com> wrote:
>>
>> Hi,
>>
>> We recently came across an issue where the JobMaster does not register
>> with the ResourceManager in a Flink high availability setup. Let me share
>> the details below.
>>
>> Setup
>>
>> - Flink 1.7.1
>> - K8s
>> - High availability mode with a single JobManager and 3 ZooKeeper nodes
>>   in quorum.
>>
>> Scenario
>>
>> - ZooKeeper pods are disrupted by K8s, which leads to resetting of the
>>   leadership of the JobMaster and ResourceManager and a restart of the
>>   Flink job.
>>
>> Observations
>>
>> - After the first disruption of ZooKeeper, the JobMaster and
>>   ResourceManager leaderships were reset, and the two were able to
>>   register with each other. The Flink job restarted successfully. Sharing
>>   a few logs that confirm this:
>>
>>   org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager....
>>   o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager....
>>   o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager....
>>   org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager...
>>
>> - After another disruption later on the same Flink cluster, the JobMaster
>>   and ResourceManager were not connected; the logs below can be noticed,
>>   and eventually the scheduler times out:
>>
>>   org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected.
>>   .........
>>   org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms...
>>
>> - I can confirm from the logs that both the JobMaster and ResourceManager
>>   were running. The JobMaster was trying to recover the job, and the
>>   ResourceManager registered the TaskManagers.
>> - The odd thing is that the log for the JobMaster trying to connect to the
>>   ResourceManager is missing, so I assume the JobMaster didn't try to
>>   connect to the ResourceManager.
>>
>> I can share more logs if required.
>>
>> Has anyone noticed similar behavior, or is this a known issue with Flink
>> 1.7.1? Any recommendations or suggestions on a fix or workaround?
>>
>> Appreciate your time and help here.
>>
>> ~ Abhinav Bajaj