Re: JobMaster does not register with ResourceManager in high availability setup

tison Sun, 22 Mar 2020 23:24:08 -0700

Hi,

It seems the leader info has been published, but since DEBUG logging is not
turned on for

org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService

we can still only *guess* that the retrieval service in the JobMaster doesn't
get notified. I don't even see the INFO level log

Starting ZooKeeperLeaderRetrievalService ...

so I doubt whether the retrieval service started normally.
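
To narrow this down from an existing jobmanager.log, a small scanner can report
whether the retrieval service for the ResourceManager leader was started and
whether the JobMaster logged a connection attempt afterwards. This is only a
sketch: the marker strings are copied from log lines quoted in this thread,
while the class name, method name and verdict labels are made up for
illustration. Other components may log similar lines, so treat the verdict as a
hint rather than proof.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

/** Classifies a JobManager log by which leader-retrieval markers it contains. */
final class LeaderRetrievalCheck {
    // Marker texts copied from the log lines quoted in this thread.
    static final String RETRIEVAL_STARTED = "Starting ZooKeeperLeaderRetrievalService";
    static final String RM_LOCK_PATH = "resource_manager_lock";
    static final String CONNECTING_TO_RM = "Connecting to ResourceManager";

    // Hypothetical labels, not Flink terminology.
    enum Verdict { RETRIEVAL_NOT_STARTED, STARTED_BUT_NEVER_NOTIFIED, CONNECT_ATTEMPTED }

    static Verdict diagnose(List<String> logLines) {
        boolean started = false;
        for (String line : logLines) {
            if (line.contains(RETRIEVAL_STARTED) && line.contains(RM_LOCK_PATH)) {
                started = true;
            } else if (started && line.contains(CONNECTING_TO_RM)) {
                // Only a connection attempt logged after the start marker counts.
                return Verdict.CONNECT_ATTEMPTED;
            }
        }
        return started ? Verdict.STARTED_BUT_NEVER_NOTIFIED : Verdict.RETRIEVAL_NOT_STARTED;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: LeaderRetrievalCheck <path to jobmanager.log>");
            return;
        }
        System.out.println(diagnose(Files.readAllLines(Paths.get(args[0]))));
    }
}
```

A STARTED_BUT_NEVER_NOTIFIED verdict would match the guess above, while
RETRIEVAL_NOT_STARTED would point at the retrieval service itself.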

Best,
tison.


Bajaj, Abhinav <[email protected]> wrote on Mon, Mar 23, 2020 at 1:55 PM:

> Hi Yang, Tison,
>
>
>
> I think I was able to reproduce the issue with a simpler job with DEBUG logs
> enabled on the classes below –
>
> org.apache.flink.runtime.executiongraph.ExecutionGraph
>
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService
>
> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint
>
> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager
>
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher
>
> org.apache.flink.runtime.jobmaster.JobManagerRunner
>
> org.apache.flink.runtime.jobmaster.JobMaster
>
>
>
> I have attached the logs.
>
>
>
> I highly appreciate the help.
>
>
>
> ~ Abhinav Bajaj
>
>
>
> *From: *Yang Wang <[email protected]>
> *Date: *Wednesday, March 18, 2020 at 12:14 AM
> *To: *tison <[email protected]>
> *Cc: *Xintong Song <[email protected]>, "Bajaj, Abhinav" <
> [email protected]>, "[email protected]" <[email protected]>
> *Subject: *Re: JobMaster does not register with ResourceManager in high
> availability setup
>
>
>
> It seems that your ZooKeeper service is not stable. From the log I find
> that the resourcemanager leader is granted and the taskmanager could
> register with the resourcemanager successfully. That means the
> resourcemanager address has been published to ZK successfully.
>
>
>
> Also, a ZooKeeperLeaderRetrievalService has been started successfully for
> the newly started jobmaster. However, the ZK listener did not get notified
> of the new resourcemanager leader. So the jobmaster could not allocate
> resources from the resourcemanager and failed with
> "NoResourceAvailableException".
>
>
>
> Just like tison said, I think you need to provide the jobmanager log at
> DEBUG level. Or try to make the ZK service as stable as possible.
>
>
>
>
>
> Best,
>
> Yang
>
>
>
> tison <[email protected]> wrote on Wed, Mar 18, 2020 at 11:20 AM:
>
> Sorry, I mixed up the logs; that one belongs to a previous failure.
>
>
>
> Could you try to reproduce the problem with DEBUG level logs?
>
>
>
> From the log we know that the JM & RM had been elected as leaders but the
> listener didn't work. However, we don't know whether that is because the
> leader didn't publish the leader info or because the listener didn't get
> notified.
>
>
>
> Best,
>
> tison.
>
>
>
>
>
> tison <[email protected]> wrote on Wed, Mar 18, 2020 at 10:40 AM:
>
> Hi Abhinav,
>
>
>
> The problem is
>
>
>
> Curator: Background operation retry gave up
>
>
>
> So the ZK ensemble was too unstable to recover in time, and Curator
> stopped retrying and threw a fatal error.
>
>
>
> Best,
>
> tison.
>
>
>
>
>
> Xintong Song <[email protected]> wrote on Wed, Mar 18, 2020 at 10:22 AM:
>
> I'm not familiar with ZK either.
>
>
>
> I've copied Yang Wang, who might be able to provide some suggestions.
>
>
>
> Alternatively, you can try to post your question to the Apache ZooKeeper
> community and see whether they have any clues.
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Wed, Mar 18, 2020 at 8:12 AM Bajaj, Abhinav <[email protected]>
> wrote:
>
> Hi Xintong,
>
>
>
> I did check the Zk logs and didn’t notice anything interesting.
>
> I have limited expertise in zookeeper.
>
> Can you share an example of what I should be looking for in Zk?
>
>
>
> I was able to reproduce this issue again with Flink 1.7 by killing the
> zookeeper leader, which disrupted the quorum.
>
> The sequence of logs in this case looks quite similar to the one we have
> been discussing.
>
>
>
> If the code hasn’t changed in this area till 1.10 then maybe the latest
> version also has the potential issue.
>
>
>
> It's not straightforward to bump up the Flink version in the infrastructure
> available to me.
>
> But I will see whether there is a way around it.
>
>
>
> ~ Abhinav Bajaj
>
>
>
> *From: *Xintong Song <[email protected]>
> *Date: *Monday, March 16, 2020 at 8:00 PM
> *To: *"Bajaj, Abhinav" <[email protected]>
> *Cc: *"[email protected]" <[email protected]>
> *Subject: *Re: JobMaster does not register with ResourceManager in high
> availability setup
>
>
>
> Hi Abhinav,
>
>
>
> I think you are right. The log confirms that the JobMaster has not tried to
> connect to the ResourceManager. Most likely the JobMaster requested the RM
> address but never received it.
>
>
>
> I would suggest checking the ZK logs to see whether the request from the JM
> for the RM address has been received and properly answered.
>
>
>
> If you can easily reproduce this problem, and you are able to build Flink
> from source, you can also try to insert more logs in Flink to further
> confirm whether the RM address is received. I don't think that's necessary
> though, since that code has not changed from Flink 1.7 through the
> latest 1.10, and I'm not aware of any reported issue where the JM does not
> try to connect to the RM once the address is received.
>
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Tue, Mar 17, 2020 at 7:45 AM Bajaj, Abhinav <[email protected]>
> wrote:
>
> Hi Xintong,
>
>
>
> Apologies for the delayed response. I was away for a week.
>
> I am attaching more jobmanager logs.
>
>
>
> To your point on the taskmanagers, the job is deployed with parallelism 20,
> but it has 22 TMs so that 2 of them are spares to assist in quick failover.
>
> I did check the logs, and all 22 task executors from those TMs were
> registered by 2020-02-27 06:35:47.050.
>
>
>
> You will notice that even after this time, the job fails with the error
> “org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms. Slots
> required: 201, slots allocated: 0” at 2020-02-27 06:40:36.778.
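
The two timestamps above can be turned into a rough window. Assuming the
300000 ms timeout clock starts when the slots are requested (an assumption, the
log does not state this), the request would have been issued about ten seconds
before the last task executor registered, which leaves most of the five minutes
unused. A minimal Java sketch of that arithmetic follows; the class and method
names are hypothetical.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

/** Date arithmetic on the log timestamps quoted in this thread. */
final class SlotTimeoutWindow {
    // Timestamp layout used by the jobmanager log lines in this thread.
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss.SSS");

    /** When the slot request was issued, ASSUMING the timeout clock starts at the request. */
    static LocalDateTime requestStart(String failedAt, long timeoutMs) {
        return LocalDateTime.parse(failedAt, TS).minus(Duration.ofMillis(timeoutMs));
    }

    /** Time between the last TaskExecutor registration and the failure. */
    static Duration remainingAfter(String registeredAt, String failedAt) {
        return Duration.between(LocalDateTime.parse(registeredAt, TS),
                                LocalDateTime.parse(failedAt, TS));
    }

    public static void main(String[] args) {
        String failedAt = "2020-02-27 06:40:36.778";      // NoResourceAvailableException
        String allRegistered = "2020-02-27 06:35:47.050"; // all 22 task executors registered
        System.out.println("request issued at ~ " + requestStart(failedAt, 300_000));
        System.out.println("unused window: " + remainingAfter(allRegistered, failedAt));
    }
}
```

If that assumption holds, missing TaskExecutors are an unlikely explanation for
zero allocated slots.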
>
>
>
> Thanks a ton for your help.
>
>
>
> ~ Abhinav Bajaj
>
>
>
> *From: *Xintong Song <[email protected]>
> *Date: *Thursday, March 5, 2020 at 6:30 PM
> *To: *"Bajaj, Abhinav" <[email protected]>
> *Cc: *"[email protected]" <[email protected]>
> *Subject: *Re: JobMaster does not register with ResourceManager in high
> availability setup
>
>
>
> Hi Abhinav,
>
>
>
> Thanks for the log. However, the attached log seems to be incomplete.
> The NoResourceAvailableException cannot be found in this log.
>
>
>
> Regarding connecting to ResourceManager, the log suggests that:
>
>    - ZK was back to life and connected at 06:29:56.
>    2020-02-27 06:29:56.539 [main-EventThread] level=INFO
>     o.a.f.s.c.o.a.curator.framework.state.ConnectionStateManager  - State
>    change: CONNECTED
>    - RM registered to ZK and was granted leadership at 06:30:01.
>    2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-5]
>    level=INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  -
>    ResourceManager akka.tcp://flink@JOBMANAGER:6126/user/resourcemanager
>    was granted leadership with fencing token a2c453481ea4e0c7722cab1e4dd741db
>    - JM requests RM leader address from ZK at 06:30:06.
>    2020-02-27 06:30:06.272 [flink-akka.actor.default-dispatcher-17]
>    level=INFO  o.a.f.r.leaderretrieval.ZooKeeperLeaderRetrievalService  -
>    Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
>    - The RM leader address will be notified asynchronously, and only
>    after that will the JM try to connect to the RM (printing the "Connecting
>    to ResourceManager" log). The attached log ends about 100 ms after the JM
>    requests the RM leader address, which is too short a window to tell
>    whether the RM gets connected properly.
>
> Another finding is about the TM registration. According to the log:
>
>    - The parallelism of your job is 20, which means it needs 20 slots to
>    be executed.
>    - There are only 5 TMs registered. (Searching for "Registering
>    TaskManager with ResourceID")
>    - Assuming you have the same configurations for JM and TMs (this might
>    not always be true), you have one slot per TM.
>    599 2020-02-27 06:28:56.495 [main] level=INFO
>     org.apache.flink.configuration.GlobalConfiguration  - Loading
>    configuration property: taskmanager.numberOfTaskSlots, 1
>    - That suggests that it is possible that not all the TaskExecutors are
>    recovered/reconnected, leading to the NoResourceAvailableException. We
>    would need the rest of the log (from where the current one ends to
>    the NoResourceAvailableException) to tell what happened during the
>    scheduling. Also, could you confirm how many TMs you use?
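
The slot arithmetic in the list above can be checked mechanically against a
log. The sketch below counts distinct TaskManager registrations and compares
the resulting capacity with the number of slots the job needs. Only the marker
prefix is taken from this thread; treating the token after it as the resource
ID is an assumption, as are the class and method names.

```java
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Compares registered TaskManager capacity with the slots a job needs. */
final class SlotCapacityCheck {
    // Search string quoted above; the token that follows is ASSUMED to be the resource ID.
    static final String MARKER = "Registering TaskManager with ResourceID ";

    /** Collects the distinct tokens that follow the marker in matching log lines. */
    static Set<String> registeredTaskManagers(List<String> logLines) {
        Set<String> ids = new TreeSet<>();
        for (String line : logLines) {
            int at = line.indexOf(MARKER);
            if (at < 0) {
                continue;
            }
            String rest = line.substring(at + MARKER.length()).trim();
            if (!rest.isEmpty()) {
                ids.add(rest.split("\\s+")[0]);
            }
        }
        return ids;
    }

    /** True if the registered TaskManagers offer at least the required number of slots. */
    static boolean enoughSlots(int taskManagers, int slotsPerTaskManager, int requiredSlots) {
        return (long) taskManagers * slotsPerTaskManager >= requiredSlots;
    }
}
```

With the numbers from the list, `enoughSlots(5, 1, 20)` is false, which is why
the rest of the log and the real TM count matter.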
>
>
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Fri, Mar 6, 2020 at 5:55 AM Bajaj, Abhinav <[email protected]>
> wrote:
>
> Hi Xintong,
>
>
>
> Highly appreciate your assistance here.
>
> I am attaching the jobmanager log for reference.
>
>
>
> Let me share my quick responses on what you mentioned.
>
>
>
>
>
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot
> request, no ResourceManager connected.
>
> *XS: Sometimes you see this log because the ResourceManager is not yet
> connected when the slot request arrives at the SlotPool. If the
> ResourceManager is connected later, the SlotPool will still send the pending
> slot requests; in that case you should find logs for the SlotPool requesting
> slots from the ResourceManager.*
>
>
>
> *AB*: Yes, I have noticed that behavior in scenarios where
> resourcemanager and jobmanager are connected successfully. The requests
> fail initially and they are served later when they are connected.
>
> I don’t think that happened in this case. But you have access to the
> jobmanager logs to check my understanding.
>
>
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms……
>
> *XS: This error message simply means that the slot requests are not
> satisfied in 5min. Various reasons might cause this problem.*
>
>    - *The ResourceManager is not connected at all.*
>
>
>    - *AB*: I think the ResourceManager is not connected to the JobMaster or
>       vice versa. My basis is the *absence* of the logs below –
>
>
>          - org.apache.flink.runtime.jobmaster.JobMaster  - Connecting to
>          ResourceManager....
>          - o.a.flink.runtime.resourcemanager.StandaloneResourceManager  -
>          Registering job manager....
>
>
>    - *The ResourceManager is connected, but some TaskExecutors are not
>    registered due to the ZK problem. *
>
>
>    - *AB*: I think the Task Executors were able to register or were in
>       the process of registering with ResourceManager.
>
>
>    - *ZK recovery takes too much time, so that despite all JM, RM, TMs
>    are able to connect to the ZK there might not be enough time to satisfy the
>    slot request before the timeout.*
>
>
>    - *AB*: To help check that, maybe you can use these log timestamps
>
>
>    - 2020-02-27 06:29:53,732 [myid:1] - INFO
>          [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@64] - FOLLOWING -
>          LEADER ELECTION TOOK - 25069
>          - 2020-02-27 06:29:53,766 [myid:1] - INFO
>          [QuorumPeer[myid=1]/0.0.0.0:2181:Learner@332] - Getting a diff
>          from the leader 0x200002bf6
>
> Thanks a lot for looking into this.
>
> ~ Abhinav Bajaj
>
>
>
>
>
> *From: *Xintong Song <[email protected]>
> *Date: *Wednesday, March 4, 2020 at 7:17 PM
> *To: *"Bajaj, Abhinav" <[email protected]>
> *Cc: *"[email protected]" <[email protected]>
> *Subject: *Re: JobMaster does not register with ResourceManager in high
> availability setup
>
>
>
> Hi Abhinav,
>
>
>
> Do you mind sharing the complete 'jobmanager.log'?
>
>
>
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot
> request, no ResourceManager connected.
>
> Sometimes you see this log because the ResourceManager is not yet connected
> when the slot request arrives at the SlotPool. If the ResourceManager is
> connected later, the SlotPool will still send the pending slot requests; in
> that case you should find logs for the SlotPool requesting slots from the
> ResourceManager.
>
>
>
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Could not allocate all requires slots within timeout of 300000 ms……
>
> This error message simply means that the slot requests are not satisfied
> in 5min. Various reasons might cause this problem.
>
>    - The ResourceManager is not connected at all.
>    - The ResourceManager is connected, but some TaskExecutors are not
>    registered due to the ZK problem.
>    - ZK recovery takes too much time, so that despite all JM, RM, TMs are
>    able to connect to the ZK there might not be enough time to satisfy the
>    slot request before the timeout.
>
> We would need the complete 'jobmanager.log' (at least the part from the job
> restart to the NoResourceAvailableException) to find out which is the case.
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Thu, Mar 5, 2020 at 7:30 AM Bajaj, Abhinav <[email protected]>
> wrote:
>
> While I set up to reproduce the issue with debug logs, I would like to
> share more information I noticed in the INFO logs.
>
>
>
> Below is the sequence of events/exceptions I notice during the time
> zookeeper was disrupted.
>
> I apologize in advance as they are a bit verbose.
>
>
>
>    - Zookeeper seems to be down and leader election is disrupted –
>
>
>
> ·         2020-02-27 06:28:23.572 [Curator-ConnectionStateManager-0]
> level=WARN  o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  -
> Connection to ZooKeeper suspended. The contender 
> akka.tcp://flink@FOO_BAR:6126/user/resourcemanager
> no longer participates in the leader election.
>
> ·         2020-02-27 06:28:23.573 [Curator-ConnectionStateManager-0]
> level=INFO  org.apache.flink.runtime.jobmaster.JobManagerRunner  -
> JobManager for job FOO_BAR (5a910928a71b469a091be168b0e74722) was revoked
> leadership at akka.tcp://flink@ FOO_BAR:6126/user/jobmanager_1.
>
> ·         2020-02-27 06:28:23.573
> [flink-akka.actor.default-dispatcher-9897] level=INFO
> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  -
> ResourceManager akka.tcp://flink@ FOO_BAR:6126/user/resourcemanager was
> revoked leadership. Clearing fencing token.
>
> ·         2020-02-27 06:28:23.574
> [flink-akka.actor.default-dispatcher-9897] level=INFO
> o.a.flink.runtime.resourcemanager.slotmanager.SlotManager  - Suspending the
> SlotManager.
>
> ·         2020-02-27 06:28:53.577 [Curator-Framework-0] level=ERROR
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Fatal error
> occurred in the cluster entrypoint.
>
> org.apache.flink.runtime.dispatcher.DispatcherException: Received an error
> from the LeaderElectionService.
>
>         at
> org.apache.flink.runtime.dispatcher.Dispatcher.handleError(Dispatcher.java:941)
>
>         at
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.unhandledError(ZooKeeperLeaderElectionService.java:416)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:576)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$6.apply(CuratorFrameworkImpl.java:572)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.logError(CuratorFrameworkImpl.java:571)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:740)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:857)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
>
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>
>         at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>
>         at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>
>         at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>
>         at java.lang.Thread.run(Thread.java:748)
>
> Caused by: org.apache.flink.util.FlinkException: Unhandled error in
> ZooKeeperLeaderElectionService: Background operation retry gave up
>
>         ... 18 common frames omitted
>
> Caused by:
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss
>
>         at
> org.apache.flink.shaded.zookeeper.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:728)
>
>         ... 10 common frames omitted
>
>
>
>    - ClusterEntrypoint restarts and tries to connect to Zookeeper. It
>    seems it fails for some time but is able to connect later -
>
>
>
> ·         2020-02-27 06:28:56.467 [main] level=INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -  Starting
> StandaloneSessionClusterEntrypoint (Version: 1.7.1, Rev:89eafb4,
> Date:14.12.2018 @ 15:48:34 GMT)
>
> ·         2020-02-27 06:29:16.477 [main] level=ERROR
> o.a.flink.shaded.curator.org.apache.curator.ConnectionState  - Connection
> timed out for connection string (ZOO_BAR_0:2181, ZOO_BAR_1:2181,
> ZOO_BAR_2:2181) and timeout (15000) / elapsed (15969)
>
> org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException:
> KeeperErrorCode = ConnectionLoss
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl$1.call(NamespaceImpl.java:90)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.NamespaceImpl.fixForNamespace(NamespaceImpl.java:83)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.fixForNamespace(CuratorFrameworkImpl.java:594)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:158)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.imps.ExistsBuilderImpl.forPath(ExistsBuilderImpl.java:32)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.reset(NodeCache.java:242)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:175)
>
>         at
> org.apache.flink.shaded.curator.org.apache.curator.framework.recipes.cache.NodeCache.start(NodeCache.java:154)
>
>         at
> org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService.start(ZooKeeperLeaderElectionService.java:134)
>
>         at
> org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.startInternal(WebMonitorEndpoint.java:712)
>
>         at
> org.apache.flink.runtime.rest.RestServerEndpoint.start(RestServerEndpoint.java:218)
>
>         at
> org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:145)
>
>         at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:215)
>
>         at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:163)
>
>         at
> org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>
>         at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:162)
>
>         at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:517)
>
>         at
> org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
>
> ·         2020-02-27 06:30:01.643 [main] level=INFO
> o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting
> ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
>
> ·         2020-02-27 06:30:01.655 [main] level=INFO
> o.a.f.runtime.leaderelection.ZooKeeperLeaderElectionService  - Starting
> ZooKeeperLeaderElectionService
> ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
>
> ·         2020-02-27 06:30:01.677
> [flink-akka.actor.default-dispatcher-16] level=INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Dispatcher
> akka.tcp://flink@FOO_BAR:6126/user/dispatcher was granted leadership with
> fencing token 113d78b5-6c33-401b-9f47-2f7a1d6dfefc
>
> ·         2020-02-27 06:30:01.677 [flink-akka.actor.default-dispatcher-5]
> level=INFO  o.a.flink.runtime.resourcemanager.StandaloneResourceManager  -
> ResourceManager akka.tcp://flink@FOO_BAR:6126/user/resourcemanager was
> granted leadership with fencing token a2c453481ea4e0c7722cab1e4dd741db
>
> ·         2020-02-27 06:30:06.251 [main-EventThread] level=INFO
> org.apache.flink.runtime.jobmaster.JobManagerRunner  - JobManager runner
> for job HHW_WEATHER_PROCESSOR (5a910928a71b469a091be168b0e74722) was
> granted leadership with session id 32637c78-cfd4-4e44-8c10-c551bac40742 at
> akka.tcp://flink@FOO_BAR:6126/user/jobmanager_0.
>
>
>
> My zookeeper knowledge is a bit limited, but I do notice from the log below
> that a zookeeper instance came back up and joined the quorum as a follower
> before the highlighted logs above on the jobmanager side.
>
> ·         2020-02-27 06:29:53,732 [myid:1] - INFO
> [QuorumPeer[myid=1]/0.0.0.0:2181:Follower@64] - FOLLOWING - LEADER
> ELECTION TOOK - 25069
>
> ·         2020-02-27 06:29:53,766 [myid:1] - INFO
> [QuorumPeer[myid=1]/0.0.0.0:2181:Learner@332] - Getting a diff from the
> leader 0x200002bf6
>
>
>
>
>
> I will set up to reproduce this issue and get debug logs as well.
>
>
>
> But in the meantime, do the highlighted logs above confirm that zookeeper
> became available around that time?
>
> I don’t see any logs from JobMaster complaining about not being able to
> connect to zookeeper after that.
>
>
>
> ~ Abhinav Bajaj
>
>
>
> *From: *"Bajaj, Abhinav" <[email protected]>
> *Date: *Wednesday, March 4, 2020 at 12:01 PM
> *To: *Xintong Song <[email protected]>
> *Cc: *"[email protected]" <[email protected]>
> *Subject: *Re: JobMaster does not register with ResourceManager in high
> availability setup
>
>
>
> Thanks Xintong for pointing that out.
>
>
>
> I will dig deeper and get back with my findings.
>
>
>
> ~ Abhinav Bajaj
>
>
>
> *From: *Xintong Song <[email protected]>
> *Date: *Tuesday, March 3, 2020 at 7:36 PM
> *To: *"Bajaj, Abhinav" <[email protected]>
> *Cc: *"[email protected]" <[email protected]>
> *Subject: *Re: JobMaster does not register with ResourceManager in high
> availability setup
>
>
>
> Hi Abhinav,
>
>
> The JobMaster log "Connecting to ResourceManager ..." is printed after the
> JobMaster retrieves the ResourceManager address from ZooKeeper. In your case,
> I assume there's some ZK problem that prevents the JM from resolving the RM
> address.
>
>
>
> Have you confirmed whether the ZK pods recovered after the second
> disruption? And did the address change?
>
>
>
> You can also try to enable debug logs for the following components, to see
> if there's any useful information.
>
> org.apache.flink.runtime.jobmaster
>
> org.apache.flink.runtime.resourcemanager
>
> org.apache.flink.runtime.highavailability
>
> org.apache.flink.runtime.leaderretrieval
>
> org.apache.zookeeper
>
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
>
> On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav <[email protected]>
> wrote:
>
> Hi,
>
>
>
> We recently came across an issue where JobMaster does not register with
> ResourceManager in a Flink high availability setup.
>
> Let me share the details below.
>
>
>
> *Setup*
>
>    - Flink 1.7.1
>    - K8s
>    - High availability mode with a *single* Jobmanager and 3 zookeeper
>    nodes in quorum.
>
>
>
> *Scenario*
>
>    - Zookeeper pods are disrupted by K8s, which leads to a reset of the
>    leadership of JobMaster & ResourceManager and a restart of the Flink job.
>
>
>
> *Observations*
>
>    - After the first disruption of Zookeeper, JobMaster and
>    ResourceManager were reset & were able to register with each other. Sharing
>    a few logs that confirm that. The Flink job restarted successfully.
>
> org.apache.flink.runtime.jobmaster.JobMaster  - Connecting to
> ResourceManager....
>
> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering
> job manager....
>
> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered
> job manager....
>
> org.apache.flink.runtime.jobmaster.JobMaster  - JobManager successfully
> registered at ResourceManager...
>
>    -  After another disruption later on the same Flink cluster, JobMaster
>    & ResourceManager were not connected. The logs below can be noticed, and
>    eventually the scheduler times out.
>
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot
> request, no ResourceManager connected.
>
>        ………
>
>         
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate all requires slots within timeout of 300000 ms……
>
>
>    - I can confirm from the logs that both JobMaster & ResourceManager
>    were running. JobMaster was trying to recover the job and ResourceManager
>    registered the taskmanagers.
>    - The odd thing is that the log for JobMaster trying to connect to
>    ResourceManager is missing. So I assume JobMaster didn’t try to connect to
>    ResourceManager.
>
>
>
> I can share more logs if required.
>
>
>
> Has anyone noticed similar behavior or is this a known issue with Flink
> 1.7.1?
>
> Any recommendations or suggestions on fix or workaround?
>
>
>
> Appreciate your time and help here.
>
>
>
> ~ Abhinav Bajaj
>
>
>
>
>
>
