JobMaster does not register with ResourceManager in high availability setup

Bajaj, Abhinav Tue, 03 Mar 2020 13:43:22 -0800

Hi,

We recently came across an issue where the JobMaster does not register with 
the ResourceManager in a Flink high availability setup.
Let me share the details below.


Setup

  *   Flink 1.7.1
  *   K8s
  *   High availability mode with a single JobManager and 3 ZooKeeper nodes in 
quorum.

Scenario

  *   ZooKeeper pods are disrupted by K8s, which resets the leadership of the 
JobMaster & ResourceManager and restarts the Flink job.

Observations

  *   After the first disruption of ZooKeeper, the JobMaster and ResourceManager 
were reset & were able to register with each other. The logs below confirm 
that. The Flink job restarted successfully.

org.apache.flink.runtime.jobmaster.JobMaster  - Connecting to 
ResourceManager....

o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering job 
manager....

o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered job 
manager....

org.apache.flink.runtime.jobmaster.JobMaster  - JobManager successfully 
registered at ResourceManager...

  *   After another disruption later on the same Flink cluster, the JobMaster & 
ResourceManager did not connect. The logs below appear and the scheduler 
eventually times out.

org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot 
request, no ResourceManager connected.

       ………

        
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
Could not allocate all requires slots within timeout of 300000 ms……

  *   I can confirm from the logs that both the JobMaster & ResourceManager 
were running. The JobMaster was trying to recover the job and the 
ResourceManager registered the TaskManagers.
  *   The odd thing is that the log line for the JobMaster trying to connect 
to the ResourceManager is missing. So I assume the JobMaster never tried to 
connect to the ResourceManager.
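
For anyone who wants to check their own JobManager log for the same pattern, 
here is a minimal sketch. It only looks for the two message fragments quoted 
above and counts the "Cannot serve slot request" lines that appear after the 
last "Connecting to ResourceManager" line. The function name and the log file 
path are made up for illustration, and a non-zero count is only a hint, not 
proof of this issue.

```python
# Message fragments copied from the logs quoted in this mail.
CONNECTING = "Connecting to ResourceManager"
NO_RM = "Cannot serve slot request, no ResourceManager connected"


def unserved_after_last_connect(lines):
    """Count 'no ResourceManager connected' lines seen after the most
    recent 'Connecting to ResourceManager' line (hypothetical helper)."""
    unserved = 0
    for line in lines:
        if CONNECTING in line:
            # A new connection attempt was logged, so start counting again.
            unserved = 0
        elif NO_RM in line:
            unserved += 1
    return unserved


# Usage (the path is an assumption, adjust it to your deployment):
#   with open("jobmanager.log") as f:
#       print(unserved_after_last_connect(f))
```

A count that keeps growing with no later connection attempt would match what 
we saw after the second disruption.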

I can share more logs if required.

Has anyone noticed similar behavior or is this a known issue with Flink 1.7.1?
Any recommendations or suggestions for a fix or workaround?

Appreciate your time and help here.

~ Abhinav Bajaj
