Hi,

We recently came across an issue where the JobMaster does not register with the ResourceManager in a Flink high availability setup. The details are below.

Setup
* Flink 1.7.1
* K8s
* High availability mode with a single JobManager and 3 ZooKeeper nodes in quorum.

Scenario
* ZooKeeper pods are disrupted by K8s, which resets the leadership of the JobMaster & ResourceManager and restarts the Flink job.

Observations
* After the first disruption of ZooKeeper, the JobMaster and ResourceManager were reset and were able to register with each other. A few logs that confirm this are below. The Flink job restarted successfully.

    org.apache.flink.runtime.jobmaster.JobMaster - Connecting to ResourceManager....
    o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registering job manager....
    o.a.flink.runtime.resourcemanager.StandaloneResourceManager - Registered job manager....
    org.apache.flink.runtime.jobmaster.JobMaster - JobManager successfully registered at ResourceManager...

* After another disruption later on the same Flink cluster, the JobMaster & ResourceManager did not connect. The logs below appear, and eventually the scheduler times out.

    org.apache.flink.runtime.jobmaster.slotpool.SlotPool - Cannot serve slot request, no ResourceManager connected.
    ………
    org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms……

* I can confirm from the logs that both the JobMaster & ResourceManager were running. The JobMaster was trying to recover the job, and the ResourceManager registered the TaskManagers.
* The odd thing is that the log line for the JobMaster trying to connect to the ResourceManager is missing. So I assume the JobMaster didn’t try to connect to the ResourceManager.

I can share more logs if required.

Has anyone noticed similar behavior, or is this a known issue with Flink 1.7.1? Any recommendations or suggestions for a fix or workaround?

Appreciate your time and help here.

~ Abhinav Bajaj