Hi Abhinav,

The JobMaster log "Connecting to ResourceManager ..." is printed after
JobMaster retrieve ResourceManager address from ZooKeeper. In your case, I
assume there's some ZK problem that JM cannot resolve RM address.


Have you confirmed whether the ZK pods are recovered after the second
disruption? And does the address changed?


You can also try to enable debug logs for the following components, to see
if there's any useful information.

org.apache.flink.runtime.jobmaster

org.apache.flink.runtime.resourcemanager

org.apache.flink.runtime.highavailability

org.apache.flink.runtime.leaderretrieval

org.apache.zookeeper


Thank you~

Xintong Song



On Wed, Mar 4, 2020 at 5:42 AM Bajaj, Abhinav <abhinav.ba...@here.com>
wrote:

> Hi,
>
>
>
> We recently came across an issue where JobMaster does not register with
> ResourceManager in Fink high availability setup.
>
> Let me share the details below.
>
>
>
> *Setup*
>
>    - Flink 1.7.1
>    - K8s
>    - High availability mode with a *single* Jobmanager and 3 zookeeper
>    nodes in quorum.
>
>
>
> *Scenario*
>
>    - Zookeeper pods are disrupted by K8s that leads to resetting of
>    leadership of JobMaster & ResourceManager and restart of the Flink job.
>
>
>
> *Observations*
>
>    - After the first disruption of Zookeeper, JobMaster and
>    ResourceManager were reset & were able to register with each other. Sharing
>    few logs that confirm that. Flink job restarted successfully.
>
> org.apache.flink.runtime.jobmaster.JobMaster  - Connecting to
> ResourceManager....
>
> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registering
> job manager....
>
> o.a.flink.runtime.resourcemanager.StandaloneResourceManager  - Registered
> job manager....
>
> org.apache.flink.runtime.jobmaster.JobMaster  - JobManager successfully
> registered at ResourceManager...
>
>    -  After another disruption later on the same Flink cluster, JobMaster
>    & ResourceManager were not connected and below logs can be noticed and
>    eventually scheduler times out.
>
> org.apache.flink.runtime.jobmaster.slotpool.SlotPool  - Cannot serve slot
> request, no ResourceManager connected.
>
>        ………
>
>         
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: 
> Could not allocate all requires slots within timeout of 300000 ms……
>
>
>    - I can confirm from the logs that both JobMaster & ResourceManager
>    were running. JobMaster was trying to recover the job and ResourceManager
>    registered the taskmanagers.
>    - The odd thing is that the log for JobMaster trying to connect to
>    ResourceManager is missing. So I assume JobMaster didn’t try to connect to
>    ResourceManager.
>
>
>
> I can share more logs if required.
>
>
>
> Has anyone noticed similar behavior or is this a known issue with Flink
> 1.7.1?
>
> Any recommendations or suggestions on fix or workaround?
>
>
>
> Appreciate your time and help here.
>
>
>
> ~ Abhinav Bajaj
>
>
>
>
>

Reply via email to