[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
wgcn updated FLINK-20138:
-------------------------
    Description: 
Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager machines, and the ApplicationMasters (AMs) on those machines were restarted on other NodeManagers. We found that some jobs could not recover because their slot requests timed out.
 *SlotPoolImpl never connected to the ResourceManager*
```
+_
2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
_+
```
*1. We did not find any log of YarnResourceManager requesting containers in the attached jobmanager log.
2. The ZooKeeper node is also shown in the attachment.*

  was:
Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager machines, and the ApplicationMasters (AMs) on those machines were restarted on other NodeManagers. We found that some jobs could not recover because their slot requests timed out.
 SlotPoolImpl never connected to the ResourceManager
```
2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```
1. We did not find any log of YarnResourceManager requesting containers in the attached jobmanager log.
2. The ZooKeeper node is also shown in the attachment.

> Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20138
>                 URL: https://issues.apache.org/jira/browse/FLINK-20138
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Table SQL / Runtime
>         Environment: Flink: 1.9.2
> Hadoop: 2.7.2
> JDK: 1.8
>            Reporter: wgcn
>            Priority: Major
>         Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager machines, and the ApplicationMasters (AMs) on those machines were restarted on other NodeManagers. We found that some jobs could not recover because their slot requests timed out.
>  *SlotPoolImpl never connected to the ResourceManager*
> ```
> +_
> 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> _+
> ```
> *1. We did not find any log of YarnResourceManager requesting containers in the attached jobmanager log.
> 2. The ZooKeeper node is also shown in the attachment.*

--
This message was sent by Atlassian Jira
(v8.3.4#803005)