[ https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232000#comment-17232000 ]
wgcn edited comment on FLINK-20138 at 11/14/20, 12:13 PM:
----------------------------------------------------------

[~trohrmann], the log 'Connecting to ResourceManager xxxxx' does not appear in the jobmanager log (see [JobMaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess that, in a bad network environment, the change event of the ResourceManager latch node in ZooKeeper never reached the JobMaster. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement to the ZooKeeper handling. Will it be finished in the next version?


was (Author: 1026688210):
[~trohrmann], the log 'Connecting to ResourceManager xxxxx' does not appear in the jobmanager log (see [JobMaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). We guess that, in a bad network environment, the change event of the ResourceManager latch node in ZooKeeper never reached the JobMaster. I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] makes an improvement to the ZooKeeper handling. Will it be finished in the latest version?

> Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20138
>                 URL: https://issues.apache.org/jira/browse/FLINK-20138
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Table SQL / Runtime
>         Environment: flink: 1.9.2
> hadoop: 2.7.2
> jdk: 1.8
>            Reporter: wgcn
>            Priority: Major
>         Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log, zk_resource_address_info.png
>
>
> Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager machines, and the AMs on those machines restarted on other NodeManagers. We found that some jobs cannot recover because their slot requests time out.
> *SlotPoolImpl never connected to the ResourceManager*
> ```
> 2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1. We did not find any log of YarnResourceManager requesting containers in the attached jobmanager log.*
> *2. The ZooKeeper node is also shown in the attachment.*

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
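
The diagnosis in this thread rests on two log messages: the slot pool reports that no ResourceManager is connected, while 'Connecting to ResourceManager' never shows up. As a rough aid for checking other jobmanager logs for the same pattern, here is a minimal sketch. The class `RmConnectionLogCheck` and its methods are hypothetical helpers and not part of Flink; the marker strings are the message fragments quoted above, and the exact wording may differ between Flink versions.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink: scans a jobmanager log for the two
// message fragments quoted in this issue.
public class RmConnectionLogCheck {
    // Fragment the JobMaster is expected to log when it connects (assumed wording).
    static final String CONNECTING = "Connecting to ResourceManager";
    // Fragment logged by SlotPoolImpl when it stashes a slot request.
    static final String STASHED = "Cannot serve slot request, no ResourceManager connected";

    // Counts the lines that contain the given marker.
    static long count(List<String> lines, String marker) {
        return lines.stream().filter(line -> line.contains(marker)).count();
    }

    // True if slot requests were stashed but no connection attempt was logged.
    static boolean neverConnected(List<String> lines) {
        return count(lines, STASHED) > 0 && count(lines, CONNECTING) == 0;
    }

    public static void main(String[] args) throws IOException {
        // Read the log file given as the first argument; fall back to a sample line.
        List<String> lines = args.length > 0
                ? Files.readAllLines(Paths.get(args[0]))
                : Arrays.asList("INFO - Cannot serve slot request, no ResourceManager connected. "
                        + "Adding as pending request [SlotRequestId{...}]");
        System.out.println("stashed slot requests: " + count(lines, STASHED));
        System.out.println("connection attempts:   " + count(lines, CONNECTING));
        System.out.println("never connected:       " + neverConnected(lines));
    }
}
```

Running it as `java RmConnectionLogCheck jobmanager.log` prints both counts. A log that has stashed requests and zero connection attempts matches the symptom described in this issue, although it does not by itself show why the leader change event was missed.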