wgcn created FLINK-20138:
----------------------------

             Summary: Flink Job can not recover due to  timeout of requiring 
slots when flink jobmanager restarted
                 Key: FLINK-20138
                 URL: https://issues.apache.org/jira/browse/FLINK-20138
             Project: Flink
          Issue Type: Bug
          Components: Deployment / YARN, Table SQL / Runtime
         Environment: Flink: 1.9.2
Hadoop: 2.7.2
JDK: 1.8
            Reporter: wgcn
         Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png

Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager
machines, and the AMs on those machines restarted on other NodeManagers. We
found that some jobs cannot recover because their slot requests time out.

SlotPoolImpl never connected to the ResourceManager:
```
2020-11-09 16:31:31,794 INFO flink-akka.actor.default-dispatcher-16 (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369) - Cannot serve slot request, no ResourceManager connected. Adding as pending request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
```

1. The attached JobManager log contains no entry showing YarnResourceManager
requesting containers.
2. The ZooKeeper node is also shown in the attachment.





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
