[jira] [Updated] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

wgcn (Jira) Thu, 12 Nov 2020 19:35:23 -0800


     [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


wgcn updated FLINK-20138:
-------------------------
    Description: 
Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager
machines, and the ApplicationMasters (AMs) on those machines restarted on
other NodeManagers. We found that some jobs cannot recover because their slot
requests time out.

*SlotPoolImpl never connects to the ResourceManager*
```
+_
2020-11-09 16:31:31,794                           INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
_+
```

*1. We did not find any log entry of YarnResourceManager requesting containers
in the attached jobmanager log.
2. The ZooKeeper node is also shown in the attachment.*



  was:
Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager
machines, and the ApplicationMasters (AMs) on those machines restarted on
other NodeManagers. We found that some jobs cannot recover because their slot
requests time out.

SlotPoolImpl never connects to the ResourceManager
```

2020-11-09 16:31:31,794                           INFO 
flink-akka.actor.default-dispatcher-16 
(org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
 - Cannot serve slot request, no ResourceManager connected. Adding as pending 
request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]

```

1. We did not find any log entry of YarnResourceManager requesting containers
in the attached jobmanager log.
2. The ZooKeeper node is also shown in the attachment.




> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20138
>                 URL: https://issues.apache.org/jira/browse/FLINK-20138
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Table SQL / Runtime
>         Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>            Reporter: wgcn
>            Priority: Major
>         Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, jobmanager.log
>
>
> Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager
> machines, and the ApplicationMasters (AMs) on those machines restarted on
> other NodeManagers. We found that some jobs cannot recover because their slot
> requests time out.
> *SlotPoolImpl never connects to the ResourceManager*
> ```
> +_
> 2020-11-09 16:31:31,794                           INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> _+
> ```
> *1. We did not find any log entry of YarnResourceManager requesting containers
> in the attached jobmanager log.
> 2. The ZooKeeper node is also shown in the attachment.*
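
To gauge how often slot requests were stashed in a log such as the attached
jobmanager.log, a small script along these lines may help. It is only a
sketch: the function name `stashed_slot_requests` is hypothetical, and the
pattern assumes the message text matches the excerpt quoted above (whitespace
and line wrapping are tolerated).

```python
import re
from collections import Counter

# Message text copied from the log excerpt in this issue; \s+ tolerates
# the line wrapping seen in the quoted excerpt.
STASHED_REQUEST = re.compile(
    r"Cannot serve slot request, no ResourceManager connected\.\s+"
    r"Adding as pending\s+request\s+\[SlotRequestId\{([0-9a-f]+)\}\]"
)


def stashed_slot_requests(log_text: str) -> Counter:
    """Count how often each SlotRequestId was stashed as pending
    because no ResourceManager was connected."""
    return Counter(STASHED_REQUEST.findall(log_text))
```

For example, `stashed_slot_requests(open("jobmanager.log").read())` returns a
counter keyed by SlotRequestId. Many distinct ids spread over a long time span
would be consistent with the JobMaster never connecting to the
ResourceManager, though the script alone cannot confirm the cause.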



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
