[jira] [Comment Edited] (FLINK-20138) Flink Job can not recover due to timeout of requiring slots when flink jobmanager restarted

wgcn (Jira) Sat, 14 Nov 2020 04:18:22 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-20138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17232000#comment-17232000
 ]


wgcn edited comment on FLINK-20138 at 11/14/20, 12:17 PM:
----------------------------------------------------------

[~trohrmann], the log line 'Connecting to ResourceManager xxxxx' does not 
appear in the jobmanager log (see 
[JobMaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). 
We guess that, in a bad network environment, the change event for the 
ResourceManager latch node in ZooKeeper was never delivered to the JobMaster. 
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] 
makes an improvement related to Curator. Will it be finished in the next version?
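
To make the guess above concrete, here is a minimal sketch of the suspected failure mode. It is a plain-Java model, not Flink or Curator code: the names LeaderNode, LeaderListener, Client, and the placeholder address are all hypothetical, and the dropEvents flag only simulates a change event that gets lost. It shows how the address can be present in the node while the client side never connects and keeps stashing slot requests.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

/** Simplified, hypothetical model; none of these types are Flink or Curator APIs. */
public class MissedLeaderNotificationSketch {

    /** Callback through which a client learns the current ResourceManager address. */
    interface LeaderListener {
        void notifyLeaderAddress(String address);
    }

    /** Stand-in for the latch node: stores the address and can drop change events. */
    static class LeaderNode {
        private String address;
        private final List<LeaderListener> listeners = new ArrayList<>();
        boolean dropEvents; // assumption: models a change event lost on a bad network

        void addListener(LeaderListener listener) {
            listeners.add(listener);
        }

        void publish(String newAddress) {
            address = newAddress; // the node content is always updated
            if (dropEvents) {
                return; // ...but the listeners are never told about it
            }
            for (LeaderListener listener : listeners) {
                listener.notifyLeaderAddress(newAddress);
            }
        }

        Optional<String> read() {
            return Optional.ofNullable(address);
        }
    }

    /** Stand-in for the JobMaster side: stashes slot requests until a leader is known. */
    static class Client implements LeaderListener {
        String connectedTo;
        final List<String> pending = new ArrayList<>();

        @Override
        public void notifyLeaderAddress(String address) {
            connectedTo = address;
            pending.clear(); // stashed requests could now be forwarded
        }

        void requestSlot(String requestId) {
            if (connectedTo == null) {
                pending.add(requestId); // no ResourceManager connected yet
            }
        }
    }

    public static void main(String[] args) {
        LeaderNode node = new LeaderNode();
        Client client = new Client();
        node.addListener(client);

        node.dropEvents = true;
        node.publish("rm-address-placeholder"); // hypothetical address
        client.requestSlot("request-1");

        System.out.println("address stored in node: " + node.read().isPresent());
        System.out.println("client connected: " + (client.connectedTo != null));
        System.out.println("pending slot requests: " + client.pending.size());
    }
}
```

In this model the client recovers only if it reads the node again by itself, which is why we suspect the missing notification rather than the node content.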


was (Author: 1026688210):
[~trohrmann], the log line 'Connecting to ResourceManager xxxxx' does not 
appear in the jobmanager log (see 
[JobMaster.connectToResourceManager|https://github.com/apache/flink/blob/c9d2c9098d725a2d39e860bde414ecb0c5d6a233/flink-runtime/src/main/java/org/apache/flink/runtime/jobmaster/JobMaster.java#L931-L936]). 
We guess that, in a bad network environment, the change event for the 
ResourceManager latch node in ZooKeeper was never delivered to the JobMaster. 
I found that [FLINK-10052|https://issues.apache.org/jira/browse/FLINK-10052] 
makes an improvement related to ZooKeeper. Will it be finished in the next version?

> Flink Job can not recover due to  timeout of requiring slots when flink 
> jobmanager restarted
> --------------------------------------------------------------------------------------------
>
>                 Key: FLINK-20138
>                 URL: https://issues.apache.org/jira/browse/FLINK-20138
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN, Table SQL / Runtime
>         Environment: flink : 1.9.2
> hadoop :2.7.2
> jdk:1.8
>            Reporter: wgcn
>            Priority: Major
>         Attachments: 2820F7EE-85F9-441D-95D5-8163FB6267DF.png, 
> jobmanager.log, zk_resource_address_info.png
>
>
> Our Flink jobs run on YARN in per-job mode. We stopped some NodeManager 
> machines, and the AMs on those machines restarted on other NodeManagers. We 
> found that some jobs could not recover because their slot requests timed out.
> *SlotPoolImpl never connected to the ResourceManager*
> ```
> 2020-11-09 16:31:31,794                           INFO 
> flink-akka.actor.default-dispatcher-16 
> (org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl.stashRequestWaitingForResourceManager:369)
>  - Cannot serve slot request, no ResourceManager connected. Adding as pending 
> request [SlotRequestId{456c9daa6670a4490810f8e51f495174}]
> ```
> *1. We did not find any log of YarnResourceManager requesting containers in 
> the attached jobmanager log. 
> 2. The ZooKeeper node is also shown in the attachments.*



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
