Chenyu Zheng created FLINK-27350:
------------------------------------

             Summary: JobManager doesn't bring up new TaskManager during 
failure recovery
                 Key: FLINK-27350
                 URL: https://issues.apache.org/jira/browse/FLINK-27350
             Project: Flink
          Issue Type: Bug
            Reporter: Chenyu Zheng
         Attachments: jobmanager.log, 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10.log

I ran into a strange bug during Flink failure recovery. It seems the JobManager 
doesn't bring up a new TaskManager during failure recovery. Some logs and 
information about the Flink job are pasted below. Could you take a look and give 
me some guidance? Thank you so much!

 

Flink version: 1.13.2

Deploy mode: K8s native

Timeline of the bug:
 # The Flink job started running with 8 TaskManagers.
 # At *2022-04-17 00:28:15,286*, the job hit an error and the JobManager 
decided to restart 2 tasks (pods 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1, 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
 # The two old pods were stopped and the JobManager created 2 new pods (pods 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9, 
stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at 
*2022-04-17 00:33:15,376*.
 # The JobManager discarded the two new pods’ registrations at 
*2022-04-17 00:33:32,393*.
 # These new pods exited at *2022-04-17 00:33:32,396* because their 
registration was rejected.
 # The JobManager didn’t bring up any new pods and kept printing the error 
“Slot request bulk is not fulfillable! Could not allocate the required slot 
within slot request timeout” over and over.
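
To make the timing between the events in this timeline easier to see, here is a 
small Python sketch that computes the gap between consecutive timestamps. The 
event labels and the helper names (parse_ts, gaps) are my own and purely 
illustrative; only the timestamps come from the logs.

```python
from datetime import datetime

# Timestamps copied from the timeline above. The labels are informal
# descriptions chosen for this sketch, not strings from the Flink logs.
EVENTS = [
    ("restart decided", "2022-04-17 00:28:15,286"),
    ("new pods created", "2022-04-17 00:33:15,376"),
    ("registration discarded", "2022-04-17 00:33:32,393"),
    ("new pods exited", "2022-04-17 00:33:32,396"),
]


def parse_ts(ts):
    """Parse a log timestamp of the form 'YYYY-MM-DD HH:MM:SS,mmm'."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


def gaps(events):
    """Return (from_label, to_label, seconds) for each consecutive pair."""
    parsed = [(label, parse_ts(ts)) for label, ts in events]
    return [
        (a_label, b_label, (b_time - a_time).total_seconds())
        for (a_label, a_time), (b_label, b_time) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for start, end, seconds in gaps(EVENTS):
        print(f"{start} -> {end}: {seconds:.3f} s")
```

With the timestamps above, this gives roughly 300 s between the restart 
decision and the creation of the new pods, and about 17 s between pod creation 
and the discarded registration. I have not confirmed which setting, if any, the 
~300 s gap corresponds to.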



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
