After more debugging, I think this issue is same as FLINK-24315[1], which is fixed in 1.13.3.
[1]. https://issues.apache.org/jira/browse/FLINK-24315 Best, Yang Zheng, Chenyu <chenyu.zh...@disneystreaming.com> 于2022年4月22日周五 18:27写道: > I created a JIRA ticket https://issues.apache.org/jira/browse/FLINK-27350 > to track this issue. > > > > BRs, > > Chenyu > > > > *From: *"Zheng, Chenyu" <chenyu.zh...@disneystreaming.com> > *Date: *Friday, April 22, 2022 at 6:26 PM > *To: *Yang Wang <danrtsey...@gmail.com> > *Cc: *"user@flink.apache.org" <user@flink.apache.org>, " > user...@flink.apache.org" <user...@flink.apache.org> > *Subject: *Re: JobManager doesn't bring up new TaskManager during failure > recovery > > > > Thank you, Yang! > > > > In fact I have a fine-grained dashboard for Kubernetes cluster health > (like apiserver qps/latency etc.), and I didn't find anything unusual… > Also, the JobManager container cpu/memory usage is low. > > > > Besides, I have a deep dive in these logs and Flink resource manager code, > and find something interesting. I use taskmanager-1-9 to give you an > example: > > 1. I can see logs “Requesting new worker with resource spec > WorkerResourceSpec” at 2022-04-17 00:33:15,333. And the code location is > here > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L283&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=OlH3iQ6OR4rjrRodaG38AihihsR9d7Fy1pqosGaBpqg%3D&reserved=0> > . > 2. “Creating new TaskManager pod with name > stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and resource > <16384,4.0>” at 2022-04-17 00:33:15,376, code location > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-kubernetes%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fkubernetes%2FKubernetesResourceManagerDriver.java%23L167&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=hIavko7ONdrzzC3icwg2rPfIJM7oRDBlToKpd1A3b30%3D&reserved=0> > . > 3. “Pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 is > created.” at 2022-04-17 00:33:15,412, code location > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-kubernetes%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fkubernetes%2FKubernetesResourceManagerDriver.java%23L190&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uzWc65ZqnAcguJBlodWtiz6yoahV0TdAYPq95JMRV0A%3D&reserved=0>. > *The request is sent and pod is created here, so I think the apiserver > is healthy at that moment.* > 4. But I cannot find any logs that print in line > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L301&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=uhvFoCWiQtlnRHu86bczN8J%2Btpq9H1QggZFZl%2FC%2BlAQ%3D&reserved=0> > and line > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L314&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DY3JuIuu947uM9yCTq%2FKfY3jmVIJ8gS8SkzRP7O%2BLVA%3D&reserved=0> > . > 5. “Discard registration from TaskExecutor > stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9” at 2022-04-17 > 00:33:32,393. Root cause of this logs is due to the workerNodeMap > > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L81&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=5l9zvFmy0qUS9W5pI4nDPv9CMCSpiDiMEjCri6J5prw%3D&reserved=0> > isn’t put a ResourceId that linked with taskmanager-1-9. > > That’s why I think things are strange here. Flink would put the ResourceId > to workerNodeMap here > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L310&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=EBfrXuSGeOwtzr%2BNs%2B%2BpOXkRpy8AlRXE1KDGvx8P06M%3D&reserved=0>. > *The code didn’t execute, although its Java future condition is reached > and fulfilled. And I don’t see any code that related to Kubernetes events > in this piece of logic.* > > > > By the way, in our expectation, would JobManager create new TaskManager in > that case? > > > > BRs, > > Chenyu > > > > *From: *Yang Wang <danrtsey...@gmail.com> > *Date: *Friday, April 22, 2022 at 4:49 PM > *To: *"Zheng, Chenyu" <chenyu.zh...@disneystreaming.com> > *Cc: *"user@flink.apache.org" <user@flink.apache.org>, " > user...@flink.apache.org" <user...@flink.apache.org> > *Subject: *Re: JobManager doesn't bring up new TaskManager during failure > recovery > > > > The root cause might be you APIServer is overloaded or not running > normally. And then all the pods events of > > taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in > FlinkResourceManager. > > So the two taskmanagers are not recognized by ResourceManager and then > registration are rejected. > > > > The ResourceManager also did not receive the terminated pod events. That's > why it does not allocate new TaskManager pods. > > > > All in all, I believe you need to check the K8s APIServer status. > > > > Best, > > Yang > > > > Zheng, Chenyu <chenyu.zh...@disneystreaming.com> 于2022年4月22日周五 12:54写道: > > Hi developers! > > > > I got a strange bug during failure recovery of Flink. It seems the > JobManager doesn't bring up new TaskManager during failure recovery. Some > logs and information of the Flink job are pasted below. Can you take a look > and give me some guidance? Thank you so much! > > > > Flink version: 1.13.2 > > Deploy mode: K8s native > > Timeline of the bug: > > 1. Flink job start to work with 8 taskmanagers. > 2. At *2022-04-17 00:28:15,286*, this job got an error and JobManager > decided to restart 2 tasks (pod > stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1, > stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7) > 3. The two old pod is stopped and JobManager created 2 pod (pod > stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9, > stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17 > 00:33:15,376* > 4. JobManager discard two new pods’ registration at *2022-04-17 > 00:33:32,393* > 5. These new pods exited at *2022-04-17 00:33:32,396*, due to the > rejection of registration. > 6. JobManager didn’t bring up new pods and print error “Slot request > bulk is not fulfillable! Could not allocate the required slot within slot > request timeout” over and over > > Flink logs: > > 1. JobManager: > https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ%2Fview%3Fusp%3Dsharing&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=zAah4N%2BeDQrtY6qo9nEgut1KhbZC%2F8sK8YHn%2BNeKpzk%3D&reserved=0> > > 2. TaskManager: > https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing > <https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn%2Fview%3Fusp%3Dsharing&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cbaa22ad99dd0423f8fa808da244a8dc2%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637862199881344449%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=govUqurtZA%2Ff5IxMOZml0VJ7lYHCq4R4YbVpsXzOChw%3D&reserved=0> > > > > > > BRs, > > Chenyu > >