Thank you so much, Yang! It looks very likely that this issue is causing the bug I met. I’ll upgrade my Flink version and test if I can reproduce that bug.
BRs, Chenyu From: Yang Wang <danrtsey...@gmail.com> Date: Sunday, April 24, 2022 at 10:20 AM To: "Zheng, Chenyu" <chenyu.zh...@disneystreaming.com> Cc: "user@flink.apache.org" <user@flink.apache.org>, "user...@flink.apache.org" <user...@flink.apache.org> Subject: Re: JobManager doesn't bring up new TaskManager during failure recovery After more debugging, I think this issue is same as FLINK-24315[1], which is fixed in 1.13.3. [1]. https://issues.apache.org/jira/browse/FLINK-24315<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FFLINK-24315&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=pa4bmwIs4H2zFDOi7pnVIrseEMp%2F%2BDco%2FBvCS%2B2KtVs%3D&reserved=0> Best, Yang Zheng, Chenyu <chenyu.zh...@disneystreaming.com<mailto:chenyu.zh...@disneystreaming.com>> 于2022年4月22日周五 18:27写道: I created a JIRA ticket https://issues.apache.org/jira/browse/FLINK-27350<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FFLINK-27350&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=kmbNCAZcyWlDEZ3zoT0rMvn%2B%2BkH3lsrBl9gWD5MCv%2B4%3D&reserved=0> to track this issue. BRs, Chenyu From: "Zheng, Chenyu" <chenyu.zh...@disneystreaming.com<mailto:chenyu.zh...@disneystreaming.com>> Date: Friday, April 22, 2022 at 6:26 PM To: Yang Wang <danrtsey...@gmail.com<mailto:danrtsey...@gmail.com>> Cc: "user@flink.apache.org<mailto:user@flink.apache.org>" <user@flink.apache.org<mailto:user@flink.apache.org>>, "user...@flink.apache.org<mailto:user...@flink.apache.org>" <user...@flink.apache.org<mailto:user...@flink.apache.org>> Subject: Re: JobManager doesn't bring up new TaskManager during failure recovery Thank you, Yang! In fact I have a fine-grained dashboard for Kubernetes cluster health (like apiserver qps/latency etc.), and I didn't find anything unusual… Also, the JobManager container cpu/memory usage is low. Besides, I have a deep dive in these logs and Flink resource manager code, and find something interesting. I use taskmanager-1-9 to give you an example: 1. I can see logs “Requesting new worker with resource spec WorkerResourceSpec” at 2022-04-17 00:33:15,333. And the code location is here<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L283&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DB12qSXtjS%2BP5iQR2%2FNtJwA2Mg8g9W3PaorJSG5u5MQ%3D&reserved=0>. 2. “Creating new TaskManager pod with name stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and resource <16384,4.0>” at 2022-04-17 00:33:15,376, code location<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-kubernetes%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fkubernetes%2FKubernetesResourceManagerDriver.java%23L167&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=XqZ3izSq5GKw2%2F9ZNTp2nLR4o5RfIqd70KH4ABs9VAY%3D&reserved=0>. 3. “Pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 is created.” at 2022-04-17 00:33:15,412, code location<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-kubernetes%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fkubernetes%2FKubernetesResourceManagerDriver.java%23L190&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=KU%2BqJZObpVzHhoRcyJJDn%2FHjBmOdF3FGzMI6DMSys7w%3D&reserved=0>. The request is sent and pod is created here, so I think the apiserver is healthy at that moment. 4. But I cannot find any logs that print in line<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L301&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=LNX35fI%2FvkUanHZM3e%2F%2BHSoRr0LS7kU2qgChFg6EZGU%3D&reserved=0> and line<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L314&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=DEx%2BziLJveBjGoBS1X9VbFqwNBxHlR0J7vPXhJEvrdY%3D&reserved=0>. 5. “Discard registration from TaskExecutor stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9” at 2022-04-17 00:33:32,393. Root cause of this logs is due to the workerNodeMap<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L81&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=j5MFCzQnpKhI4rTEilrqi6TRMtSwFUGkHTfErE2%2Ftc0%3D&reserved=0> isn’t put a ResourceId that linked with taskmanager-1-9. That’s why I think things are strange here. Flink would put the ResourceId to workerNodeMap here<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fflink%2Fblob%2Frelease-1.13.2%2Fflink-runtime%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fflink%2Fruntime%2Fresourcemanager%2Factive%2FActiveResourceManager.java%23L310&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=F9TXPgJOuHEsxcvmAkuCa3QBfzPwK17ZAas%2FIKI9Lt4%3D&reserved=0>. The code didn’t execute, although its Java future condition is reached and fulfilled. And I don’t see any code that related to Kubernetes events in this piece of logic. By the way, in our expectation, would JobManager create new TaskManager in that case? BRs, Chenyu From: Yang Wang <danrtsey...@gmail.com<mailto:danrtsey...@gmail.com>> Date: Friday, April 22, 2022 at 4:49 PM To: "Zheng, Chenyu" <chenyu.zh...@disneystreaming.com<mailto:chenyu.zh...@disneystreaming.com>> Cc: "user@flink.apache.org<mailto:user@flink.apache.org>" <user@flink.apache.org<mailto:user@flink.apache.org>>, "user...@flink.apache.org<mailto:user...@flink.apache.org>" <user...@flink.apache.org<mailto:user...@flink.apache.org>> Subject: Re: JobManager doesn't bring up new TaskManager during failure recovery The root cause might be you APIServer is overloaded or not running normally. And then all the pods events of taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in FlinkResourceManager. So the two taskmanagers are not recognized by ResourceManager and then registration are rejected. The ResourceManager also did not receive the terminated pod events. That's why it does not allocate new TaskManager pods. All in all, I believe you need to check the K8s APIServer status. Best, Yang Zheng, Chenyu <chenyu.zh...@disneystreaming.com<mailto:chenyu.zh...@disneystreaming.com>> 于2022年4月22日周五 12:54写道: Hi developers! I got a strange bug during failure recovery of Flink. It seems the JobManager doesn't bring up new TaskManager during failure recovery. Some logs and information of the Flink job are pasted below. Can you take a look and give me some guidance? Thank you so much! Flink version: 1.13.2 Deploy mode: K8s native Timeline of the bug: 1. Flink job start to work with 8 taskmanagers. 2. At 2022-04-17 00:28:15,286, this job got an error and JobManager decided to restart 2 tasks (pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1, stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7) 3. The two old pod is stopped and JobManager created 2 pod (pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9, stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at 2022-04-17 00:33:15,376 4. JobManager discard two new pods’ registration at 2022-04-17 00:33:32,393 5. These new pods exited at 2022-04-17 00:33:32,396, due to the rejection of registration. 6. JobManager didn’t bring up new pods and print error “Slot request bulk is not fulfillable! Could not allocate the required slot within slot request timeout” over and over Flink logs: 1. JobManager: https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ%2Fview%3Fusp%3Dsharing&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=y%2By8rrXJwhoe8m38jOWY1e7%2BhZVeCQZqi2w%2Bs2VF9tc%3D&reserved=0> 2. TaskManager: https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdrive.google.com%2Ffile%2Fd%2F1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn%2Fview%3Fusp%3Dsharing&data=05%7C01%7Cchenyu.zheng%40disneystreaming.com%7Cacd94bd7cc3c4136b5a408da2599028c%7C65f03ca86d0a493e9e4ac85ac9526a03%7C0%7C0%7C637863636356983537%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000%7C%7C%7C&sdata=Zel1KfqjxaJy2MrdiHvXY%2BmQ%2F7SYUyrD1R1Po%2FiVGas%3D&reserved=0> BRs, Chenyu