Re: JobManager doesn't bring up new TaskManager during failure recovery

Yang Wang Sat, 23 Apr 2022 19:20:38 -0700

After more debugging, I think this issue is the same as FLINK-24315 [1],
which is fixed in 1.13.3.


[1]. https://issues.apache.org/jira/browse/FLINK-24315

Best,
Yang

Zheng, Chenyu <chenyu.zh...@disneystreaming.com> wrote on Fri, Apr 22, 2022 at 18:27:

> I created a JIRA ticket https://issues.apache.org/jira/browse/FLINK-27350
> to track this issue.
>
>
>
> BRs,
>
> Chenyu
>
>
>
> *From: *"Zheng, Chenyu" <chenyu.zh...@disneystreaming.com>
> *Date: *Friday, April 22, 2022 at 6:26 PM
> *To: *Yang Wang <danrtsey...@gmail.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, "
> user...@flink.apache.org" <user...@flink.apache.org>
> *Subject: *Re: JobManager doesn't bring up new TaskManager during failure
> recovery
>
>
>
> Thank you, Yang!
>
>
>
> In fact I have a fine-grained dashboard for Kubernetes cluster health
> (like apiserver qps/latency etc.), and I didn't find anything unusual…
> Also, the JobManager container cpu/memory usage is low.
>
>
>
> Besides, I did a deep dive into these logs and the Flink resource manager
> code and found something interesting. I will use taskmanager-1-9 as an
> example:
>
>    1. I can see the log “Requesting new worker with resource spec
>    WorkerResourceSpec” at 2022-04-17 00:33:15,333. The code location is
>    here
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L283>
>    .
>    2. “Creating new TaskManager pod with name
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and resource
>    <16384,4.0>” at 2022-04-17 00:33:15,376, code location
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L167>
>    .
>    3. “Pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 is
>    created.” at 2022-04-17 00:33:15,412, code location
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L190>.
>    *The request was sent and the pod was created here, so I think the
>    apiserver was healthy at that moment.*
>    4. But I cannot find any logs printed at line
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L301>
>    and line
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L314>
>    .
>    5. “Discard registration from TaskExecutor
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9” at 2022-04-17
>    00:33:32,393. The root cause of this log is that no ResourceId linked
>    with taskmanager-1-9 was put into the workerNodeMap
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L81>.
>
> That’s why I think things are strange here. Flink would put the ResourceId
> into workerNodeMap here
> <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L310>.
> *That code didn’t execute, although the condition of its Java future was
> reached and fulfilled. And I don’t see any code related to Kubernetes events
> in this piece of logic.*
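
The bookkeeping described in steps 1 to 5 can be sketched as a toy model. This is not the code in ActiveResourceManager: the class `WorkerRegistry` and its method names are hypothetical, and it captures only the idea from the message, namely that a worker is recorded in the map when the callback on the pod-creation future runs, and that a registration from a worker missing from the map is discarded.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Toy model only: hypothetical names, not Flink's ActiveResourceManager. */
class WorkerRegistry {
    // Stands in for workerNodeMap: resource ID -> pod name of a recognized worker.
    private final Map<String, String> workerNodeMap = new ConcurrentHashMap<>();

    /** Records the worker only when the pod-creation future completes normally. */
    void requestWorker(String resourceId, CompletableFuture<String> podCreation) {
        podCreation.whenComplete((podName, error) -> {
            if (error == null) {
                workerNodeMap.put(resourceId, podName);
            }
        });
    }

    /** Returns true if the registration is accepted, false if it is discarded. */
    boolean register(String resourceId) {
        return workerNodeMap.containsKey(resourceId);
    }

    public static void main(String[] args) {
        WorkerRegistry registry = new WorkerRegistry();
        CompletableFuture<String> podCreation = new CompletableFuture<>();
        registry.requestWorker("taskmanager-1-9", podCreation);

        // The callback has not run yet, so the worker is unknown and is discarded.
        System.out.println("before callback: " + registry.register("taskmanager-1-9"));

        // Once the callback has run, the same registration is accepted.
        podCreation.complete("pod-taskmanager-1-9");
        System.out.println("after callback: " + registry.register("taskmanager-1-9"));
    }
}
```

In this model, a registration that arrives before the callback has run is discarded, which matches the symptom in step 5. Why the callback did not run in the real job is not something the sketch explains.
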
>
>
>
> By the way, is the JobManager expected to create a new TaskManager in
> that case?
>
>
>
> BRs,
>
> Chenyu
>
>
>
> *From: *Yang Wang <danrtsey...@gmail.com>
> *Date: *Friday, April 22, 2022 at 4:49 PM
> *To: *"Zheng, Chenyu" <chenyu.zh...@disneystreaming.com>
> *Cc: *"user@flink.apache.org" <user@flink.apache.org>, "
> user...@flink.apache.org" <user...@flink.apache.org>
> *Subject: *Re: JobManager doesn't bring up new TaskManager during failure
> recovery
>
>
>
> The root cause might be that your APIServer is overloaded or not running
> normally, so the pod events of taskmanager-1-9 and taskmanager-1-10 are
> not delivered to the watch in FlinkResourceManager.
>
> So the two taskmanagers are not recognized by the ResourceManager, and
> their registrations are rejected.
>
>
>
> The ResourceManager also did not receive the terminated pod events. That's
> why it does not allocate new TaskManager pods.
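
The hypothesis in this message can be illustrated with a toy model of event-driven replacement. All names here (`PodEventModel`, `onPodAdded`, `onPodTerminated`) are hypothetical and are not Flink's or the Kubernetes client's API. The only assumption taken from the message is that pods become known, and replacements are requested, solely in reaction to delivered watch events.

```java
import java.util.HashSet;
import java.util.Set;

/** Toy model only: hypothetical names, not Flink's Kubernetes watch API. */
class PodEventModel {
    private final Set<String> knownPods = new HashSet<>();
    private int replacementsRequested = 0;

    /** Called when the watch delivers an "added" event for a pod. */
    void onPodAdded(String podName) {
        knownPods.add(podName);
    }

    /** Called when the watch delivers a "terminated" event; requests a replacement. */
    void onPodTerminated(String podName) {
        if (knownPods.remove(podName)) {
            replacementsRequested++;
        }
    }

    boolean isRecognized(String podName) {
        return knownPods.contains(podName);
    }

    int getReplacementsRequested() {
        return replacementsRequested;
    }

    public static void main(String[] args) {
        // Healthy watch: both events arrive, so a replacement is requested.
        PodEventModel healthy = new PodEventModel();
        healthy.onPodAdded("taskmanager-1-9");
        healthy.onPodTerminated("taskmanager-1-9");
        System.out.println("healthy replacements: " + healthy.getReplacementsRequested());

        // Broken watch: no events arrive, so the pod is never recognized
        // and its termination never triggers a replacement.
        PodEventModel broken = new PodEventModel();
        System.out.println("broken recognized: " + broken.isRecognized("taskmanager-1-9"));
        System.out.println("broken replacements: " + broken.getReplacementsRequested());
    }
}
```
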
>
>
>
> All in all, I believe you need to check the K8s APIServer status.
>
>
>
> Best,
>
> Yang
>
>
>
> Zheng, Chenyu <chenyu.zh...@disneystreaming.com> wrote on Fri, Apr 22, 2022 at 12:54:
>
> Hi developers!
>
>
>
> I got a strange bug during failure recovery of Flink. It seems the
> JobManager doesn't bring up new TaskManager during failure recovery. Some
> logs and information of the Flink job are pasted below. Can you take a look
> and give me some guidance? Thank you so much!
>
>
>
> Flink version: 1.13.2
>
> Deploy mode: K8s native
>
> Timeline of the bug:
>
>    1. The Flink job started working with 8 taskmanagers.
>    2. At *2022-04-17 00:28:15,286*, the job got an error and the JobManager
>    decided to restart 2 tasks (pods
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1,
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7).
>    3. The two old pods were stopped and the JobManager created 2 pods (pods
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9,
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17
>    00:33:15,376*.
>    4. The JobManager discarded the two new pods’ registrations at *2022-04-17
>    00:33:32,393*.
>    5. These new pods exited at *2022-04-17 00:33:32,396* due to the
>    rejection of registration.
>    6. The JobManager didn’t bring up new pods and printed the error “Slot
>    request bulk is not fulfillable! Could not allocate the required slot
>    within slot request timeout” over and over.
>
> Flink logs:
>
> 1.      JobManager:
> https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing
>
> 2.      TaskManager:
> https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing
>
>
>
>
>
> BRs,
>
> Chenyu
>
>
