[ https://issues.apache.org/jira/browse/FLINK-17176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Canbin Zheng updated FLINK-17176:
---------------------------------
    Description:
In native K8s setups, there are cases where the speed of Pod re-creation is not controlled, which risks flooding the K8s API Server from the {{PodCallbackHandler}} implementation in {{KubernetesResourceManager}}.

Here are the steps to reproduce this kind of problem:
 # Mount {{/opt/flink/log}} in the TaskManager Container to a path on the K8s nodes via HostPath; make sure that the path exists but the TaskManager process has no write permission on it. This can be done via the [user-specified pod template support|https://issues.apache.org/jira/browse/FLINK-15656] or by hardcoding it for testing only.
 # Launch a session cluster
 # Submit a new job to the session cluster. As expected, the Pod repeatedly fails shortly after launching the main Container, and {{KubernetesResourceManager#onModified}} is then invoked to re-create a new Pod immediately, without any speed control.

To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* event and that Pod terminates before successfully registering with the {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately sends another creation request to the K8s API Server.

  was:
In native K8s setups, there are cases where the speed of Pod re-creation is not controlled, which risks flooding the K8s API Server from the {{PodCallbackHandler}} implementation in {{KubernetesResourceManager}}.

Here are the steps to reproduce this kind of problem:
 # Mount {{/opt/flink/log}} in the TaskManager Container to a path on the K8s nodes via HostPath; make sure that the path exists but the TaskManager process has no write permission on it. This can be done via the user-specified pod template support or by hardcoding it for testing only.
 # Launch a session cluster
 # Submit a new job to the session cluster. As expected, the Pod repeatedly fails shortly after launching the main Container, and {{KubernetesResourceManager#onModified}} is then invoked to re-create a new Pod immediately, without any speed control.

To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* event and that Pod terminates before successfully registering with the {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately sends another creation request to the K8s API Server.


> Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler
> -------------------------------------------------------------------------
>
>                 Key: FLINK-17176
>                 URL: https://issues.apache.org/jira/browse/FLINK-17176
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0
>            Reporter: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>
> In native K8s setups, there are cases where the speed of Pod re-creation is
> not controlled, which risks flooding the K8s API Server from the
> {{PodCallbackHandler}} implementation in {{KubernetesResourceManager}}.
> Here are the steps to reproduce this kind of problem:
> # Mount {{/opt/flink/log}} in the TaskManager Container to a path on the
> K8s nodes via HostPath; make sure that the path exists but the TaskManager
> process has no write permission on it. This can be done via the
> [user-specified pod template
> support|https://issues.apache.org/jira/browse/FLINK-15656] or by hardcoding
> it for testing only.
> # Launch a session cluster
> # Submit a new job to the session cluster. As expected, the Pod repeatedly
> fails shortly after launching the main Container, and
> {{KubernetesResourceManager#onModified}} is then invoked to re-create a new
> Pod immediately, without any speed control.
> To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD*
> event and that Pod terminates before successfully registering with the
> {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately
> sends another creation request to the K8s API Server.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
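
For illustration only, one way to slow down the re-creation described above is an exponential backoff between creation requests. The sketch below is not the fix adopted for this issue and uses no Flink or Kubernetes client API; the class {{PodRecreationBackoff}} and its method names are hypothetical. It only computes how long to wait after a Pod terminates before its TaskManager has registered.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not part of Flink): computes the delay to apply before
 * sending the next Pod creation request after a Pod failed before registration.
 */
final class PodRecreationBackoff {
    private final long initialDelayMs;
    private final long maxDelayMs;
    private int consecutiveFailures;

    PodRecreationBackoff(Duration initialDelay, Duration maxDelay) {
        long initial = initialDelay.toMillis();
        long max = maxDelay.toMillis();
        if (initial < 1 || max < initial) {
            throw new IllegalArgumentException("need 1ms <= initialDelay <= maxDelay");
        }
        this.initialDelayMs = initial;
        this.maxDelayMs = max;
    }

    /** Call when a Pod terminates before registering; returns the wait before re-creation. */
    Duration onPodFailedBeforeRegistration() {
        long delay = initialDelayMs;
        // Double the delay once per previous consecutive failure, capped at maxDelayMs.
        for (int i = 0; i < consecutiveFailures && delay < maxDelayMs; i++) {
            // Compare against max / 2 first so the doubling cannot overflow.
            delay = (delay <= maxDelayMs / 2) ? delay * 2 : maxDelayMs;
        }
        consecutiveFailures++;
        return Duration.ofMillis(delay);
    }

    /** Call when a TaskManager registers successfully; resets the backoff. */
    void onPodRegistered() {
        consecutiveFailures = 0;
    }
}
```

A caller would schedule the next creation request after the returned delay instead of issuing it immediately, and reset the backoff once a TaskManager registers, so that a healthy cluster is not slowed down by earlier failures.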