[jira] [Updated] (FLINK-17176) Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler

Canbin Zheng (Jira) Thu, 16 Apr 2020 02:41:04 -0700


     [ https://issues.apache.org/jira/browse/FLINK-17176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Canbin Zheng updated FLINK-17176:
---------------------------------
    Description: 
In native K8s setups, there are cases in which the speed of Pod re-creation 
is not controlled, so the {{PodCallbackHandler}} implementation of 
{{KubernetesResourceManager}} risks flooding the K8s API Server.

Steps to reproduce this kind of problem:
 # Mount {{/opt/flink/log}} in the TaskManager container to a path on the K8s 
node via HostPath. Make sure that the path exists but that the TaskManager 
process has no write permission to it. This can be done via the [user-specified 
pod template support|https://issues.apache.org/jira/browse/FLINK-15656] or by 
hardcoding it for testing only.
 # Launch a session cluster.
 # Submit a new job to the session cluster. As expected, the Pod repeatedly 
fails quickly while launching the main container, and 
{{KubernetesResourceManager#onModified}} is then invoked to create a new Pod 
immediately, without any rate control.

To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* event 
and that Pod terminates before successfully registering with the 
{{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately 
sends another creation request to the K8s API Server.

  was:
In native K8s setups, there are cases in which the speed of Pod re-creation 
is not controlled, so the {{PodCallbackHandler}} implementation of 
{{KubernetesResourceManager}} risks flooding the K8s API Server.

Steps to reproduce this kind of problem:
 # Mount {{/opt/flink/log}} in the TaskManager container to a path on the K8s 
node via HostPath. Make sure that the path exists but that the TaskManager 
process has no write permission to it. This can be done via the user-specified 
pod template support or by hardcoding it for testing only.
 # Launch a session cluster.
 # Submit a new job to the session cluster. As expected, the Pod repeatedly 
fails quickly while launching the main container, and 
{{KubernetesResourceManager#onModified}} is then invoked to create a new Pod 
immediately, without any rate control.

To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* event 
and that Pod terminates before successfully registering with the 
{{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately 
sends another creation request to the K8s API Server.


> Slow down Pod re-creation in KubernetesResourceManager#PodCallbackHandler
> -------------------------------------------------------------------------
>
>                 Key: FLINK-17176
>                 URL: https://issues.apache.org/jira/browse/FLINK-17176
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0
>            Reporter: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>
> In native K8s setups, there are cases in which the speed of Pod re-creation 
> is not controlled, so the {{PodCallbackHandler}} implementation of 
> {{KubernetesResourceManager}} risks flooding the K8s API Server.
> Steps to reproduce this kind of problem:
>  # Mount {{/opt/flink/log}} in the TaskManager container to a path on the 
> K8s node via HostPath. Make sure that the path exists but that the 
> TaskManager process has no write permission to it. This can be done via the 
> [user-specified pod template 
> support|https://issues.apache.org/jira/browse/FLINK-15656] or by hardcoding 
> it for testing only.
>  # Launch a session cluster.
>  # Submit a new job to the session cluster. As expected, the Pod repeatedly 
> fails quickly while launching the main container, and 
> {{KubernetesResourceManager#onModified}} is then invoked to create a new Pod 
> immediately, without any rate control.
> To sum up, once the {{KubernetesResourceManager}} receives the Pod *ADD* 
> event and that Pod terminates before successfully registering with the 
> {{KubernetesResourceManager}}, the {{KubernetesResourceManager}} immediately 
> sends another creation request to the K8s API Server.
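
One common way to slow down the re-creation loop described in this issue is an 
exponential backoff between creation requests. The issue itself does not say 
which mechanism should be used, so the following is only a minimal sketch under 
that assumption. The class {{PodRecreationBackoff}} and its methods are 
hypothetical names for illustration and are not part of Flink.

```java
import java.time.Duration;

/**
 * Hypothetical helper (not a Flink class): computes how long to wait before
 * sending the next Pod creation request after a Pod terminated without
 * registering. The delay doubles on each consecutive failure up to a cap.
 */
public final class PodRecreationBackoff {
    private static final int MAX_SHIFT = 30; // keeps the bit shift well-defined

    private final Duration initialDelay;
    private final Duration maxDelay;
    private int consecutiveFailures = 0;

    public PodRecreationBackoff(Duration initialDelay, Duration maxDelay) {
        if (initialDelay.isZero() || initialDelay.isNegative()) {
            throw new IllegalArgumentException("initialDelay must be positive");
        }
        if (maxDelay.compareTo(initialDelay) < 0) {
            throw new IllegalArgumentException("maxDelay must be >= initialDelay");
        }
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
    }

    /** Records one more failure and returns the delay to apply before re-creating. */
    public Duration nextDelay() {
        int shift = consecutiveFailures;
        if (consecutiveFailures < MAX_SHIFT) {
            consecutiveFailures++;
        }
        try {
            long millis = Math.multiplyExact(initialDelay.toMillis(), 1L << shift);
            Duration delay = Duration.ofMillis(millis);
            return delay.compareTo(maxDelay) > 0 ? maxDelay : delay;
        } catch (ArithmeticException overflow) {
            // The doubled delay no longer fits in a long, so use the cap.
            return maxDelay;
        }
    }

    /** Call once a Pod has registered successfully, so the next failure starts small again. */
    public void reset() {
        consecutiveFailures = 0;
    }
}
```

In such a design, the handler that currently re-creates the Pod immediately 
would instead schedule the creation request after {{nextDelay()}}, and call 
{{reset()}} when a TaskManager registers successfully. Where exactly this would 
hook into {{KubernetesResourceManager}} is left open here.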



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
