tillrohrmann commented on a change in pull request #11323:
URL: https://github.com/apache/flink/pull/11323#discussion_r413767396



##########
File path: flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManager.java
##########
@@ -320,5 +333,16 @@ private void internalStopPod(String podName) {
                                        }
                                }
                        );
+
+               final KubernetesWorkerNode kubernetesWorkerNode = workerNodes.remove(resourceId);
+               final WorkerResourceSpec workerResourceSpec = podWorkerResources.remove(podName);
+
+               // If the stopped pod is requested in the current attempt (workerResourceSpec is known) and is not yet added,
+               // we need to notify ActiveResourceManager to decrease the pending worker count.
+               if (workerResourceSpec != null && kubernetesWorkerNode == null) {

Review comment:
       But what we want to do is to restart a recovered pod if it fails and it is needed. Assume that our cluster recovers a pod from a previous attempt. Then we submit a job and the job uses slots of this pod. Now the pod fails. Unlike in the case of a started pod, Flink won't try to restart this pod. I think there should not be a difference in the failure handling between started and recovered pods.
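
       To make the concern concrete, here is a minimal, self-contained sketch of the uniform handling I have in mind. It is not the actual KubernetesResourceManager API; the class RecoveryAwareWorkerTracker and the methods onPodTerminated, decreasePendingWorkers, and requestNewWorker are hypothetical placeholders, and plain Strings stand in for KubernetesWorkerNode and WorkerResourceSpec. The point is only that a replacement is requested when the pod is still needed, independent of whether the pod was started in this attempt or recovered from a previous one.

       import java.util.HashMap;
       import java.util.Map;

       // Hypothetical illustration only; not the real Flink KubernetesResourceManager API.
       public class RecoveryAwareWorkerTracker {

               // Pods requested in the current attempt, keyed by pod name (stands in for podWorkerResources).
               private final Map<String, String> podWorkerResources = new HashMap<>();
               // Registered worker nodes, both started and recovered, keyed by pod name (stands in for workerNodes).
               private final Map<String, String> workerNodes = new HashMap<>();

               void onPodTerminated(String podName, boolean stillNeeded) {
                       final String workerNode = workerNodes.remove(podName);
                       final String workerResourceSpec = podWorkerResources.remove(podName);

                       if (workerResourceSpec != null && workerNode == null) {
                               // Requested in this attempt but never registered:
                               // decrease the pending worker count, as in the diff above.
                               decreasePendingWorkers();
                       }

                       // Key point of the review: restart the pod if it is still needed,
                       // regardless of whether it was started in this attempt
                       // (workerResourceSpec != null) or recovered from a previous one
                       // (workerResourceSpec == null).
                       if (stillNeeded) {
                               requestNewWorker();
                       }
               }

               private void decreasePendingWorkers() {
                       System.out.println("decrease pending worker count");
               }

               private void requestNewWorker() {
                       System.out.println("request replacement worker");
               }

               public static void main(String[] args) {
                       RecoveryAwareWorkerTracker tracker = new RecoveryAwareWorkerTracker();
                       // A pod recovered from a previous attempt: known as a worker node,
                       // but with no resource spec recorded in this attempt.
                       tracker.workerNodes.put("taskmanager-1", "recovered");
                       tracker.onPodTerminated("taskmanager-1", true);
               }
       }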



