KarmaGYZ commented on a change in pull request #11248: [FLINK-16299] Release containers recovered from previous attempt in w… URL: https://github.com/apache/flink/pull/11248#discussion_r386859531
########## File path: flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java ########## @@ -464,7 +472,15 @@ public void onContainerStarted(ContainerId containerId, Map<String, ByteBuffer> @Override public void onContainerStatusReceived(ContainerId containerId, ContainerStatus containerStatus) { - // We are not interested in getting container status + // We fetch the status of the container from the previous attempts. + if (containerStatus.getState() == ContainerState.NEW) { Review comment: I just have an offline discussion with Xintong and Taoyang. The key point of this issue is: we need to release the container which is not started in the previous attempt or is not useable anymore. When we try to get the status of those containers: - If it goes into `onContainerStatusReceived`: It means that the container has been started in the previous attempt. No matter what is the state, these containers will be released eventually by the existing `onStartContainerError` or `onContainersCompleted`. There is no need to handle them. - If it goes into `onGetContainerStatusError`: It means that the container is not started in the previous attempt or other causes like NM lost. In such cases, the container is not useable and we need to release it and remove it from the `workerNodeMap`. So, I will move the release logic here to onGetContainerStatusError. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services