[ https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457261#comment-16457261 ]
ASF GitHub Bot commented on FLINK-9190: --------------------------------------- Github user sihuazhou commented on a diff in the pull request: https://github.com/apache/flink/pull/5931#discussion_r184831311 --- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java --- @@ -293,21 +293,16 @@ public void startNewWorker(ResourceProfile resourceProfile) { } @Override - public boolean stopWorker(YarnWorkerNode workerNode) { - if (workerNode != null) { - Container container = workerNode.getContainer(); - log.info("Stopping container {}.", container.getId()); - // release the container on the node manager - try { - nodeManagerClient.stopContainer(container.getId(), container.getNodeId()); - } catch (Throwable t) { - log.warn("Error while calling YARN Node Manager to stop container", t); - } - resourceManagerClient.releaseAssignedContainer(container.getId()); - workerNodeMap.remove(workerNode.getResourceID()); - } else { - log.error("Can not find container for null workerNode."); + public boolean stopWorker(final YarnWorkerNode workerNode) { + final Container container = workerNode.getContainer(); + log.info("Stopping container {}.", container.getId()); + try { + nodeManagerClient.stopContainer(container.getId(), container.getNodeId()); + } catch (final Exception e) { + log.warn("Error while calling YARN Node Manager to stop container", e); --- End diff -- Previous version was the `Throwable` here, currently changed to `Exception`, what the reason here? Beside, if we change here from `Throwable` -> `Exception` then maybe we should also change the other places where have a similar operation with `nodeManagerClient` like here, e.g. ![image](https://user-images.githubusercontent.com/7480427/39389518-e609686c-4abb-11e8-99df-b0cd013a58ef.png) > YarnResourceManager sometimes does not request new Containers > ------------------------------------------------------------- > > Key: FLINK-9190 > URL: https://issues.apache.org/jira/browse/FLINK-9190 > Project: Flink > Issue Type: Bug > Components: Distributed Coordination, YARN > Affects Versions: 1.5.0 > Environment: Hadoop 2.8.3 > ZooKeeper 3.4.5 > Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8 > Reporter: Gary Yao > Assignee: Gary Yao > Priority: Blocker > Labels: flip-6 > Fix For: 1.5.0 > > Attachments: yarn-logs > > > *Description* > The {{YarnResourceManager}} does not request new containers if > {{TaskManagers}} are killed rapidly in succession. After 5 minutes the job is > restarted due to {{NoResourceAvailableException}}, and the job runs normally > afterwards. I suspect that {{TaskManager}} failures are not registered if the > failure occurs before the {{TaskManager}} registers with the master. Logs are > attached; I added additional log statements to > {{YarnResourceManager.onContainersCompleted}} and > {{YarnResourceManager.onContainersAllocated}}. > *Expected Behavior* > The {{YarnResourceManager}} should recognize that the container is completed > and keep requesting new containers. The job should run as soon as resources > are available. -- This message was sent by Atlassian JIRA (v7.6.3#76005)