[jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers

ASF GitHub Bot (JIRA) Wed, 09 May 2018 10:02:14 -0700

    [ 
https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16469092#comment-16469092
 ]


ASF GitHub Bot commented on FLINK-9190:
---------------------------------------

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/5931#discussion_r187106742
  
    --- Diff: 
flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java ---
    @@ -293,21 +293,16 @@ public void startNewWorker(ResourceProfile 
resourceProfile) {
        }
     
        @Override
    -   public boolean stopWorker(YarnWorkerNode workerNode) {
    -           if (workerNode != null) {
    -                   Container container = workerNode.getContainer();
    -                   log.info("Stopping container {}.", container.getId());
    -                   // release the container on the node manager
    -                   try {
    -                           
nodeManagerClient.stopContainer(container.getId(), container.getNodeId());
    -                   } catch (Throwable t) {
    -                           log.warn("Error while calling YARN Node Manager 
to stop container", t);
    -                   }
    -                   
resourceManagerClient.releaseAssignedContainer(container.getId());
    -                   workerNodeMap.remove(workerNode.getResourceID());
    -           } else {
    -                   log.error("Can not find container for null 
workerNode.");
    +   public boolean stopWorker(final YarnWorkerNode workerNode) {
    +           final Container container = workerNode.getContainer();
    +           log.info("Stopping container {}.", container.getId());
    +           try {
    +                   nodeManagerClient.stopContainer(container.getId(), 
container.getNodeId());
    +           } catch (final Exception e) {
    +                   log.warn("Error while calling YARN Node Manager to stop 
container", e);
                }
    +           
resourceManagerClient.releaseAssignedContainer(container.getId());
    +           workerNodeMap.remove(workerNode.getResourceID());
                return true;
    --- End diff --
    
    the `stopWorker` method should only be called from the main thread. 
Therefore, there should be no race condition.


> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>         Attachments: yarn-logs
>
>
> *Description*
> The {{YarnResourceManager}} does not request new containers if 
> {{TaskManagers}} are killed rapidly in succession. After 5 minutes the job is 
> restarted due to {{NoResourceAvailableException}}, and the job runs normally 
> afterwards. I suspect that {{TaskManager}} failures are not registered if the 
> failure occurs before the {{TaskManager}} registers with the master. Logs are 
> attached; I added additional log statements to 
> {{YarnResourceManager.onContainersCompleted}} and 
> {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed 
> and keep requesting new containers. The job should run as soon as resources 
> are available. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-9190) YarnResourceManager sometimes does not request new Containers

Reply via email to