[ https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16445315#comment-16445315 ]
ASF GitHub Bot commented on FLINK-9190:
---------------------------------------

GitHub user sihuazhou opened a pull request:

    https://github.com/apache/flink/pull/5881

    [FLINK-9190][yarn] fix YarnResourceManager sometimes does not request new Containers

    ## What is the purpose of the change

    This PR fixes the problem that `YarnResourceManager` does not request new containers when containers are killed before registering with the `ResourceManager`.

    ## Brief change log

    - *fix YarnResourceManager sometimes does not request new Containers*

    ## Verifying this change

    - *add unit test `YarnResourceManagerTest#testKillContainerBeforeTMRegisterSuccessfully()` to verify this*

    ## Does this pull request potentially affect one of the following parts:

    - Dependencies (does it add or upgrade a dependency): (no)
    - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
    - The serializers: (no)
    - The runtime per-record code paths (performance sensitive): (no)
    - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (*yes*)
    - The S3 file system connector: (no)

    ## Documentation

    no

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sihuazhou/flink fixYarnResourceManagerRequestContainers

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5881.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5881

----
commit bbf03ca7fc709e11627560466bff01b9e750bbd2
Author: sihuazhou <summerleafs@...>
Date:   2018-04-20T05:02:28Z

    fix YarnResourceManager sometimes does not request new Containers

----

> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Sihua Zhou
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>         Attachments: yarn-logs
>
>
> *Description*
> The {{YarnResourceManager}} does not request new containers if
> {{TaskManagers}} are killed in rapid succession. After 5 minutes the job is
> restarted due to {{NoResourceAvailableException}}, and the job runs normally
> afterwards. I suspect that {{TaskManager}} failures are not registered if the
> failure occurs before the {{TaskManager}} registers with the master. Logs are
> attached; I added additional log statements to
> {{YarnResourceManager.onContainersCompleted}} and
> {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed
> and keep requesting new containers. The job should run as soon as resources
> are available.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
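
For illustration only, the expected behavior described in the issue can be sketched with a simplified bookkeeping model. This is a minimal sketch under stated assumptions, not the change made in the pull request: the class `ContainerTrackerSketch` and all of its methods are hypothetical and are not part of Flink's `YarnResourceManager` API. It only shows the idea that a completed container should trigger a replacement request whether or not its TaskManager had already registered.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical, simplified model; NOT Flink's YarnResourceManager.
class ContainerTrackerSketch {
    // Containers that were allocated but whose TaskManager has not registered yet.
    private final Set<String> allocated = new HashSet<>();
    // Containers whose TaskManager has registered.
    private final Set<String> registered = new HashSet<>();
    // Number of container requests that are still outstanding.
    private int pendingRequests;

    ContainerTrackerSketch(int initialRequests) {
        this.pendingRequests = initialRequests;
    }

    // An outstanding request was satisfied with a new container.
    void onContainerAllocated(String containerId) {
        if (pendingRequests > 0) {
            pendingRequests--;
            allocated.add(containerId);
        }
    }

    // The TaskManager running in the container registered successfully.
    void onTaskManagerRegistered(String containerId) {
        if (allocated.remove(containerId)) {
            registered.add(containerId);
        }
    }

    // The container completed (for example, it was killed). A replacement is
    // requested even if the TaskManager never registered, which is the case
    // the issue describes as being missed.
    void onContainerCompleted(String containerId) {
        boolean wasKnown = allocated.remove(containerId) || registered.remove(containerId);
        if (wasKnown) {
            pendingRequests++;
        }
    }

    int pendingRequests() {
        return pendingRequests;
    }
}
```

In this model, tracking only registered containers would leave `pendingRequests` unchanged when an unregistered container is killed, which corresponds to the symptom reported above.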