[ https://issues.apache.org/jira/browse/FLINK-9190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16456605#comment-16456605 ]
ASF GitHub Bot commented on FLINK-9190:
---------------------------------------

GitHub user GJL opened a pull request:

https://github.com/apache/flink/pull/5931

[FLINK-9190][flip6,yarn] Request new container if container completed unexpectedly

## What is the purpose of the change

*Request new YARN container if container completed unexpectedly.*

cc: @sihuazhou @StephanEwen @tillrohrmann

## Brief change log

  - *Request new container if container completed unexpectedly.*
  - *Reduce visibility of some fields in `YarnResourceManager`.*

## Verifying this change

This change added tests and can be verified as follows:

  - *Manually verified the change by deploying a Flink cluster on YARN and killing `TaskExecutorRunner`s randomly.*

## Does this pull request potentially affect one of the following parts:

  - Dependencies (does it add or upgrade a dependency): (yes / **no**)
  - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
  - The serializers: (yes / **no** / don't know)
  - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
  - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (**yes** / no / don't know)
  - The S3 file system connector: (yes / **no** / don't know)

## Documentation

  - Does this pull request introduce a new feature? (yes / **no**)
  - If yes, how is the feature documented?
    (**not applicable** / docs / JavaDocs / not documented)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/GJL/flink FLINK-9190

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5931.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5931

----
commit 35b02327fcbcb9a7fed3ad162e26f9900c774558
Author: gyao <gary@...>
Date:   2018-04-27T13:49:31Z

    [FLINK-9190][flip6,yarn] Request new container if container completed unexpectedly.

commit 3d02f3c171a4473b25377c2319506901228ff8f3
Author: gyao <gary@...>
Date:   2018-04-27T13:51:38Z

    [hotfix][yarn] Reduce visibility of fields.

----

> YarnResourceManager sometimes does not request new Containers
> -------------------------------------------------------------
>
>                 Key: FLINK-9190
>                 URL: https://issues.apache.org/jira/browse/FLINK-9190
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination, YARN
>    Affects Versions: 1.5.0
>         Environment: Hadoop 2.8.3
> ZooKeeper 3.4.5
> Flink 71c3cd2781d36e0a03d022a38cc4503d343f7ff8
>            Reporter: Gary Yao
>            Assignee: Gary Yao
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>         Attachments: yarn-logs
>
>
> *Description*
> The {{YarnResourceManager}} does not request new containers if
> {{TaskManagers}} are killed rapidly in succession. After 5 minutes the job is
> restarted due to {{NoResourceAvailableException}}, and the job runs normally
> afterwards. I suspect that {{TaskManager}} failures are not registered if the
> failure occurs before the {{TaskManager}} registers with the master. Logs are
> attached; I added additional log statements to
> {{YarnResourceManager.onContainersCompleted}} and
> {{YarnResourceManager.onContainersAllocated}}.
> *Expected Behavior*
> The {{YarnResourceManager}} should recognize that the container is completed
> and keep requesting new containers. The job should run as soon as resources
> are available.
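
The expected behavior described above can be illustrated with a minimal sketch. This is not the code from the pull request and not Flink's `YarnResourceManager`: the class `ContainerTracker` and all of its methods are hypothetical names invented for illustration. It assumes that the resource manager tracks allocated containers itself, so that a completion is treated as unexpected whenever the container is still tracked, regardless of whether a TaskManager inside it ever registered.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical, simplified bookkeeping model. This is NOT Flink's
 * YarnResourceManager; every name here is made up for illustration.
 */
class ContainerTracker {
    // Containers that were allocated and have not been released or completed.
    private final Set<String> activeContainers = new HashSet<>();
    // Number of container requests that are still outstanding.
    private int pendingRequests = 0;

    /** Records one more outstanding container request. */
    void requestContainer() {
        pendingRequests++;
    }

    /** An allocation satisfies one outstanding request and is tracked from now on. */
    void onContainerAllocated(String containerId) {
        if (pendingRequests > 0) {
            pendingRequests--;
        }
        activeContainers.add(containerId);
    }

    /** Intentional release: stop tracking so that a later completion is not replaced. */
    void releaseContainer(String containerId) {
        activeContainers.remove(containerId);
    }

    /**
     * Completion callback. remove() returns true only if the container was
     * still tracked, i.e. it was not released on purpose, so the completion
     * is unexpected and a replacement is requested.
     */
    void onContainerCompleted(String containerId) {
        if (activeContainers.remove(containerId)) {
            requestContainer();
        }
    }

    int getPendingRequests() {
        return pendingRequests;
    }

    public static void main(String[] args) {
        ContainerTracker tracker = new ContainerTracker();
        tracker.requestContainer();
        tracker.onContainerAllocated("container_1");
        // The container dies before any TaskManager registration is modeled.
        tracker.onContainerCompleted("container_1");
        System.out.println("pending requests: " + tracker.getPendingRequests());
    }
}
```

In this sketch the decision to request a replacement depends only on the tracker's own set of active containers, which is one way to avoid the situation suspected in the description, where a failure that happens before registration goes unnoticed.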