[ https://issues.apache.org/jira/browse/FLINK-9567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16529560#comment-16529560 ]
ASF GitHub Bot commented on FLINK-9567:
---------------------------------------

GitHub user Clarkkkkk opened a pull request:

    https://github.com/apache/flink/pull/6237

[FLINK-9567][runtime][yarn] Fix the bug that Flink does not release YarnContainer in some cases

## What is the purpose of the change

- This pull request responds to [JIRA issue FLINK-9567](https://issues.apache.org/jira/browse/FLINK-9567).
- It prevents Flink from holding on to excess YARN containers when the YARN callback onContainerCompleted is invoked after a full restart.

## Brief change log

- Modify the onContainerCompleted method in YarnResourceManager (a simplified sketch of the idea follows this comment).
- Add a getNumberPendingSlotRequests method to SlotManager that reports how many pending slot requests are still unfulfilled.
- Add a getNumberPendingSlotRequests method to ResourceManager that obtains the number of pending slot requests from the SlotManager.

## Verifying this change

This change is covered by the testOnContainerCompleted test added to YarnResourceManagerTest.

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
- The S3 file system connector: (no)

## Documentation

- Does this pull request introduce a new feature? (no)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/Clarkkkkk/flink master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #6237

----
commit 92086f1c56d9d170619fae170aed092e075c7c63
Author: yangshimin <yangshimin@...>
Date:   2018-07-02T03:56:00Z

    [FLINK-9567][runtime][yarn] Fix the bug that Flink does not release Yarn
    container when the onContainerCompleted callback happens after a full restart

----
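A minimal, self-contained sketch of the fix idea listed in the change log above. This is not the actual Flink code: PendingSlotTracker is a hypothetical stand-in for the SlotManager's proposed getNumberPendingSlotRequests method, and the real YarnResourceManager interaction is more involved.

```java
// Simplified sketch of the fix idea; types and bodies are illustrative stand-ins,
// NOT the real Flink classes.
import java.util.concurrent.atomic.AtomicInteger;

public class ContainerCompletionSketch {

    /** Hypothetical stand-in for the SlotManager side of the change. */
    interface PendingSlotTracker {
        int getNumberPendingSlotRequests();
    }

    private final PendingSlotTracker slotTracker;
    private final AtomicInteger numPendingContainerRequests = new AtomicInteger(0);

    ContainerCompletionSketch(PendingSlotTracker slotTracker) {
        this.slotTracker = slotTracker;
    }

    /** Called when YARN reports that a container has completed. */
    void onContainerCompleted(String containerId) {
        closeTaskManagerConnection(containerId);

        // Core idea of the fix: only ask YARN for a replacement container while the
        // slot manager still reports unfulfilled slot requests. (The exact condition
        // in the actual PR may differ; this shows the general shape.)
        if (slotTracker.getNumberPendingSlotRequests() > 0) {
            requestYarnContainer();
        }
    }

    private void requestYarnContainer() {
        // In the real YarnResourceManager this issues a container request to YARN;
        // here we only track the pending count.
        numPendingContainerRequests.incrementAndGet();
    }

    private void closeTaskManagerConnection(String containerId) {
        // Release the TaskExecutor connection for this container, if one exists.
    }
}
```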
> Flink does not release resource in Yarn Cluster mode
> -----------------------------------------------------
>
>                 Key: FLINK-9567
>                 URL: https://issues.apache.org/jira/browse/FLINK-9567
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, YARN
>    Affects Versions: 1.5.0
>            Reporter: Shimin Yang
>            Assignee: Shimin Yang
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.6.0
>
>         Attachments: FlinkYarnProblem, fulllog.txt
>
>
> After restarting the Job Manager in Yarn Cluster mode, Flink sometimes does not
> release TaskManager containers in some specific cases. In the worst case, I had a
> job configured with 5 task managers that ended up holding more than 100 containers.
> Although the job did not fail, this affects other jobs in the Yarn cluster.
> In the first log I posted, the container with id 24 is the reason why Yarn did not
> release resources. The container was killed before the restart, but the
> *onContainerComplete* callback in *YarnResourceManager*, which should be invoked by
> Yarn's *AMRMAsyncClient*, had not been received yet.
> After the restart, as can be seen in line 347 of the FlinkYarnProblem log:
> 2018-06-14 22:50:47,846 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@bd-r1hdp69:30609] has failed, address is now gated for [50] ms. Reason: [Disassociated]
> Flink lost the connection to container 24, which runs on the bd-r1hdp69 machine.
> When it tried to call *closeTaskManagerConnection* in *onContainerComplete*, it no
> longer had a connection to the TaskManager on container 24, so it simply ignored
> the close of the TaskManager:
> 2018-06-14 22:50:51,812 DEBUG org.apache.flink.yarn.YarnResourceManager - No open TaskExecutor connection container_1528707394163_29461_02_000024. Ignoring close TaskExecutor connection.
> However, before calling *closeTaskManagerConnection*, it had already called
> *requestYarnContainer*, which increased the *numPendingContainerRequests* variable
> in *YarnResourceManager* by 1.
> Since the return of excess containers is governed by the *numPendingContainerRequests*
> variable in *YarnResourceManager*, this container cannot be returned even though it is
> not required. Meanwhile, the restart logic has already allocated enough containers for
> the Task Managers, so Flink holds the extra container for a long time for nothing.
> In the full log, the job ended with 7 containers while only 3 were running TaskManagers.
> ps: Another strange thing I found is that sometimes when requesting a yarn container,
> Yarn returns many more containers than requested. Is that a normal scenario for
> AMRMAsyncClient?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
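A minimal sketch of the accounting problem described in the quoted issue above. Field and method names mirror the text (numPendingContainerRequests, onContainerCompleted, onContainersAllocated), but this is an illustrative simplification, not the real YarnResourceManager implementation.

```java
// Illustrative simplification of the pre-fix container accounting; NOT the real
// YarnResourceManager code.
import java.util.List;

public class ExcessContainerSketch {

    private int numPendingContainerRequests = 0;

    /** A stale completion callback arriving after a full restart. */
    void onContainerCompleted(String completedContainerId) {
        // The TaskManager connection is already gone, so closing it is a no-op,
        // yet a replacement container is still requested unconditionally.
        requestYarnContainer(); // spurious pending request
    }

    /** Containers granted by YARN. */
    void onContainersAllocated(List<String> containerIds) {
        for (String id : containerIds) {
            if (numPendingContainerRequests > 0) {
                // The inflated counter makes this surplus container look requested,
                // so it is kept instead of being returned to YARN.
                numPendingContainerRequests--;
                startTaskManager(id);
            } else {
                returnExcessContainer(id);
            }
        }
    }

    private void requestYarnContainer() {
        numPendingContainerRequests++;
    }

    private void startTaskManager(String containerId) {
        // Launch a TaskExecutor in the container (omitted).
    }

    private void returnExcessContainer(String containerId) {
        // Hand the container back to YARN (omitted).
    }
}
```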