[ https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16828949#comment-16828949 ]
Till Rohrmann commented on FLINK-12342:
---------------------------------------

Before diving into the implementation, I would first like to fully understand the problem. Concretely, a YARN reference that explains this behaviour would be good. Otherwise we might simply fix a symptom, or not the problem at all.

> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
>                 Key: FLINK-12342
>                 URL: https://issues.apache.org/jira/browse/FLINK-12342
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>        Environment: We run jobs on Flink release 1.6.3.
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, containers are acquired one by one as requests arrive from the SlotManager. This mechanism works when the job is small, say fewer than 32 containers. If the job needs 256 containers, they cannot all be allocated immediately, and the pending requests kept in AMRMClient are not removed accordingly. We observed that AMRMClient then asks for the current number of pending requests + 1 (the new request from the SlotManager) containers. During the startup of such a job, it ended up asking for 4000+ containers. If an external dependency issue occurs at the same time, for example slow HDFS access, the whole job is blocked without getting enough resources and is finally killed with a SlotManager request timeout. Therefore, we should use the total number of containers already requested, rather than the pending requests in AMRMClient, as the threshold for deciding whether one more resource request needs to be added.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
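[Editor's note] A minimal sketch of the bookkeeping the issue description proposes: the resource manager keeps its own count of containers requested but not yet allocated, and consults that count instead of the pending-request list inside AMRMClient. The class, field, and method names below (ContainerRequestBookkeeping, numPendingContainerRequests, etc.) are illustrative assumptions, not the actual Flink YarnFlinkResourceManager code.

{code:java}
// Illustrative sketch only; names are hypothetical, not Flink's real API.
// Idea from FLINK-12342: decide whether to request one more container based
// on our own count of outstanding requests, not AMRMClient's pending list.
public class ContainerRequestBookkeeping {

    // Containers requested from YARN but not yet allocated to us.
    private int numPendingContainerRequests = 0;

    /** Called when the SlotManager asks for one more worker. */
    public boolean shouldRequestNewContainer(int numRequiredContainers,
                                             int numAllocatedContainers) {
        // Only ask YARN for another container if allocated + outstanding
        // requests do not already cover what the job needs.
        return numAllocatedContainers + numPendingContainerRequests < numRequiredContainers;
    }

    /** Called right after a new container request is handed to AMRMClient. */
    public void onContainerRequested() {
        numPendingContainerRequests++;
    }

    /** Called when YARN allocates a container to this application. */
    public void onContainerAllocated() {
        numPendingContainerRequests = Math.max(0, numPendingContainerRequests - 1);
    }
}
{code}

With a counter like this, a burst of SlotManager requests leads to at most numRequiredContainers outstanding asks, instead of re-submitting the whole pending backlog plus one on every new slot request.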