[ https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831427#comment-16831427 ]
Zhenqiu Huang commented on FLINK-12342:
---------------------------------------

After setting the config to 3000 milliseconds, the job with 256 containers can be launched successfully with only 1000+ total requested containers. The number can be reduced further with a larger value, such as 5000 or even higher. So for small jobs with 32 containers, users should just keep the default value so that requests are sent out as soon as possible. For large jobs, users need to tune the parameter to trade off fast requests against the negative impact of repeatedly asking for more containers.

> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
>                 Key: FLINK-12342
>                 URL: https://issues.apache.org/jira/browse/FLINK-12342
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / YARN
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>         Environment: We run the job on Flink release 1.6.3.
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: Screen Shot 2019-04-29 at 12.06.23 AM.png, container.log, flink-1.4.png, flink-1.6.png
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, containers are
> acquired one by one as requests arrive from the SlotManager. The mechanism
> works when the job is small, say fewer than 32 containers. If the job has 256
> containers, they cannot all be allocated immediately, and the pending
> requests in the AMRMClient are not removed accordingly. We observed the
> AMRMClient asking for "current pending requests + 1" (the new request from
> the SlotManager) containers each time, so during the startup of such a job it
> asked for 4000+ containers. If an external dependency issue occurs at the
> same time, for example slow HDFS access, the whole job is blocked without
> getting enough resources and is finally killed by a SlotManager request
> timeout. Thus, we should use the total number of containers already
> requested, rather than the pending requests in the AMRMClient, as the
> threshold for deciding whether one more resource request is needed.
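For illustration, a minimal sketch of the threshold logic proposed in the description: request a container only when the running total (allocated + in flight) falls short of what the SlotManager requires, instead of re-asking for "pending + 1" on every call. All names here (ContainerRequestTracker and its methods) are invented for this sketch and are not Flink's actual API.

{code:java}
// Hypothetical sketch: track the total number of containers asked for,
// rather than consulting AMRMClient's pending list, which is purged lazily.
public class ContainerRequestTracker {
    private int numAllocated;       // containers YARN has already granted
    private int numPendingRequests; // requests sent but not yet fulfilled

    // Called each time the SlotManager asks for one more container.
    // Returns true only if a new AMRMClient request should really be sent.
    public synchronized boolean shouldRequestContainer(int numRequired) {
        // The buggy behavior re-requested "pending + 1" on every call, which
        // snowballed to 4000+ requests for a 256-container job when
        // allocation was slow.
        return numAllocated + numPendingRequests < numRequired;
    }

    public synchronized void onRequestSent()        { numPendingRequests++; }
    public synchronized void onContainerAllocated() { numPendingRequests--; numAllocated++; }
    public synchronized void onContainerCompleted() { numAllocated--; }
}
{code}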
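And a sketch of the tuning described in the comment at the top, assuming the option in question is yarn.heartbeat.container-request-interval (an assumption; the comment does not name the option). The 3000 ms value and the job sizes come from the comment:

{code}
# flink-conf.yaml -- assumed option name, not confirmed by the ticket text.
# Keep the default for small jobs (~32 containers) so requests go out quickly;
# raise the interval for large jobs (e.g. 256 containers), e.g. to 3000 or
# even 5000 ms, to avoid piling up duplicate container requests.
yarn.heartbeat.container-request-interval: 3000
{code}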