Zhenqiu Huang created FLINK-12342:
-------------------------------------

             Summary: Yarn Resource Manager Acquire Too Many Containers
                 Key: FLINK-12342
                 URL: https://issues.apache.org/jira/browse/FLINK-12342
             Project: Flink
          Issue Type: Improvement
          Components: Deployment / YARN
    Affects Versions: 1.8.0, 1.7.2, 1.6.4
            Reporter: Zhenqiu Huang
            Assignee: Zhenqiu Huang


In the current implementation of YarnFlinkResourceManager, a new container is 
requested one at a time whenever a request arrives from the SlotManager. This 
works while the job is small, say fewer than 32 containers. If the job needs 
256 containers, they cannot all be allocated immediately, and the pending 
requests in AMRMClient are not removed accordingly. We observed that AMRMClient 
then asks for the current number of pending requests + 1 (the new request from 
the SlotManager) containers. As a result, during the startup of such a job, it 
asked for 4000+ containers. If an external dependency issue occurs at the same 
time, for example slow HDFS access, the whole job is blocked without getting 
enough resources and is finally killed with a SlotManager request timeout.
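
The growth described above can be illustrated with a small toy model. This is 
only a sketch under simplifying assumptions, not the Flink or Hadoop code: the 
class PendingRequestModel, the method totalAsked and the parameter 
allocatedPerStep are hypothetical names, and the model assumes that every new 
slot request sends an ask equal to the number of pending requests at that 
moment. Its numbers are illustrative and are not meant to reproduce the 4000+ 
figure observed on the cluster.

```java
// Toy model (hypothetical, not Flink/Hadoop code) of how the total number of
// containers asked for grows when pending requests are not drained in time.
class PendingRequestModel {

    /**
     * @param requests         number of slot requests arriving one by one
     * @param allocatedPerStep containers granted (and their pending requests
     *                         removed) between two consecutive slot requests;
     *                         0 models a cluster that cannot allocate in time
     * @return sum of the container counts sent across all asks
     */
    static long totalAsked(int requests, int allocatedPerStep) {
        long total = 0;
        int pending = 0;
        for (int i = 0; i < requests; i++) {
            pending += 1;      // the new request from the slot manager
            total += pending;  // ask = previously pending requests + 1
            // Requests matched by allocated containers leave the pending set.
            pending -= Math.min(pending, allocatedPerStep);
        }
        return total;
    }

    public static void main(String[] args) {
        // Small job, allocation keeps up: every ask is for one container.
        System.out.println(totalAsked(32, 1));   // prints 32
        // Large job, nothing allocated in time: asks are 1 + 2 + ... + 256.
        System.out.println(totalAsked(256, 0));  // prints 32896
    }
}
```

In this model the total stays linear in the number of slot requests as long as 
allocation keeps up, and becomes quadratic once pending requests accumulate, 
which is the behavior the improvement aims to avoid.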



