[ https://issues.apache.org/jira/browse/FLINK-12342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16828949#comment-16828949 ]
Till Rohrmann commented on FLINK-12342:
---------------------------------------

Before diving into the implementation, I would first like to fully understand the problem. Concretely, a YARN reference that explains this behaviour would be good. Otherwise we might simply fix a symptom, or not the problem at all.

> Yarn Resource Manager Acquires Too Many Containers
> --------------------------------------------------
>
>                 Key: FLINK-12342
>                 URL: https://issues.apache.org/jira/browse/FLINK-12342
>             Project: Flink
>          Issue Type: Improvement
>          Components: Deployment / YARN
>    Affects Versions: 1.6.4, 1.7.2, 1.8.0
>        Environment: We run jobs on Flink release 1.6.3.
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> In the current implementation of YarnFlinkResourceManager, containers are acquired one by one as requests arrive from the SlotManager. This mechanism works when the job is small, say fewer than 32 containers. If the job needs 256 containers, they cannot all be allocated immediately, and the pending requests kept in AMRMClient are not removed accordingly. We observed that AMRMClient then asks for the current number of pending requests + 1 (the new request from the SlotManager) containers. During the startup of such a job, it ended up asking for 4000+ containers. If an external dependency issue occurs at the same time, for example slow HDFS access, the whole job is blocked without getting enough resources and is finally killed with a SlotManager request timeout. Therefore, we should use the total number of containers already requested, rather than the pending requests in AMRMClient, as the threshold for deciding whether one more resource request needs to be added.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
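[Editor's note] A minimal sketch of the bookkeeping the issue description proposes: the resource manager keeps its own count of containers requested but not yet allocated, and consults that count instead of the pending-request list inside AMRMClient. The class, field, and method names below (ContainerRequestBookkeeping, numPendingContainerRequests, etc.) are illustrative assumptions, not the actual Flink YarnFlinkResourceManager code.

{code:java}
// Illustrative sketch only; names are hypothetical, not Flink's real API.
// Idea from FLINK-12342: decide whether to request one more container based
// on our own count of outstanding requests, not AMRMClient's pending list.
public class ContainerRequestBookkeeping {

    // Containers requested from YARN but not yet allocated to us.
    private int numPendingContainerRequests = 0;

    /** Called when the SlotManager asks for one more worker. */
    public boolean shouldRequestNewContainer(int numRequiredContainers,
                                             int numAllocatedContainers) {
        // Only ask YARN for another container if allocated + outstanding
        // requests do not already cover what the job needs.
        return numAllocatedContainers + numPendingContainerRequests < numRequiredContainers;
    }

    /** Called right after a new container request is handed to AMRMClient. */
    public void onContainerRequested() {
        numPendingContainerRequests++;
    }

    /** Called when YARN allocates a container to this application. */
    public void onContainerAllocated() {
        numPendingContainerRequests = Math.max(0, numPendingContainerRequests - 1);
    }
}
{code}

With a counter like this, a burst of SlotManager requests leads to at most numRequiredContainers outstanding asks, instead of re-submitting the whole pending backlog plus one on every new slot request.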