[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

Till Rohrmann (JIRA) Tue, 20 Nov 2018 01:13:34 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16692892#comment-16692892
 ]


Till Rohrmann commented on FLINK-10868:
---------------------------------------

Hi [~hpeter], I think this is not super trivially to achieve because to detect 
this situation properly, the RM needs a communication channel to the 
{{Dispatcher}} to tell him about the depleted resource requests. Moreover, we 
would need to fail all currently running jobs and wait for them to reach a 
global terminal state before we can shut down the cluster.

At the moment, Flink assumes that the RM can acquire at some point the 
requested resources and that it should retry in case of a TM failure. In which 
scenario would you like to stop retrying if there is a chance to regain the 
resources and finish your job?

> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as 
> limit of resource acquirement
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does use yarn.maximum-failed-containers as 
> limit of resource acquirement. In worse case, when new start containers 
> consistently fail, YarnResourceManager will goes into an infinite resource 
> acquirement process without failing the job. Together with the 
> https://issues.apache.org/jira/browse/FLINK-10848, It will quick occupy all 
> resources of yarn queue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Commented] (FLINK-10868) Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as limit of resource acquirement

Reply via email to