[ 
https://issues.apache.org/jira/browse/FLINK-10868?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhenqiu Huang updated FLINK-10868:
----------------------------------
    Description: Currently, YarnResourceManager does not use 
yarn.maximum-failed-containers as a limit on resource acquisition. In the worst 
case, when newly started containers consistently fail, YarnResourceManager goes 
into an infinite resource acquisition loop without failing the job. Together 
with https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy 
all resources of the YARN queue.  (was: Currently, YarnResourceManager does not 
use yarn.maximum-failed-containers as a limit on resource acquisition. In the 
worst case, when newly started containers consistently fail, 
YarnResourceManager goes into an infinite resource acquisition loop without 
failing the job. Together with 
https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all 
resources of the YARN queue.

In production, we observed a task manager failing in an HA-enabled Flink job 
while an HDFS failover was in progress. During that period, requests failed 
with "Operation category READ is not supported in state standby", so newly 
acquired task managers kept failing. )

> Flink's Yarn ResourceManager doesn't use yarn.maximum-failed-containers as 
> limit of resource acquirement
> --------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10868
>                 URL: https://issues.apache.org/jira/browse/FLINK-10868
>             Project: Flink
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Zhenqiu Huang
>            Assignee: Zhenqiu Huang
>            Priority: Major
>
> Currently, YarnResourceManager does not use yarn.maximum-failed-containers as 
> a limit on resource acquisition. In the worst case, when newly started 
> containers consistently fail, YarnResourceManager goes into an infinite 
> resource acquisition loop without failing the job. Together with 
> https://issues.apache.org/jira/browse/FLINK-10848, it will quickly occupy all 
> resources of the YARN queue.
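
The missing check described above could look roughly like the sketch below. It is only an illustration of counting failed containers against a configured limit, not Flink's actual YarnResourceManager code: the class FailedContainerLimiter, its method names, and the convention that a negative limit means "unlimited" are all assumptions made for this example. The limit value is assumed to be read from yarn.maximum-failed-containers by the caller.

```java
/**
 * Hypothetical sketch: tracks failed containers against a configured limit.
 * This is NOT part of Flink's API; all names here are illustrative.
 */
final class FailedContainerLimiter {

    // Assumed to be populated from yarn.maximum-failed-containers.
    // Assumption for this sketch: a negative value disables the limit.
    private final int maxFailedContainers;

    // Number of container failures recorded so far.
    private int failedSoFar;

    FailedContainerLimiter(int maxFailedContainers) {
        this.maxFailedContainers = maxFailedContainers;
    }

    /**
     * Records one container failure.
     *
     * @return true if a replacement container may still be requested,
     *         false if the limit is exceeded and the job should be failed
     *         instead of requesting more containers.
     */
    boolean recordFailureAndCheckRetry() {
        failedSoFar++;
        return maxFailedContainers < 0 || failedSoFar <= maxFailedContainers;
    }

    int getFailedSoFar() {
        return failedSoFar;
    }
}
```

A resource manager would call recordFailureAndCheckRetry() each time a container completes with a failure, and stop requesting replacements (failing the job) once it returns false. Without such a check, every failure triggers a new request, which is the unbounded loop this issue reports.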



