zhanglu153 created FLINK-30095:
----------------------------------

             Summary: Flink's JobCluster ResourceManager should throw an 
exception when the failure number of starting worker reaches the maximum 
failure rate
                 Key: FLINK-30095
                 URL: https://issues.apache.org/jira/browse/FLINK-30095
             Project: Flink
          Issue Type: Improvement
    Affects Versions: 1.16.0, 1.15.0, 1.14.0, 1.13.0
            Reporter: zhanglu153


As shown in https://issues.apache.org/jira/browse/FLINK-10868,although 
resourcemanager.start-worker.max-failure-rate and 
resourcemanager.start-worker.retry-interval are set, in a worse case, when new 
start containers consistently fail, YarnResourceManager will goes into an 
infinite resource acquirement process without failing the job. Resources on 
Yarn are continuously occupied and released after a period of time, affecting 
other tasks.

It should be considered that when the failure number of starting worker reaches 
the maximum failure rate, Flink JobCluster ResourceManager will directly throw 
an exception instead of sending a new request to start new worker after a 
period of time. This task does not fail but is always in the running state. 
Users may not be aware that tasks occupy resources on yarn in a timely manner, 
which affects other tasks' failure to obtain resources on yarn.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to