Hi, Recently I found a bug on our YARN cluster that crashes the standby RM during a RM failover, and the bug is triggered by the keeping containers across attempts behavior of applications (see [1], a related issue but the patch is not exactly the fix, because the problem is not on recovery, but the attempt after the recovery).
Since YARN is a fundamental component and a maintenance of it would affect a lot users, as a last resort I wonder if we could modify YarnClusterDescriptor and not to keep containers across attempts. IMHO, Flink application’s state is not dependent on YARN, so there is no state that must be recovered from the previous application attempt. In case of a application master failure, the taskmanagers can be shutdown and the cost is longer recovery time. Please correct me if I’m wrong. Thank you! [1]https://issues.apache.org/jira/browse/YARN-2823 <https://issues.apache.org/jira/browse/YARN-2823> Best, Paul Lam