Hi Devin,

Thanks for your reasoning! It's consistent with my observation, and I fully agree with you.
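For anyone else who hits this, the RM flag you pointed at can also be read programmatically from the client-side configuration, not only from yarn-site.xml. A minimal sketch (the helper class name is made up; the default value shown assumes Hadoop 2.6+):

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class CheckRmRecovery {
        public static void main(String[] args) {
            // YarnConfiguration picks up yarn-site.xml from the classpath,
            // the same way other YARN clients do.
            YarnConfiguration conf = new YarnConfiguration();
            boolean workPreserving = conf.getBoolean(
                "yarn.resourcemanager.work-preserving-recovery.enabled",
                true); // the default is true since Hadoop 2.6
            System.out.println("work-preserving recovery enabled: " + workPreserving);
        }
    }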
Maybe we should create an issue for the Hadoop community if it is not fixed in the master branch.

Best,
Paul Lam

> On 15 Nov 2018, at 11:59, devinduan(段丁瑞) <devind...@tencent.com> wrote:
>
> Hi Paul,
>
> I have reviewed the Hadoop and Flink code.
>
> Flink sets keepContainersAcrossApplicationAttempts to true if
> high-availability is enabled in the Flink configuration (e.g. set to
> zookeeper).
> [screenshot: Catch.jpg]
>
> If you set yarn.resourcemanager.work-preserving-recovery.enabled to false,
> the AM (JobManager) is killed by the ResourceManager and another AM is
> started on failover. Because keepContainersAcrossApplicationAttempts is
> true, the new AM starts from the previous attempt.
> [screenshot: CatchD351.jpg]
>
> But the current ResourceManager is the newly active one, and the
> application has no AppAttempt set yet, so you see the NPE.
>
> I think the Hadoop community should resolve this issue.
>
> Best,
> Devin
>
> From: Paul Lam <paullin3...@gmail.com>
> Date: 2018-11-15 11:31
> To: devinduan(段丁瑞) <devind...@tencent.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: What if not to keep containers across attempts in HA setup? (Internet mail)
>
> Hi Devin,
>
> Thanks for the pointer, it works!
>
> But I have no permission to change the YARN configuration in the production
> environment myself, and applying the new setting would require a detailed
> investigation by the Hadoop team, so I'm still interested in the difference
> between keeping and not keeping containers across application attempts.
>
> Best,
> Paul Lam
>
>> On 13 Nov 2018, at 17:27, devinduan(段丁瑞) <devind...@tencent.com> wrote:
>>
>> Hi Paul,
>>
>> Could you check your YARN property
>> "yarn.resourcemanager.work-preserving-recovery.enabled"? If its value is
>> false, set it to true and try again.
>>
>> Best,
>> Devin
>>
>> From: Paul Lam <paullin3...@gmail.com>
>> Date: 2018-11-13 12:55
>> To: Flink ML <user@flink.apache.org>
>> Subject: What if not to keep containers across attempts in HA setup? (Internet mail)
>>
>> Hi,
>>
>> Recently I found a bug on our YARN cluster that crashes the standby RM
>> during an RM failover. The bug is triggered by applications that keep
>> containers across application attempts (see [1], a related issue, but its
>> patch is not exactly the fix, because the problem is not in the recovery
>> itself but in the attempt after the recovery).
>>
>> Since YARN is a fundamental component and maintenance on it would affect
>> many users, as a last resort I wonder if we could modify
>> YarnClusterDescriptor so that it does not keep containers across attempts.
>>
>> IMHO, a Flink application's state does not depend on YARN, so there is no
>> state that must be recovered from the previous application attempt. In case
>> of an application master failure, the TaskManagers can be shut down, and
>> the cost is a longer recovery time.
>>
>> Please correct me if I'm wrong. Thank you!
>>
>> [1] https://issues.apache.org/jira/browse/YARN-2823
>>
>> Best,
>> Paul Lam
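>>
>> P.S. To make the proposal concrete, the change I have in mind would be
>> roughly the following. This is only an illustrative sketch against the
>> YARN client API, not the actual Flink code, and the class and method
>> names are made up:
>>
>>     import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
>>
>>     class AttemptConfigSketch {
>>         // Keep multiple AM attempts so HA can restart the JobManager,
>>         // but stop asking YARN to carry the TaskManager containers over
>>         // to the next attempt; the new attempt requests fresh containers.
>>         static void configureAttempts(ApplicationSubmissionContext appContext) {
>>             appContext.setMaxAppAttempts(2);
>>             appContext.setKeepContainersAcrossApplicationAttempts(false);
>>         }
>>     }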