Hi Paul,
    Could you check out your YARN property  
"yarn.resourcemanager.work-preserving-recovery.enabled"?
    if value is false, set true and try it again.
Best,
Devin

发件人: Paul Lam<mailto:paullin3...@gmail.com>
发送时间: 2018-11-13 12:55
收件人: Flink ML<mailto:user@flink.apache.org>
主题: What if not to keep containers across attempts in HA setup?(Internet mail)
Hi,

Recently I found a bug on our YARN cluster that crashes the standby RM during a 
RM failover, and
the bug is triggered by the keeping containers across attempts behavior of 
applications (see [1], a related
issue but the patch is not exactly the fix, because the problem is not on 
recovery, but the attempt after
the recovery).

Since YARN is a fundamental component and a maintenance of it would affect a 
lot users, as a last resort
I wonder if we could modify YarnClusterDescriptor and not to keep containers 
across attempts.

IMHO, Flink application’s state is not dependent on YARN, so there is no state 
that must be recovered
from the previous application attempt. In case of a application master failure, 
the taskmanagers can be
shutdown and the cost is longer recovery time.

Please correct me if I’m wrong. Thank you!

[1]https://issues.apache.org/jira/browse/YARN-2823

Best,
Paul Lam

Reply via email to