Hi,

Recently I've seen a situation where a JobManager received a stop signal
from the YARN RM but failed to exit. It entered a restart loop and kept
failing because the TaskManager containers had been disconnected (killed
by the RM as well), before finally exiting when it hit the limit of the
restart policy. This further resulted in the Flink job being marked with
final status FAILED and the cleanup of its ZooKeeper paths, so when a new
JobManager started up it found no checkpoint to restore and performed a
stateless restart. For context, the application runs with Flink 1.7.1 in
HA job cluster mode on Hadoop 2.6.5.
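
For reference, the restart limit I mentioned comes from a fixed-delay
restart strategy. A sketch of the relevant flink-conf.yaml settings (the
values here are illustrative, not our exact production settings):

```yaml
# Restart strategy: give up after a bounded number of attempts
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 10
restart-strategy.fixed-delay.delay: 10 s

# ZooKeeper HA setup (paths below this get cleaned up on final FAILED status)
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
high-availability.storageDir: hdfs:///flink/ha/
```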

As far as I can remember, I've seen a similar issue related to JobManager
fencing, but I searched JIRA and couldn't find it. It would be great if
someone could point me in the right direction. Any other comments are
welcome as well! Thanks!

Best,
Paul Lam
