Hi,

Recently I ran into a situation where a JobManager received a stop signal from the YARN RM but failed to exit and got stuck in a restart loop. It kept failing because the TaskManager containers had been disconnected (killed by the RM as well), and it finally exited only when it hit the limit of the restart policy. As a result, the Flink job was marked with final status FAILED and its ZooKeeper paths were cleaned up, so when a new JobManager started up it found no checkpoint to restore and performed a stateless restart. For context, the application runs Flink 1.7.1 in HA job cluster mode on Hadoop 2.6.5.
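For reference, the deployment uses ZooKeeper-based HA together with a bounded restart strategy, roughly along these lines (a hypothetical sketch of the relevant flink-conf.yaml settings, not the exact config; hostnames, paths, and values below are placeholders):

```yaml
# HA via ZooKeeper -- checkpoint/job graph pointers live under the ZK paths
high-availability: zookeeper
high-availability.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181   # placeholder hosts
high-availability.storageDir: hdfs:///flink/ha                   # placeholder path

# Bounded restart policy -- once the attempts are exhausted,
# the job transitions to final status FAILED
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
```

The point of concern is the interaction of the last two settings with HA cleanup: exhausting the restart attempts marks the job as terminally failed, which triggers removal of the ZooKeeper paths that the next JobManager would need for checkpoint recovery.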
As far as I can remember, I've seen a similar issue related to JobManager fencing, but I searched JIRA and couldn't find it. It would be great if someone could point me in the right direction. Any other comments are welcome as well! Thanks!

Best,
Paul Lam