Hi Paul, I have gone through the code and found that the root cause may be that `YarnResourceManager` cleaned up the application staging directory. When unregistering from the YARN ResourceManager fails, a new attempt is launched, and it fails quickly because localization fails.
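To make the suspected ordering concrete, here is a minimal Java sketch of the failure mode. This is not the actual Flink 1.7 source; the class and field names are illustrative, though `AMRMClient#unregisterApplicationMaster` and `FileSystem#delete` are the real Hadoop APIs involved:

```java
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.AMRMClient;

// Hypothetical sketch of the suspected shutdown ordering, not Flink's code.
class YarnShutdownSketch {
    private final AMRMClient<AMRMClient.ContainerRequest> rmClient;
    private final FileSystem fs;
    private final Path stagingDir; // e.g. the .flink/<appId> directory on HDFS

    YarnShutdownSketch(AMRMClient<AMRMClient.ContainerRequest> rmClient,
                       FileSystem fs, Path stagingDir) {
        this.rmClient = rmClient;
        this.fs = fs;
        this.stagingDir = stagingDir;
    }

    void shutdown(FinalApplicationStatus status, String diagnostics) throws Exception {
        try {
            // Step 1: unregister from the YARN ResourceManager. If this call
            // fails, YARN treats the attempt as failed and starts a new one.
            rmClient.unregisterApplicationMaster(status, diagnostics, null);
        } finally {
            // Step 2: the staging directory is deleted unconditionally, so the
            // new attempt cannot localize the Flink dist / job jars and dies
            // early with a localization failure.
            fs.delete(stagingDir, true);
        }
    }
}
```

Under this ordering, any exception in step 1 still triggers step 2, which would explain why every follow-up attempt fails during localization.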
I think this is a bug that will happen whenever unregistering the application fails. Could you share some JobManager logs from the second or a following attempt, so that I can confirm it?

Best,
Yang

Paul Lam <paullin3...@gmail.com> wrote on Sat, Dec 14, 2019 at 11:02 AM:

> Hi,
>
> Recently I've seen a situation where a JobManager received a stop signal
> from the YARN RM but failed to exit and got into a restart loop, and kept
> failing because the TaskManager containers had been disconnected (killed
> by the RM as well) before it finally exited upon hitting the limit of the
> restart policy. This further resulted in the Flink job being marked with
> final status FAILED and the cleanup of its ZooKeeper paths, so when a new
> JobManager started up, it found no checkpoint to restore and performed a
> stateless restart. In addition, the application runs Flink 1.7.1 in HA job
> cluster mode on Hadoop 2.6.5.
>
> As far as I can remember, I've seen a similar issue that relates to the
> fencing of the JobManager, but I searched JIRA and couldn't find it. It
> would be great if someone could point me in the right direction. Any
> comments are also welcome! Thanks!
>
> Best,
> Paul Lam