Hi Paul,

I have gone through the code and found that the root cause may be that
`YarnResourceManager` cleaned up the application staging directory. When
unregistering from the YARN ResourceManager fails, a new attempt will be
launched and fail quickly because localization fails.
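To illustrate what I mean, here is a simplified, hypothetical sketch of the
ordering I suspect. This is NOT the actual Flink source; the class and field
names are made up for illustration:

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.client.api.async.AMRMClientAsync;

// Hypothetical sketch, not the real Flink code, showing the suspected
// ordering between AM unregistration and staging-directory cleanup.
class DeregistrationSketch {

    private AMRMClientAsync<?> resourceManagerClient; // YARN AM-RM client
    private FileSystem fs;                             // HDFS holding the staging dir
    private Path stagingDir;                           // e.g. .../.flink/<applicationId>

    void deregisterApplication(FinalApplicationStatus finalStatus, String diagnostics) {
        try {
            // Unregister the ApplicationMaster from the YARN ResourceManager.
            // If this call fails, YARN launches a new application attempt.
            resourceManagerClient.unregisterApplicationMaster(finalStatus, diagnostics, "");
        } catch (Exception e) {
            // The failure is only logged; the cleanup below still happens.
        }

        try {
            // The staging directory (flink-dist jar, job jars, flink-conf.yaml) is
            // deleted even though the unregistration failed, so the new attempt
            // cannot localize its resources and fails right after launch.
            fs.delete(stagingDir, true);
        } catch (IOException e) {
            // Ignore cleanup failures.
        }
    }
}

If that matches the real code path, keeping the staging directory until the
unregistration has succeeded should avoid the quickly failing re-attempts.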

I think it is a bug, and it will happen whenever unregistering the application
fails. Could you share some JobManager logs from the second or a following
attempt so that I could confirm the bug?


Best,
Yang

Paul Lam <paullin3...@gmail.com> wrote on Sat, Dec 14, 2019 at 11:02 AM:

> Hi,
>
> Recently I've seen a situation where a JobManager received a stop signal
> from the YARN RM but failed to exit and got into a restart loop. It kept
> failing because the TaskManager containers had been disconnected (killed
> by the RM as well), before finally exiting when it hit the limit of the
> restart policy. This further resulted in the Flink job being marked with
> final status FAILED and the cleanup of the ZooKeeper paths, so when a new
> JobManager started up it found no checkpoint to restore and performed a
> stateless restart. In addition, the application runs with Flink 1.7.1 in
> HA job cluster mode on Hadoop 2.6.5.
>
> If I remember correctly, I've seen a similar issue that relates to
> JobManager fencing, but I searched JIRA and couldn't find it. It would be
> great if someone could point me in the right direction. Any comments are
> also welcome! Thanks!
>
> Best,
> Paul Lam
>
