Re: Jobmanager not properly fenced when killed by YARN RM

2019-12-16 Thread Yang Wang
Hi Paul, thanks for sharing your analysis. I think you are right. When the YARN NodeManager crashes, the first jobmanager running on it is not killed. However, once the YARN ResourceManager finds that the NodeManager is lost, it launches a new jobmanager attempt. Before FLINK-14010

Re: Jobmanager not properly fenced when killed by YARN RM

2019-12-16 Thread Paul Lam
Hi Yang, thanks a lot for your reasoning. You are right about the YARN cluster. The NodeManager had crashed, which is why the RM would kill the containers on that machine after a heartbeat timeout (about 10 min) with the NodeManager. Actually the attached logs are from the first/old jobmanager,

Re: Jobmanager not properly fenced when killed by YARN RM

2019-12-16 Thread Yang Wang
Hi Paul, I found lots of "Failed to stop Container " logs in the jobmanager.log. It seems that the YARN cluster is not working normally, so the Flink YarnResourceManager may also have failed to unregister the application. If the application is unregistered successfully, no new attempt will be started. The second and following job

Re: Jobmanager not properly fenced when killed by YARN RM

2019-12-15 Thread Yang Wang
Hi Paul, I have gone through the code and found that the root cause may be that `YarnResourceManager` cleaned up the application staging directory. When unregistering from the YARN ResourceManager fails, a new attempt is launched and fails quickly because localization fails. I think it is