Hi Yang,

Thanks a lot for your reasoning. You are right about the YARN cluster. The 
NodeManager crashed, which is why the RM killed the containers on that 
machine after the heartbeat with the NodeManager timed out (about 10 min).
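
For reference, the ~10 min timeout seems to match YARN's NodeManager liveness 
expiry setting. If it helps, a small snippet like the one below could confirm 
the value on the cluster (I'm assuming the standard 
yarn.nm.liveness-monitor.expiry-interval-ms property, which defaults to 
600000 ms):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class NmExpiryCheck {
        public static void main(String[] args) {
            // Loads yarn-default.xml / yarn-site.xml from the classpath.
            Configuration conf = new YarnConfiguration();
            // How long the RM waits for NodeManager heartbeats before it
            // declares the node dead and kills its containers.
            long expiryMs = conf.getLong(
                    "yarn.nm.liveness-monitor.expiry-interval-ms", 600000L);
            System.out.println("NM liveness expiry interval: " + expiryMs + " ms");
        }
    }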

Actually, the attached logs are from the first/old jobmanager, and I couldn't 
find the log about YARN application unregistration in any of the logs. I think 
the Flink resource manager may not have tried to unregister the application 
(which would also remove the HA service state) when it got the shutdown 
request, because the Flink job was still running well at that moment.

I dug a bit deeper and found that the root cause might be that the Flink 
ResourceManager was taking too long to shut down and didn't change the Flink 
job status (so the JobManager would keep working even after the AM was killed 
by the RM). I've also found the related issue mentioned in my previous 
mail [1].

Thanks a lot for your help!

[1] https://issues.apache.org/jira/browse/FLINK-14010 

Best,
Paul Lam

> On 16 Dec 2019, at 20:35, Yang Wang <danrtsey...@gmail.com> wrote:
> 
> Hi Paul,
> 
> I found lots of "Failed to stop Container" logs in the jobmanager.log. It 
> seems that the Yarn cluster is not working normally. So the Flink 
> YarnResourceManager may also have failed to unregister the app. If we 
> unregister the app successfully, no new attempt will be started.
> 
> The second and following jobmanager attempts started and failed because the 
> staging directory on HDFS had been cleaned up. So could you look for the log 
> "Could not unregister the application master" in all the jobmanager logs, 
> including the first one?
> 
> 
> 
> Best,
> Yang
> 
> Paul Lam <paullin3...@gmail.com> wrote on Mon, 16 Dec 2019 at 19:56:
> Hi Yang,
> 
> Thanks a lot for your reply!
> 
> I might not have made myself clear: the new jobmanager in the new YARN 
> application attempt did start successfully. And unluckily, I didn't find any 
> logs written by YarnResourceManager in the jobmanager logs.
> 
> The jobmanager logs are in the attachment (with some subtask status change 
> logs removed and env info censored).
> 
> Thanks!
> 
> 
> Best,
> Paul Lam
> 
>> On 16 Dec 2019, at 14:42, Yang Wang <danrtsey...@gmail.com> wrote:
>> 
>> Hi Paul,
>> 
>> I have gone through the code and found that the root cause may be that 
>> `YarnResourceManager` cleaned up the application staging directory. When 
>> unregistering from the Yarn ResourceManager fails, a new attempt will be 
>> launched and will fail quickly because localization fails.
>> 
>> I think it is a bug, and it will happen when unregistering the application 
>> fails. Could you share some jobmanager logs of the second or a following 
>> attempt so that I could confirm the bug?
>> 
>> 
>> Best,
>> Yang
>> 
>> Paul Lam <paullin3...@gmail.com> wrote on Sat, 14 Dec 2019 at 11:02:
>> Hi,
>> 
>> Recently I've seen a situation where a JobManager received a stop signal 
>> from the YARN RM but failed to exit and got into a restart loop. It kept 
>> failing because the TaskManager containers were disconnected (killed by the 
>> RM as well), and it finally exited when it hit the limit of the restart 
>> policy. This further resulted in the Flink job being marked with final 
>> status FAILED and cleanup of the ZooKeeper paths, so when a new JobManager 
>> started up it found no checkpoint to restore and performed a stateless 
>> restart. The application runs with Flink 1.7.1 in HA job cluster mode on 
>> Hadoop 2.6.5.
>> 
>> As far as I can remember, I've seen a similar issue that relates to the 
>> fencing of the JobManager, but I searched JIRA and couldn't find it. It 
>> would be great if someone could point me in the right direction. Any 
>> comments are also welcome! Thanks!
>> 
>> Best,
>> Paul Lam
> 
