GitHub user sihuazhou commented on the issue:

https://github.com/apache/flink/pull/5931

Hi @GJL, could the cause be the same as in the previous PR for this ticket? That is, the container was set up and connected to the ResourceManager successfully, but the TM was killed before it connected to the JobManager. In this case, even though there are enough TMs, the JobManager won't fire any new request, and the ResourceManager doesn't know that the container it assigned to the JobManager has been killed either. So neither the JobManager nor the ResourceManager does anything except wait for the timeout. What do you think?
---