Hi Gary,

I faced a similar problem yesterday, but I don't know the cause yet. Here is what I observed:

- At about 2:57, one of my EMR execution nodes (IP ...99) got disconnected from the YARN ResourceManager (I could no longer see that node on the RM), even though the node was still running. <<< This is a separate issue, but I believe it lies with YARN.
- About 8 hours later (between 10:00 and 11:00), I turned the problematic EMR core node off. AWS spun up another node and added it to the cluster as a replacement. The YARN RM soon recognized the new node and added it to its list of available nodes. However, the JM seemed unable to do anything after that. It kept trying to start the job, failing after the timeout with that "no resource available" exception again and again. No JobManager logs have been recorded since 2:57:15, though.

I am attaching the logs collected via "yarn logs --applicationId <appId>" here, but it seems I still missed something.

I am using Flink 1.7.1, with the yarn-site configuration yarn.resourcemanager.am.max-attempts=5. All Flink configuration options are at their default values.

Thanks and best regards,
Averell

flink.log <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/flink.log>