Hello everyone,

I am testing the high availability of Flink on YARN on an AWS EMR cluster. My setup is an EMR cluster with one master node and 3 core nodes (each with 16 vCores), with ZooKeeper running on all nodes. The YARN session was created with:

    flink-yarn-session -n 2 -s 8 -jm 1024m -tm 20g

A job with a parallelism of 16 was then submitted.
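For reference, the HA-related settings in flink-conf.yaml follow the standard ZooKeeper setup, roughly as below (the quorum hosts, storage directory and attempt count here are placeholders, not my exact values):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: node1:2181,node2:2181,node3:2181
    high-availability.storageDir: hdfs:///flink/recovery/
    high-availability.zookeeper.path.root: /flink
    yarn.application-attempts: 10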
I executed the test by terminating the core node on which the JobManager was running (using Linux "init 0"). The first few restarts worked well: a new JobManager was elected and the job resumed properly. However, after some restarts, the new JobManager could not acquire the resources it needed any more (only one TaskManager, on the node with IP .81, was shown in the Task Managers GUI).
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Flink.png>

I kept getting the error message "org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: Could not allocate all requires slots within timeout of 300000 ms. Slots required: 108, slots allocated: 60".

Below is what the YARN Resource Manager shows:
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Yarn.png>

According to that screenshot, there are still two TaskManagers running (one on each of the hosts .88 and .81), which suggests the one on .88 was not cleaned up properly. If that is the case, how can I clean it up?

I also wonder: when the server hosting the JobManager crashes, is the whole job restarted, or does the new JobManager try to reconnect to the running TaskManagers and resume the job?

Thanks and regards,
Averell
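P.S. In case it is relevant: would the proper way to check such a leftover TaskManager be something like the following, run from the master node? (The application and attempt IDs below are just placeholders for the session's actual IDs.)

    yarn application -list
    yarn applicationattempt -list application_1234567890123_0001
    yarn container -list appattempt_1234567890123_0001_000001

And if a stale container does show up on .88, should I then just ssh to that node and kill the TaskManager JVM found via jps, or is there a cleaner way?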