Hello everyone,

I am testing Flink's High Availability on YARN on an AWS EMR cluster.
My setup is an EMR cluster with one master node and 3 core nodes (each with
16 vCores). ZooKeeper is running on all nodes.
The YARN session was created with: flink-yarn-session -n 2 -s 8 -jm 1024m -tm
20g
A job with parallelism of 16 was submitted.
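For completeness, these are roughly the HA-related settings in my
flink-conf.yaml (the ZooKeeper hosts, the storage path and the attempt count
below are placeholders, not my exact values):

    high-availability: zookeeper
    high-availability.zookeeper.quorum: <zk-host-1>:2181,<zk-host-2>:2181,<zk-host-3>:2181
    high-availability.storageDir: hdfs:///flink/recovery/
    yarn.application-attempts: 10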

I ran the test by terminating the core node on which the JobManager was
running (using Linux "init 0"). The first few restarts worked well: a new
JobManager was elected and the job resumed properly.
However, after some restarts, the new JobManager could no longer acquire the
resources it needed (only one TaskManager, on the node with IP .81, was shown
in the Task Managers GUI).
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Flink.png>
 

I kept getting the error message
"org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
Could not allocate all requires slots within timeout of 300000 ms. Slots
required: 108, slots allocated: 60".
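For cross-checking, the registered TaskManagers and their slot counts should
also be visible through the JobManager's REST API (the host below is a
placeholder, and 8081 is just the default web UI port):

    curl http://<jobmanager-host>:8081/taskmanagers
    curl http://<jobmanager-host>:8081/overview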

Below is what is shown in the YARN Resource Manager.
<http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t1586/Yarn.png>
 

As per that screenshot, it looks like there are still 2 TaskManagers running
(one on each of the hosts .88 and .81), which suggests the one on .88 has not
been cleaned up properly. If so, how can I clean it up?
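For reference, the containers that YARN itself still reports as running can be
listed with the standard YARN CLI (the attempt id below is a placeholder):

    yarn application -list
    yarn container -list <application-attempt-id>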

I also wonder: when the server hosting the JobManager crashes, is the whole
job restarted from scratch, or does the newly elected JobManager reconnect to
the still-running TaskManagers to resume the job?


Thanks and regards,
Averell

 


