Hi,

I’m trying to test HA on a 3-node Flink cluster (48 task slots in total). I started a job with parallelism = 32 and waited a few seconds so that all nodes were doing work. I then shut down the node that had the leader job manager; by "shut down" I mean I powered off the virtual machine running it. I monitored the logs to see what was going on and saw that ZooKeeper elected a new leader. I also saw a log entry about recovering jobs, but nothing actually happened after that. Here’s the job manager log from the node that became the leader:
11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://[email protected]:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://[email protected]:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://[email protected]:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://[email protected]:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.

I waited 10 minutes after that last log and there was no change. And here’s the task-manager log from the same node:

11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://[email protected]:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://[email protected]:56023/user/jobmanager), starting network stack and library cache.
11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e

Is this a bug?

Thanks,
Ali
