No job recovery after job manager failure

Kashmar, Ali Wed, 16 Dec 2015 08:28:25 -0800

Hi,

I’m trying to test HA on a 3-node Flink cluster (48 task slots in total). I 
started a job with parallelism = 32 and waited a few seconds so that all nodes 
were doing work. I then shut down the node running the leader job manager; by 
"shut down" I mean I powered off its virtual machine. I monitored the logs and 
saw that ZooKeeper elected a new leader. I also saw a log line about 
recovering jobs, but nothing actually happened after that. Here’s the job 
manager log from the node that became the leader:


11:06:43,448 INFO  org.apache.flink.runtime.jobmanager.JobManager               
 - JobManager akka.tcp://[email protected]:56023/user/jobmanager was 
granted leadership with leader session ID 
Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
11:06:45,912 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever      
 - New leader reachable under 
akka.tcp://[email protected]:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
11:06:45,963 INFO  org.apache.flink.runtime.instance.InstanceManager            
 - Registered TaskManager at 192.168.200.174 
(akka.tcp://[email protected]:52324/user/taskmanager) as 
e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. 
Current number of alive task slots is 16.
11:06:45,975 INFO  org.apache.flink.runtime.instance.InstanceManager            
 - Registered TaskManager at 192.168.200.175 
(akka.tcp://[email protected]:46612/user/taskmanager) as 
766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. 
Current number of alive task slots is 32.
11:08:25,925 INFO  org.apache.flink.runtime.jobmanager.JobManager               
 - Recovering all jobs.


I waited 10 minutes after that last log line and nothing changed. Here’s the 
task manager log from the same node:


11:06:45,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager             
 - Trying to register at JobManager 
akka.tcp://[email protected]:56023/user/jobmanager (attempt 1, timeout: 500 
milliseconds)
11:06:45,983 INFO  org.apache.flink.runtime.taskmanager.TaskManager             
 - Successful registration at JobManager 
(akka.tcp://[email protected]:56023/user/jobmanager), starting network 
stack and library cache.
11:06:45,988 INFO  org.apache.flink.runtime.io.network.netty.NettyClient        
 - Successful initialization (took 4 ms).
11:06:45,994 INFO  org.apache.flink.runtime.io.network.netty.NettyServer        
 - Successful initialization (took 6 ms). Listening on SocketAddress 
/192.168.200.174:39322.
11:06:45,994 INFO  org.apache.flink.runtime.taskmanager.TaskManager             
 - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB 
cache.
11:06:45,995 INFO  org.apache.flink.runtime.blob.BlobCache                      
 - Created BLOB cache storage directory 
/tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e


Is this a bug?

Thanks,
Ali
