Hey Ali, can you send me the complete logs?
I don’t think it’s possible via the mailing list. Just send it to my private email u...@apache.org.

– Ufuk

> On 16 Dec 2015, at 17:26, Kashmar, Ali <ali.kash...@emc.com> wrote:
>
> Hi,
>
> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I
> started a job with parallelism = 32 and waited for a few seconds so that all
> nodes are doing work. I then shut down the node that had the leader job
> manager, and by shut down I mean I powered off the virtual machine running
> it. I monitored the logs to see what was going on and I saw that zookeeper
> has elected a new leader. I also saw a log for recovering jobs, but nothing
> actually happens. Here’s the job manager log from the node that became the
> leader:
>
> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager was granted leadership with leader session ID Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever - New leader reachable under akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.174 (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. Current number of alive task slots is 16.
> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager - Registered TaskManager at 192.168.200.175 (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. Current number of alive task slots is 32.
> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager - Recovering all jobs.
>
>
> I waited 10 minutes after that last log and there was no change. And here’s
> the task-manager log from the same node:
>
>
> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager - Trying to register at JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout: 500 milliseconds)
> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager - Successful registration at JobManager (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network stack and library cache.
> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful initialization (took 4 ms).
> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 6 ms). Listening on SocketAddress /192.168.200.174:39322.
> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager - Determined BLOB server address to be /192.168.200.174:48746. Starting BLOB cache.
> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache - Created BLOB cache storage directory /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>
>
> Is this a bug?
>
> Thanks,
> Ali