The issue was twofold: 1) the job used a local state backend, so its state was lost together with the VM, and 2) recovery did not log any exception.
2) has been addressed in this PR: https://github.com/apache/flink/pull/1472

– Ufuk

> On 17 Dec 2015, at 15:26, Ufuk Celebi <u...@apache.org> wrote:
>
> As an update: I’m investigating this. Ali sent me the log files.
>
>> On 16 Dec 2015, at 18:15, Ufuk Celebi <u...@apache.org> wrote:
>>
>> Hey Ali,
>>
>> can you send me the complete logs?
>>
>> I don’t think it’s possible via the mailing list. Just send it to my private
>> email u...@apache.org.
>>
>> – Ufuk
>>
>>> On 16 Dec 2015, at 17:26, Kashmar, Ali <ali.kash...@emc.com> wrote:
>>>
>>> Hi,
>>>
>>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I
>>> started a job with parallelism = 32 and waited for a few seconds so that
>>> all nodes are doing work. I then shut down the node that had the leader job
>>> manager, and by shut down I mean I powered off the virtual machine running
>>> it. I monitored the logs to see what was going on and I saw that zookeeper
>>> has elected a new leader. I also saw a log for recovering jobs, but nothing
>>> actually happens. Here’s the job manager log from the node that became the
>>> leader:
>>>
>>> 11:06:43,448 INFO org.apache.flink.runtime.jobmanager.JobManager
>>> - JobManager akka.tcp://flink@192.168.200.174:56023/user/jobmanager
>>> was granted leadership with leader session ID
>>> Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
>>> 11:06:45,912 INFO org.apache.flink.runtime.webmonitor.JobManagerRetriever
>>> - New leader reachable under
>>> akka.tcp://flink@192.168.200.174:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
>>> 11:06:45,963 INFO org.apache.flink.runtime.instance.InstanceManager
>>> - Registered TaskManager at 192.168.200.174
>>> (akka.tcp://flink@192.168.200.174:52324/user/taskmanager) as
>>> e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1.
>>> Current number of alive task slots is 16.
>>> 11:06:45,975 INFO org.apache.flink.runtime.instance.InstanceManager
>>> - Registered TaskManager at 192.168.200.175
>>> (akka.tcp://flink@192.168.200.175:46612/user/taskmanager) as
>>> 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2.
>>> Current number of alive task slots is 32.
>>> 11:08:25,925 INFO org.apache.flink.runtime.jobmanager.JobManager
>>> - Recovering all jobs.
>>>
>>>
>>> I waited 10 minutes after that last log and there was no change. And here’s
>>> the task-manager log from the same node:
>>>
>>>
>>> 11:06:45,914 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>> - Trying to register at JobManager
>>> akka.tcp://flink@192.168.200.174:56023/user/jobmanager (attempt 1, timeout:
>>> 500 milliseconds)
>>> 11:06:45,983 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>> - Successful registration at JobManager
>>> (akka.tcp://flink@192.168.200.174:56023/user/jobmanager), starting network
>>> stack and library cache.
>>> 11:06:45,988 INFO org.apache.flink.runtime.io.network.netty.NettyClient
>>> - Successful initialization (took 4 ms).
>>> 11:06:45,994 INFO org.apache.flink.runtime.io.network.netty.NettyServer
>>> - Successful initialization (took 6 ms). Listening on SocketAddress
>>> /192.168.200.174:39322.
>>> 11:06:45,994 INFO org.apache.flink.runtime.taskmanager.TaskManager
>>> - Determined BLOB server address to be /192.168.200.174:48746.
>>> Starting BLOB cache.
>>> 11:06:45,995 INFO org.apache.flink.runtime.blob.BlobCache
>>> - Created BLOB cache storage directory
>>> /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>>>
>>>
>>> Is this a bug?
>>>
>>> Thanks,
>>> Ali
>>
>