Re: No job recovery after job manager failure

Ufuk Celebi Mon, 21 Dec 2015 02:19:11 -0800

There were two issues: 1) the job used a local state backend, so the state 
needed for recovery was lost together with the VM, and 2) the recovery did not 
log any exception.


2) has been addressed in this PR: https://github.com/apache/flink/pull/1472
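
For 1), the recovery state has to live on storage that survives the loss of a 
machine. A minimal flink-conf.yaml sketch, assuming the key names from the 
JobManager HA documentation of the 0.10 line; the ZooKeeper hosts and HDFS 
paths are placeholders, not values from Ali's cluster:

```yaml
# Assumed key names (Flink 0.10-era HA docs); check them against your version.
recovery.mode: zookeeper
# Placeholder quorum addresses.
recovery.zookeeper.quorum: zk1:2181,zk2:2181,zk3:2181
# Job manager recovery metadata on shared storage (placeholder path).
recovery.zookeeper.storageDir: hdfs:///flink/recovery
# Checkpoints on a file system every node can reach, not local to one VM.
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
```

With both directories on a shared file system, a newly elected job manager 
should still be able to read the job and checkpoint data after the old 
leader's VM is gone.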

– Ufuk

> On 17 Dec 2015, at 15:26, Ufuk Celebi <[email protected]> wrote:
> 
> As an update: I’m investigating this. Ali sent me the log files.
> 
>> On 16 Dec 2015, at 18:15, Ufuk Celebi <[email protected]> wrote:
>> 
>> Hey Ali,
>> 
>> can you send me the complete logs?
>> 
>> I don’t think it’s possible via the mailing list. Just send it to my private 
>> email [email protected].
>> 
>> – Ufuk
>> 
>>> On 16 Dec 2015, at 17:26, Kashmar, Ali <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I’m trying to test HA on a 3-node Flink cluster (task slots = 48). So I 
>>> started a job with parallelism = 32 and waited for a few seconds so that 
>>> all nodes are doing work. I then shut down the node that had the leader job 
>>> manager, and by shut down I mean I powered off the virtual machine running 
>>> it. I monitored the logs to see what was going on and saw that ZooKeeper 
>>> had elected a new leader. I also saw a log entry for recovering jobs, but 
>>> nothing actually happened. Here’s the job manager log from the node that became the 
>>> leader:
>>> 
>>> 11:06:43,448 INFO  org.apache.flink.runtime.jobmanager.JobManager           
>>>      - JobManager akka.tcp://[email protected]:56023/user/jobmanager 
>>> was granted leadership with leader session ID 
>>> Some(16eb0d0a-2cae-473e-aa41-679a87d3669b).
>>> 11:06:45,912 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever  
>>>      - New leader reachable under 
>>> akka.tcp://[email protected]:56023/user/jobmanager:16eb0d0a-2cae-473e-aa41-679a87d3669b.
>>> 11:06:45,963 INFO  org.apache.flink.runtime.instance.InstanceManager        
>>>      - Registered TaskManager at 192.168.200.174 
>>> (akka.tcp://[email protected]:52324/user/taskmanager) as 
>>> e8720b15c63d508e8dc19b19e70d4c88. Current number of registered hosts is 1. 
>>> Current number of alive task slots is 16.
>>> 11:06:45,975 INFO  org.apache.flink.runtime.instance.InstanceManager        
>>>      - Registered TaskManager at 192.168.200.175 
>>> (akka.tcp://[email protected]:46612/user/taskmanager) as 
>>> 766a7938746c2d41e817e2ceb42a9a64. Current number of registered hosts is 2. 
>>> Current number of alive task slots is 32.
>>> 11:08:25,925 INFO  org.apache.flink.runtime.jobmanager.JobManager           
>>>      - Recovering all jobs.
>>> 
>>> 
>>> I waited 10 minutes after that last log and there was no change. And here’s 
>>> the task-manager log from the same node:
>>> 
>>> 
>>> 11:06:45,914 INFO  org.apache.flink.runtime.taskmanager.TaskManager         
>>>      - Trying to register at JobManager 
>>> akka.tcp://[email protected]:56023/user/jobmanager (attempt 1, timeout: 
>>> 500 milliseconds)
>>> 11:06:45,983 INFO  org.apache.flink.runtime.taskmanager.TaskManager         
>>>      - Successful registration at JobManager 
>>> (akka.tcp://[email protected]:56023/user/jobmanager), starting network 
>>> stack and library cache.
>>> 11:06:45,988 INFO  org.apache.flink.runtime.io.network.netty.NettyClient    
>>>      - Successful initialization (took 4 ms).
>>> 11:06:45,994 INFO  org.apache.flink.runtime.io.network.netty.NettyServer    
>>>      - Successful initialization (took 6 ms). Listening on SocketAddress 
>>> /192.168.200.174:39322.
>>> 11:06:45,994 INFO  org.apache.flink.runtime.taskmanager.TaskManager         
>>>      - Determined BLOB server address to be /192.168.200.174:48746. 
>>> Starting BLOB cache.
>>> 11:06:45,995 INFO  org.apache.flink.runtime.blob.BlobCache                  
>>>      - Created BLOB cache storage directory 
>>> /tmp/blobStore-4d4e4cc2-c161-4df1-acea-abda2b28d39e
>>> 
>>> 
>>> Is this a bug?
>>> 
>>> Thanks,
>>> Ali
>> 
>
