Cluster failure after zookeeper glitch.

Andrew Ge Wu Thu, 19 Jan 2017 04:17:30 -0800

Hi,


We recently had several ZooKeeper glitches, and each time it seems to take the 
Flink cluster down with it.

We are running Flink 1.0.3.

It started like this:


2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn  - Unable to read additional data from server sessionid 0x159b505820a0008, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn  - Unable to read additional data from server sessionid 0x159b505820a0009, likely server has closed socket, closing socket connection and attempting reconnect
2017-01-19 11:52:13,151 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED
2017-01-19 11:52:13,151 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager  - State change: SUSPENDED
2017-01-19 11:52:13,166 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
2017-01-19 11:52:13,169 INFO  org.apache.flink.runtime.jobmanager.JobManager  - JobManager akka://flink/user/jobmanager#1976923422 was revoked leadership.
2017-01-19 11:52:13,179 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph  - op1 -> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from RUNNING to FAILED



Then our web UI stopped serving and the JobManager got stuck in an exception 
loop like this:
2017-01-19 13:05:13,521 WARN  org.apache.flink.runtime.jobmanager.JobManager  - Discard message LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 13:05:13  Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
2017-01-19 13:05:13,521 INFO  org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy  - Delaying retry of job execution for xxxxx ms …


Is this because we misconfigured something, or is this expected behavior? When 
this happens we have to restart the cluster to bring it back.


Thanks!


Andrew
