Hi Stefan,

Yes, we are running in HA mode with a dedicated ZooKeeper cluster. As far as I can see, it looks like a networking issue within the ZooKeeper cluster. 2 out of 5 ZooKeeper servers reported something around the same time:
server1:

2017-01-19 11:52:13,044 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:150)
        at java.net.SocketInputStream.read(SocketInputStream.java:121)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
2017-01-19 11:52:13,045 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:790)
2017-01-19 11:52:13,045 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.27.163.227:51800 which had sessionid 0x159b505820a0009
2017-01-19 11:52:13,046 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.27.163.227:51798 which had sessionid 0x159b505820a0008
2017-01-19 11:52:13,046 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket connection for client /0:0:0:0:0:0:0:1:46891 which had sessionid 0x1537b32bbe100ad
2017-01-19 11:52:13,046 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1007] - Closed socket connection for client /172.27.165.64:50075 which had sessionid 0x159b505820a000d
2017-01-19 11:52:13,046 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@139] - Shutting down
2017-01-19 11:52:13,046 [myid:1] - INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@441] - shutting down

server2:

2017-01-19 11:52:13,061 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 1 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:13,082 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:13,083 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:13,284 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:13,285 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:13,310 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /172.27.163.227:39302
2017-01-19 11:52:13,311 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@861] - Client attempting to renew session 0x159b505820a0009 at /172.27.163.227:39302
2017-01-19 11:52:13,312 [myid:2] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:Learner@108] - Revalidating client: 0x159b505820a0009
2017-01-19 11:52:13,687 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:13,687 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:14,488 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 1 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:14,489 [myid:2] - INFO [WorkerReceiver[myid=2]:FastLeaderElection@597] - Notification: 1 (message format version), 4 (n.leader), 0x4000000cb (n.zxid), 0x4 (n.round), LOOKING (n.state), 4 (n.sid), 0x4 (n.peerEpoch) FOLLOWING (my state)
2017-01-19 11:52:14,719 [myid:2] - INFO [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@617] - Established session 0x159b505820a0009 with negotiated timeout 40000 for client /172.27.163.227:39302

I can’t say for sure whether the data in ZooKeeper was corrupted at that time. I guess Flink is somewhat sensitive to that?

Thanks

Andrew

> On 19 Jan 2017, at 14:19, Stefan Richter <s.rich...@data-artisans.com> wrote:
>
> Hi,
>
> I think depending on your configuration of Flink (are you using high availability mode?) and the type of ZK glitches we are talking about, it can very well be that some of Flink’s meta data in ZK got corrupted and the system can no longer operate. But for a deeper analysis, we would need more details about your configuration and the ZK problem.
>
> Best,
> Stefan
>
>> Am 19.01.2017 um 13:16 schrieb Andrew Ge Wu <andrew.ge...@eniro.com>:
>>
>> Hi,
>>
>> We recently had several ZooKeeper glitches; when that happens, it seems to take the Flink cluster down with it.
>>
>> We are running on 1.03
>>
>> It started like this:
>>
>> 2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0008, likely server has closed socket, closing socket connection and attempting reconnect
>> 2017-01-19 11:52:13,047 INFO  org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x159b505820a0009, likely server has closed socket, closing socket connection and attempting reconnect
>> 2017-01-19 11:52:13,151 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>> 2017-01-19 11:52:13,151 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
>> 2017-01-19 11:52:13,166 WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDED. Changes to the submitted job graphs are not monitored (temporarily).
>> 2017-01-19 11:52:13,169 INFO  org.apache.flink.runtime.jobmanager.JobManager - JobManager akka://flink/user/jobmanager#1976923422 was revoked leadership.
>> 2017-01-19 11:52:13,179 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph - op1 -> (Map, Map -> op2) (18/24) (5336dd375eb12616c5a0e93c84f93465) switched from RUNNING to FAILED
>>
>> Then our web UI stopped serving and the job manager got stuck in an exception loop like this:
>>
>> 2017-01-19 13:05:13,521 WARN  org.apache.flink.runtime.jobmanager.JobManager - Discard message LeaderSessionMessage(0318ecf5-7069-41b2-a793-2f24bdbaa287,01/19/2017 13:05:13 Job execution switched to status RESTARTING.) because the expected leader session ID None did not equal the received leader session ID Some(0318ecf5-7069-41b2-a793-2f24bdbaa287).
>> 2017-01-19 13:05:13,521 INFO  org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy - Delaying retry of job execution for xxxxx ms …
>>
>> Is it because we misconfigured anything, or is this expected behavior? When this happens we have to restart the cluster to bring it back.
>>
>> Thanks!
>>
>> Andrew
>> --
>> Confidentiality Notice: This e-mail transmission may contain confidential or legally privileged information that is intended only for the individual or entity named in the e-mail address. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution, or reliance upon the contents of this e-mail is strictly prohibited and may be unlawful. If you have received this e-mail in error, please notify the sender immediately by return e-mail and delete all copies of this message.