[ https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17215665#comment-17215665 ]
Stan Henderson commented on ZOOKEEPER-3940:
-------------------------------------------

[~maoling] I tried a different test today. I pulled the 3.6.2 zookeeper image, tagged it, and pushed it to my docker repository without any modifications. I then deployed it to my 3 Linux VMs. I see the same issue: after stopping one of the zoo servers, it does not rejoin until some other server is restarted.

The default zoo.cfg from zookeeper:3.6.2:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=5
syncLimit=2
autopurge.snapRetainCount=3
autopurge.purgeInterval=0
maxClientCnxns=60
standaloneEnabled=true
admin.enableServer=true
server.4=zoo4:2888:3888
server.5=zoo5:2888:3888
server.6=zoo6:2888:3888
{code}

Restart zoo4, and it loops with '*Notification time out: 60000*':
{code:java}
Oct 16 16:28:24 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:28:24,138 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):FastLeaderElection@979] - Notification time out: 60000
{code}

zoo5 and zoo6 report '*configuration error, or a bug*':
{code:java}
Oct 16 16:29:24 zookeeperpoc5 docker[zookeeper_zoo5_1][6790]: 2020-10-16 21:29:24,172 [myid:5] - WARN [ListenerHandler-zoo5/172.17.0.2:3888:QuorumCnxManager@662] - We got a connection request from a server with our own ID. This should be either a configuration error, or a bug.
Oct 16 16:29:24 zookeeperpoc6 docker[zookeeper_zoo6_1][2985]: 2020-10-16 21:29:24,157 [myid:6] - WARN [ListenerHandler-zoo6/172.17.0.2:3888:QuorumCnxManager@662] - We got a connection request from a server with our own ID. This should be either a configuration error, or a bug.
{code}
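For context, here is a minimal sketch (illustrative only, not the actual org.apache.zookeeper.server.quorum.QuorumCnxManager code; the method and enum names are invented) of the pair-wise election-port rule the two messages above point at: for any pair of servers, the peer with the smaller sid drops the connection it initiated and waits for the larger-sid peer to connect back, and an inbound handshake that announces the receiver's own sid only produces the warning.
{code:java}
// Illustrative sketch only -- not the real QuorumCnxManager; names are invented for the example.
public final class ElectionConnectionRule {

    enum Action { ACCEPT, DROP_AND_DIAL_BACK, WARN_AND_DROP }

    /** Initiating side: after learning the remote sid, keep the socket only if we are the larger sid. */
    static boolean keepOutgoingConnection(long mySid, long remoteSid) {
        // Matches "Have smaller server identifier, so dropping the connection: (myId:4 --> sid:5)"
        return mySid > remoteSid;
    }

    /** Accepting side: decide what to do with the sid announced in the handshake. */
    static Action onIncomingConnection(long mySid, long announcedSid) {
        if (announcedSid == mySid) {
            // Matches "We got a connection request from a server with our own ID."
            return Action.WARN_AND_DROP;
        }
        // Smaller announced sid: drop the inbound socket and dial back ourselves instead.
        return announcedSid < mySid ? Action.DROP_AND_DIAL_BACK : Action.ACCEPT;
    }

    public static void main(String[] args) {
        System.out.println(keepOutgoingConnection(4, 5)); // false, as in the zoo4 log below
        System.out.println(onIncomingConnection(5, 5));    // WARN_AND_DROP, as seen on zoo5/zoo6
    }
}
{code}
Both warnings above are logged by listeners bound to the docker bridge address 172.17.0.2, which is why this looks to me like an addressing/identity mix-up between the containers rather than a data problem.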
Restart zoo6
zoo4 recovers and rejoins:
{code:java}
Oct 16 16:34:52 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:52,974 [myid:4] - INFO [ListenerHandler-zoo4/172.17.0.2:3888:QuorumCnxManager$Listener$ListenerHandler@1070] - Received connection request from /9.48.164.42:33134
Oct 16 16:34:52 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:52,994 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x1, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,000 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:4, n.state:LOOKING, n.leader:6, n.round:0x1, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,001 [myid:4] - INFO [QuorumConnectionThread-[myid=4]-16:QuorumCnxManager@513] - Have smaller server identifier, so dropping the connection: (myId:4 --> sid:5)
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,002 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x2, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,004 [myid:4] - INFO [WorkerReceiver[myid=4]:FastLeaderElection$Messenger$WorkerReceiver@389] - Notification: my state:LOOKING; n.sid:6, n.state:LOOKING, n.leader:6, n.round:0x2, n.peerEpoch:0xa87b, n.zxid:0xa87b00000000, message format version:0x2, n.config version:0x0
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,205 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):QuorumPeer@857] - Peer state changed: following
Oct 16 16:34:53 zookeeperpoc4 docker[zookeeper_zoo4_1][5780]: 2020-10-16 21:34:53,207 [myid:4] - INFO [QuorumPeer[myid=4](plain=disabled)(secure=disabled):QuorumPeer@1456] - FOLLOWING
{code}
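The health states referenced in the original description quoted below ("This ZooKeeper instance is not currently serving requests", zk_synced_followers) come from the four-letter-word / mntr interface, so a per-node poll like the following sketch is enough to watch each server while bouncing them. It is a minimal sketch under a few assumptions: the plaintext client port is 2181, and the srvr command is whitelisted via 4lw.commands.whitelist; the hostnames simply mirror the zoo.cfg above.
{code:java}
// Minimal health-poll sketch. Assumptions: client port 2181 and the "srvr"
// four-letter-word command whitelisted (4lw.commands.whitelist).
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public final class ZkSrvrPoll {

    static String fourLetterWord(String host, int port, String cmd) throws Exception {
        try (Socket sock = new Socket(host, port)) {
            OutputStream out = sock.getOutputStream();
            out.write(cmd.getBytes(StandardCharsets.US_ASCII));
            out.flush();
            // The server writes its whole response and then closes the connection.
            InputStream in = sock.getInputStream();
            return new String(in.readAllBytes(), StandardCharsets.US_ASCII);
        }
    }

    public static void main(String[] args) throws Exception {
        for (String host : new String[] {"zoo4", "zoo5", "zoo6"}) {
            try {
                // A healthy node reports "Mode: leader" / "Mode: follower"; a stuck one
                // replies "This ZooKeeper instance is not currently serving requests".
                System.out.println(host + ": " + fourLetterWord(host, 2181, "srvr").trim());
            } catch (Exception e) {
                System.out.println(host + ": unreachable (" + e.getMessage() + ")");
            }
        }
    }
}
{code}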
> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3940
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.2
>         Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
>            Reporter: Stan Henderson
>            Priority: Critical
>         Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip, zk-docker-containers.log.zip, zoo.cfg, zoo.cfg, zoo1-docker-containers.log, zoo1-docker-containers.log, zoo1-follower.log, zoo2-docker-containers.log, zoo2-leader.log, zoo3-docker-containers.log, zoo3-follower.log
>
>
> We have configured a 3 node zookeeper cluster using the 3.6.2 version in a Docker version 1.12.1 containerized environment. This corresponds to Sep 16 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the 3.6 branch.
> As a part of our testing, we have restarted each of the zookeeper nodes and have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached docker-containers.log files.
> 1. (simulate patching zoo2)
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes to see if they resolve Sep 17 14:39 - Sep 17 14:44
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no longer able to start any of the zk nodes.
> [~maoling] this issue may relate to https://issues.apache.org/jira/browse/ZOOKEEPER-3920 which corresponds to the first and second cases above that I am working with [~blb93] on.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)