[ https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201993#comment-17201993 ]
maoling commented on ZOOKEEPER-3940:
------------------------------------

[~stanhend] [~blb93]

1. Are Cases 1 and 2 really an issue, given that the cluster eventually returns to a healthy state? Or is the concern that the 5 minutes it takes is too long?

2. I'm trying to reproduce Case 3. My environment:
My local Mac
Docker version: Docker version 19.03.8, build afacb8b
ZK version: 3.6.2

My _*zoo.cfg*_ differs slightly from yours: because generating the secret keys is troublesome, I commented out the secure connection and SSL quorum related properties.
{code:java}
#sslQuorum=true
#ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
#ssl.quorum.keyStore.password=********
#ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
#ssl.quorum.trustStore.password=********
#ssl.quorum.protocol=TLSv1.2
#ssl.quorum.enabledProtocols=TLSv1.2
#ssl.client.enable=true
#secureClientPort=2281
#ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
#ssl.keyStore.password=********
#ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
#ssl.trustStore.password=********
#ssl.protocol=TLSv1.2
#ssl.enabledProtocols=TLSv1.2
{code}
With this configuration I cannot reproduce the issue using the approach from Case 3 (just restarting the leader).
Could you please disable the above properties and re-test?
If you cannot reproduce it either with these properties disabled, we can narrow the scope: is this issue caused by the secure connection or the quorum SSL feature?

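For anyone repeating the re-test, here is a minimal sketch of how the properties listed above could be commented out of a zoo.cfg automatically. The function name `disable_tls_properties` and the matching rule (keys `sslQuorum`, `secureClientPort`, and anything starting with `ssl.`) are assumptions chosen to match the list in this comment; they are not part of ZooKeeper itself, and other settings in the reported config (for example `portUnification`) are deliberately left untouched.

```python
def disable_tls_properties(cfg_text: str) -> str:
    """Return cfg_text with the TLS-related properties commented out.

    Assumption: "TLS-related" means the keys listed in the comment above,
    i.e. sslQuorum, secureClientPort, and every key starting with "ssl.".
    """
    exact_keys = {"sslQuorum", "secureClientPort"}
    out = []
    for line in cfg_text.splitlines():
        stripped = line.strip()
        # The property key is everything before the first "=".
        key = stripped.split("=", 1)[0].strip()
        is_tls = key in exact_keys or key.startswith("ssl.")
        if stripped and not stripped.startswith("#") and is_tls:
            out.append("#" + line)  # disable the property
        else:
            out.append(line)  # keep everything else unchanged
    return "\n".join(out)


if __name__ == "__main__":
    sample = "\n".join([
        "tickTime=2000",
        "sslQuorum=true",
        "ssl.quorum.protocol=TLSv1.2",
        "secureClientPort=2281",
        "server.1=zoo1:2888:3888:participant;2181",
    ])
    print(disable_tls_properties(sample))
```

Lines that are already commented out are left as they are, so running the sketch twice on the same file does not add a second `#`.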
> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3940
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>   Affects Versions: 3.6.2
>         Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
>            Reporter: Stan Henderson
>            Priority: Critical
>         Attachments: zk-docker-containers.log.zip, zoo.cfg
>
>
> We have configured a 3 node zookeeper cluster using the 3.6.2 version in a
> Docker version 1.12.1 containerized environment. This corresponds to Sep 16
> 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for 3.6
> branch
> As a part of our testing, we have restarted each of the zookeeper nodes and
> have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached
> docker-containers.log files.
> 1. (simulate patching zoo2)
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently
> serving requests)
> - waited 5 minutes to see if they resolve Sep 17 14:39 - Sep 17 14:44
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still
> unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no
> longer able to start any of the zk nodes.
>
> [~maoling] this issue may relate to
> https://issues.apache.org/jira/browse/ZOOKEEPER-3920 which corresponds to the
> first and second cases above that I am working with [~blb93] on.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)