[ https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17202294#comment-17202294 ]
Stan Henderson commented on ZOOKEEPER-3940:
-------------------------------------------

[~maoling] I redeployed without SSL. I restarted the leader, zoo2, and zoo2 never returned to a good state. I attached the nonssl logs and zoo.cfg:

dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
quorumListenOnAllIPs=true
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181

FYI [~blb93]

> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
> Key: ZOOKEEPER-3940
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
> Project: ZooKeeper
> Issue Type: Bug
> Components: quorum, server
> Affects Versions: 3.6.2
> Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
> Reporter: Stan Henderson
> Priority: Critical
> Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip, zk-docker-containers.log.zip, zoo.cfg
>
>
> We have configured a 3-node ZooKeeper cluster using version 3.6.2 in a Docker version 1.12.1 containerized environment. This corresponds to Sep 16 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for the 3.6 branch.
> As part of our testing, we restarted each of the ZooKeeper nodes and saw the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached docker-containers.log files.
> 1. (simulate patching zoo2)
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes (Sep 17 14:39 to Sep 17 14:44) to see if they would recover
> - tried restarting in this order: zoo2, zoo3, zoo1; no change, all still unhealthy (this step was not collected in the log files)
>
> The third case in the above scenarios is the critical one, since we are no longer able to start any of the zk nodes.
>
> [~maoling] this issue may be related to https://issues.apache.org/jira/browse/ZOOKEEPER-3920, which corresponds to the first and second cases above and which I am working on with [~blb93].

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
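The health checks in the test steps (the `zk_synced_followers` value and the "This ZooKeeper instance is not currently serving requests" reply) appear to come from the `mntr` four-letter-word command, which `4lw.commands.whitelist=*` permits. Below is a minimal Python sketch for repeating those checks. It assumes the hostnames zoo1/zoo2/zoo3 and the plain client port 2181 from the `server.N` lines of the nossl config (no TLS), and that `mntr` replies with one tab-separated key/value pair per line. The helper names (`send_4lw`, `parse_mntr`, `node_status`, `check_ensemble`) are hypothetical and not part of ZooKeeper.

```python
import socket

# Reply text a node sends to four-letter words while it is not serving.
NOT_SERVING = "This ZooKeeper instance is not currently serving requests"


def send_4lw(host: str, port: int, cmd: str, timeout: float = 5.0) -> str:
    """Send a four-letter-word command and return the full text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(reply: str) -> dict:
    """Parse `mntr` output: one 'key<TAB>value' pair per line."""
    stats = {}
    for line in reply.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def node_status(reply: str) -> dict:
    """Classify one node from its `mntr` reply."""
    if NOT_SERVING in reply:
        return {"healthy": False}
    stats = parse_mntr(reply)
    followers = stats.get("zk_synced_followers")  # only reported by the leader
    return {
        "healthy": bool(stats),
        "state": stats.get("zk_server_state"),
        "synced_followers": int(followers) if followers is not None else None,
    }


def check_ensemble(hosts=("zoo1", "zoo2", "zoo3"), port=2181) -> dict:
    """Query every node; unreachable nodes are reported as unhealthy."""
    results = {}
    for host in hosts:
        try:
            results[host] = node_status(send_4lw(host, port, "mntr"))
        except OSError as exc:
            results[host] = {"healthy": False, "error": str(exc)}
    return results
```

Calling `check_ensemble()` from a host that can resolve zoo1..zoo3 returns one status dict per node, which could be logged before and after each restart in the scenarios above.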