[ https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214265#comment-17214265 ]
Stan Henderson commented on ZOOKEEPER-3940:
-------------------------------------------

[~maoling] I've worked around the issue with the netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar file by pulling it from https://dl.bintray.com/netty/downloads for our TLS deployment (a rough sketch of that download step follows the config listings below).

I used the example stack.yml from https://hub.docker.com/_/zookeeper to start up a 3-node quorum on my Windows laptop and on my Linux VM. These are not using SSL/TLS, so NIOServerCnxnFactory and ClientCnxnSocketNIO are used rather than NettyServerCnxnFactory and ClientCnxnSocketNetty. In that setup I am unable to reproduce any of the issues reported in this defect.

However, in my other two setups (one with SSL/TLS and one without), each with 3 nodes, I am able to reproduce the issue where a restarted node does not recover unless some other node in the cluster is also restarted. I am not able to reproduce the issue where restarting the leader keeps the quorum from coming back up. I can consistently stop the node with the lowest id and observe that it does not rejoin the quorum (a sketch of one way to check this with the 4lw commands also follows the config listings below).

One difference I see in the ZOO_SERVERS variable between my distributed environments and the standalone one on my laptop (or Linux VM) is the following.

Distributed, all nodes have the following (note that this worked for us in the past with the 3.4.x release):
ZOO_SERVERS=server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888

Standalone has the following, where each node uses 0.0.0.0 as the address for its own entry:
zoo1: ZOO_SERVERS=server.1=0.0.0.0:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
zoo2: ZOO_SERVERS=server.1=zoo1:2888:3888;2181 server.2=0.0.0.0:2888:3888;2181 server.3=zoo3:2888:3888;2181
zoo3: ZOO_SERVERS=server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181

zoo.cfg for laptop/standalone:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
quorumListenOnAllIPs=true
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181
{code}

zoo.cfg for distributed non-SSL:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=30
syncLimit=15
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
quorumListenOnAllIPs=true
serverCnxnFactory=org.apache.zookeeper.server.NIOServerCnxnFactory
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNIO
reconfigEnabled=false
server.4=zoo4:2888:3888:participant;2181
server.5=zoo5:2888:3888:participant;2181
server.6=zoo6:2888:3888:participant;2181
{code}

zoo.cfg for distributed SSL/TLSv1.2:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=30
syncLimit=15
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
sslQuorum=true
quorumListenOnAllIPs=true
portUnification=false
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
ssl.quorum.keyStore.password=Ap0ll0C3rt
ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
ssl.quorum.trustStore.password=Ap0ll0C3rt
ssl.quorum.protocol=TLSv1.2
ssl.quorum.enabledProtocols=TLSv1.2
ssl.client.enable=true
secureClientPort=2281
client.portUnification=true
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
ssl.keyStore.password=Ap0ll0C3rt
ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
ssl.trustStore.password=Ap0ll0C3rt
ssl.protocol=TLSv1.2
ssl.enabledProtocols=TLSv1.2
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181
{code}
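The native-epoll jar workaround mentioned at the top amounts to dropping the missing artifact onto ZooKeeper's classpath before startup. A minimal sketch follows; the exact download path under dl.bintray.com and the lib directory of the 3.6.2 binary distribution are assumptions here, not something confirmed in this ticket, so adjust both to your environment:
{code:bash}
# Hedged sketch: fetch the missing netty native-epoll jar and place it where
# ZooKeeper picks up its dependency jars. Download path and target directory
# are assumptions; adjust to the artifact location and install path you use.
NETTY_VERSION=4.1.50.Final
JAR=netty-transport-native-epoll-${NETTY_VERSION}-linux-x86_64.jar
wget -O "/apache-zookeeper-3.6.2-bin/lib/${JAR}" \
  "https://dl.bintray.com/netty/downloads/${JAR}"
{code}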
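For reference, here is a minimal sketch of one way to watch per-node health, leader/follower role, and zk_synced_followers while running the restart tests, using the 4-letter-word commands enabled above via 4lw.commands.whitelist=*. It assumes the plaintext client port 2181 is reachable from wherever the script runs; the TLS-only setup on secureClientPort 2281 would need a different check.
{code:bash}
# Poll each node's 4lw interface: "ruok" answers "imok" when the node is
# serving, "srvr" reports Mode (leader/follower), and "mntr" on the leader
# reports zk_synced_followers. Hostnames and port 2181 come from the configs
# above; this plain-TCP check does not cover the TLS client port.
for host in zoo1 zoo2 zoo3; do
  echo "--- ${host} ---"
  echo ruok | nc "${host}" 2181                            # expect "imok"
  echo srvr | nc "${host}" 2181 | grep Mode                # leader / follower
  echo mntr | nc "${host}" 2181 | grep zk_synced_followers # leader only
done
{code}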
> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3940
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.2
>         Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
>            Reporter: Stan Henderson
>            Priority: Critical
>         Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip, zk-docker-containers.log.zip, zoo.cfg, zoo1-docker-containers.log, zoo1-docker-containers.log, zoo2-docker-containers.log, zoo3-docker-containers.log
>
>
> We have configured a 3 node zookeeper cluster using the 3.6.2 version in a Docker version 1.12.1 containerized environment. This corresponds to Sep 16 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for 3.6 branch
> As a part of our testing, we have restarted each of the zookeeper nodes and have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached docker-containers.log files.
> 1. (simulate patching zoo2)
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes to see if they resolve Sep 17 14:39 - Sep 17 14:44
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no longer able to start any of the zk nodes.
>
> [~maoling] this issue may relate to https://issues.apache.org/jira/browse/ZOOKEEPER-3920 which corresponds to the first and second cases above that I am working with [~blb93] on.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)