[ https://issues.apache.org/jira/browse/ZOOKEEPER-3940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17214265#comment-17214265 ]
Stan Henderson commented on ZOOKEEPER-3940:
-------------------------------------------

[~maoling] I've worked around the issue with the netty-transport-native-epoll-4.1.50.Final-linux-x86_64.jar file by pulling it from https://dl.bintray.com/netty/downloads for our TLS deployment (a rough sketch of that download step follows the config listings below).

I used the example stack.yml from https://hub.docker.com/_/zookeeper to start up a 3-node quorum on my Windows laptop and on my Linux VM. These are not using SSL/TLS, so NIOServerCnxnFactory and ClientCnxnSocketNIO are used rather than NettyServerCnxnFactory and ClientCnxnSocketNetty. In that setup I am unable to reproduce any of the issues reported in this defect.

However, in my other two setups (one with SSL/TLS and one without), each with 3 nodes, I am able to reproduce the issue where a restarted node does not recover unless some other node in the cluster is also restarted. I am not able to reproduce the issue where restarting the leader keeps the quorum from coming back up. I can consistently stop the node with the lowest id and observe that it does not rejoin the quorum (a sketch of one way to check this with the 4lw commands also follows the config listings below).

One difference I see in the ZOO_SERVERS variable between my distributed environments and the standalone one on my laptop (or Linux VM) is the following.

Distributed, all nodes have the following (note that this worked for us in the past with the 3.4.x release):
ZOO_SERVERS=server.1=zoo1:2888:3888 server.2=zoo2:2888:3888 server.3=zoo3:2888:3888

Standalone has the following, where each node uses 0.0.0.0 as the address for its own entry:
zoo1: ZOO_SERVERS=server.1=0.0.0.0:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=zoo3:2888:3888;2181
zoo2: ZOO_SERVERS=server.1=zoo1:2888:3888;2181 server.2=0.0.0.0:2888:3888;2181 server.3=zoo3:2888:3888;2181
zoo3: ZOO_SERVERS=server.1=zoo1:2888:3888;2181 server.2=zoo2:2888:3888;2181 server.3=0.0.0.0:2888:3888;2181

zoo.cfg for laptop/standalone:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=10
syncLimit=5
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
quorumListenOnAllIPs=true
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181
{code}

zoo.cfg for distributed non-SSL:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=30
syncLimit=15
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
quorumListenOnAllIPs=true
serverCnxnFactory=org.apache.zookeeper.server.NIOServerCnxnFactory
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNIO
reconfigEnabled=false
server.4=zoo4:2888:3888:participant;2181
server.5=zoo5:2888:3888:participant;2181
server.6=zoo6:2888:3888:participant;2181
{code}

zoo.cfg for distributed SSL/TLSv1.2:
{code:java}
dataDir=/data
dataLogDir=/datalog
tickTime=2000
initLimit=30
syncLimit=15
maxClientCnxns=60
autopurge.snapRetainCount=10
autopurge.purgeInterval=24
leaderServes=yes
standaloneEnabled=false
admin.enableServer=false
snapshot.trust.empty=true
audit.enable=true
4lw.commands.whitelist=*
sslQuorum=true
quorumListenOnAllIPs=true
portUnification=false
serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
ssl.quorum.keyStore.password=Ap0ll0C3rt
ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
ssl.quorum.trustStore.password=Ap0ll0C3rt
ssl.quorum.protocol=TLSv1.2
ssl.quorum.enabledProtocols=TLSv1.2
ssl.client.enable=true
secureClientPort=2281
client.portUnification=true
clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
ssl.keyStore.password=Ap0ll0C3rt
ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
ssl.trustStore.password=Ap0ll0C3rt
ssl.protocol=TLSv1.2
ssl.enabledProtocols=TLSv1.2
reconfigEnabled=false
server.1=zoo1:2888:3888:participant;2181
server.2=zoo2:2888:3888:participant;2181
server.3=zoo3:2888:3888:participant;2181
{code}
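The native-epoll jar workaround mentioned at the top amounts to dropping the missing artifact onto ZooKeeper's classpath before startup. A minimal sketch follows; the exact download path under dl.bintray.com and the lib directory of the 3.6.2 binary distribution are assumptions here, not something confirmed in this ticket, so adjust both to your environment:
{code:bash}
# Hedged sketch: fetch the missing netty native-epoll jar and place it where
# ZooKeeper picks up its dependency jars. Download path and target directory
# are assumptions; adjust to the artifact location and install path you use.
NETTY_VERSION=4.1.50.Final
JAR=netty-transport-native-epoll-${NETTY_VERSION}-linux-x86_64.jar
wget -O "/apache-zookeeper-3.6.2-bin/lib/${JAR}" \
  "https://dl.bintray.com/netty/downloads/${JAR}"
{code}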
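For reference, here is a minimal sketch of one way to watch per-node health, leader/follower role, and zk_synced_followers while running the restart tests, using the 4-letter-word commands enabled above via 4lw.commands.whitelist=*. It assumes the plaintext client port 2181 is reachable from wherever the script runs; the TLS-only setup on secureClientPort 2281 would need a different check.
{code:bash}
# Poll each node's 4lw interface: "ruok" answers "imok" when the node is
# serving, "srvr" reports Mode (leader/follower), and "mntr" on the leader
# reports zk_synced_followers. Hostnames and port 2181 come from the configs
# above; this plain-TCP check does not cover the TLS client port.
for host in zoo1 zoo2 zoo3; do
  echo "--- ${host} ---"
  echo ruok | nc "${host}" 2181                            # expect "imok"
  echo srvr | nc "${host}" 2181 | grep Mode                # leader / follower
  echo mntr | nc "${host}" 2181 | grep zk_synced_followers # leader only
done
{code}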
> Zookeeper restart of leader causes all zk nodes to not serve requests
> ---------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-3940
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-3940
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: quorum, server
>    Affects Versions: 3.6.2
>         Environment: dataDir=/data
> dataLogDir=/datalog
> tickTime=2000
> initLimit=10
> syncLimit=5
> maxClientCnxns=60
> autopurge.snapRetainCount=10
> autopurge.purgeInterval=24
> leaderServes=yes
> standaloneEnabled=false
> admin.enableServer=false
> snapshot.trust.empty=true
> audit.enable=true
> 4lw.commands.whitelist=*
> sslQuorum=true
> quorumListenOnAllIPs=true
> portUnification=false
> serverCnxnFactory=org.apache.zookeeper.server.NettyServerCnxnFactory
> ssl.quorum.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.quorum.keyStore.password=********
> ssl.quorum.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.quorum.trustStore.password=********
> ssl.quorum.protocol=TLSv1.2
> ssl.quorum.enabledProtocols=TLSv1.2
> ssl.client.enable=true
> secureClientPort=2281
> client.portUnification=true
> clientCnxnSocket=org.apache.zookeeper.ClientCnxnSocketNetty
> ssl.keyStore.location=/apache-zookeeper-3.6.2-bin/java/keystore_zoo1.jks
> ssl.keyStore.password=********
> ssl.trustStore.location=/apache-zookeeper-3.6.2-bin/java/truststore.jks
> ssl.trustStore.password=********
> ssl.protocol=TLSv1.2
> ssl.enabledProtocols=TLSv1.2
> reconfigEnabled=false
> server.1=zoo1:2888:3888:participant;2181
> server.2=zoo2:2888:3888:participant;2181
> server.3=zoo3:2888:3888:participant;2181
>            Reporter: Stan Henderson
>            Priority: Critical
>         Attachments: nossl-zoo.cfg, zk-docker-containers-nossl.log.zip, zk-docker-containers.log.zip, zoo.cfg, zoo1-docker-containers.log, zoo1-docker-containers.log, zoo2-docker-containers.log, zoo3-docker-containers.log
>
>
> We have configured a 3 node zookeeper cluster using the 3.6.2 version in a Docker version 1.12.1 containerized environment. This corresponds to Sep 16 20:03:01 in the attached docker-containers.log files.
> NOTE: We use the Dockerfile from https://hub.docker.com/_/zookeeper for 3.6 branch
> As a part of our testing, we have restarted each of the zookeeper nodes and have seen the following behaviour:
> zoo1, zoo2, and zoo3 healthy (zoo1 is leader)
> We started our testing at approximately Sep 17 13:01:05 in the attached docker-containers.log files.
> 1. (simulate patching zoo2)
> - restart zoo2
> - zk_synced_followers 1
> - zoo1 leader
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - waited 5 minutes with no change
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 1
> - zoo2 unhealthy (This ZooKeeper instance is not currently serving requests)
> - zoo3 healthy
> - restart zoo2
> - no changes
> - restart zoo3
> - zoo1 leader
> - zk_synced_followers 2
> - zoo2 healthy
> - zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes and zoo3 returned to healthy
> 2. simulate patching zoo3
> - zoo1 leader
> - restart zoo3
> - zk_synced_followers 2
> - zoo1, zoo2, and zoo3 healthy
> 3. simulate patching zoo1
> - zoo1 leader
> - restart zoo1
> - zoo1, zoo2, and zoo3 unhealthy (This ZooKeeper instance is not currently serving requests)
> - waited 5 minutes to see if they resolve Sep 17 14:39 - Sep 17 14:44
> - tried restarting in this order: zoo2, zoo3, zoo1 and no change; all still unhealthy (this step was not collected in the log files).
> The third case in the above scenarios is the critical one since we are no longer able to start any of the zk nodes.
>
> [~maoling] this issue may relate to https://issues.apache.org/jira/browse/ZOOKEEPER-3920 which corresponds to the first and second cases above that I am working with [~blb93] on.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)