Stephen submitted https://issues.apache.org/jira/browse/AMQ-6095 to capture this bug.
On Fri, Dec 18, 2015 at 8:13 AM, glstephen <glstep...@gmail.com> wrote:
> I have encountered an issue with ActiveMQ where the entire cluster fails
> when the master ZooKeeper node goes offline.
>
> We have a 3-node ActiveMQ cluster set up in our development environment.
> Each node has ActiveMQ 5.12.0 and ZooKeeper 3.4.6 (note: we have done
> some testing with ZooKeeper 3.4.7, but this failed to resolve the issue;
> time constraints have so far prevented us from testing ActiveMQ 5.13).
>
> What we have found is that when we stop the master ZooKeeper process (via
> the "end process tree" command in Task Manager), the remaining two
> ZooKeeper nodes continue to function as normal. Sometimes the ActiveMQ
> cluster is able to handle this, but sometimes it is not.
>
> When the cluster fails, we typically see this in the ActiveMQ log:
>
> 2015-12-18 09:08:45,157 | WARN | Too many cluster members are connected.
> Expected at most 3 members but there are 4 connected. |
> org.apache.activemq.leveldb.replicated.MasterElector |
> WrapperSimpleAppMain-EventThread
> ...
> 2015-12-18 09:27:09,722 | WARN | Session 0x351b43b4a560016 for server
> null, unexpected error, closing socket connection and attempting
> reconnect | org.apache.zookeeper.ClientCnxn |
> WrapperSimpleAppMain-SendThread(192.168.0.10:2181)
> java.net.ConnectException: Connection refused: no further information
>     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)[:1.7.0_79]
>     at sun.nio.ch.SocketChannelImpl.finishConnect(Unknown Source)[:1.7.0_79]
>     at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)[zookeeper-3.4.6.jar:3.4.6-1569965]
>     at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)[zookeeper-3.4.6.jar:3.4.6-1569965]
>
> We were immediately concerned by the fact that (A) ActiveMQ seems to
> think there are four members in the cluster when it is only configured
> with three, and (B) when the exception is raised, the server appears to
> be null. We then increased ActiveMQ's logging level to DEBUG in order to
> display the list of members:
>
> 2015-12-18 09:33:04,236 | DEBUG | ZooKeeper group changed: Map(localhost ->
> ListBuffer((0000000156,{"id":"localhost","container":null,"address":null,"position":-1,"weight":5,"elected":null}),
> (0000000157,{"id":"localhost","container":null,"address":null,"position":-1,"weight":1,"elected":null}),
> (0000000158,{"id":"localhost","container":null,"address":"tcp://192.168.0.11:61619","position":-1,"weight":10,"elected":null}),
> (0000000159,{"id":"localhost","container":null,"address":null,"position":-1,"weight":10,"elected":null})))
> | org.apache.activemq.leveldb.replicated.MasterElector | ActiveMQ
> BrokerService[localhost] Task-14
>
> Can anyone suggest why this may be happening and/or suggest a way to
> resolve this?
> Our configurations are shown below:
>
> *ZooKeeper:*
>
> tickTime=2000
> dataDir=C:\\zookeeper-3.4.7\\data
> clientPort=2181
> initLimit=5
> syncLimit=2
> server.1=192.168.0.10:2888:3888
> server.2=192.168.0.11:2888:3888
> server.3=192.168.0.12:2888:3888
>
> *ActiveMQ (server.1):*
>
> <persistenceAdapter>
>   <replicatedLevelDB
>       directory="activemq-data"
>       replicas="3"
>       bind="tcp://0.0.0.0:61619"
>       zkAddress="192.168.0.11:2181,192.168.0.10:2181,192.168.0.12:2181"
>       zkPath="/activemq/leveldb-stores"
>       hostname="192.168.0.10"
>       weight="5"/>
>   <!-- server.2 has a weight of 10, server.3 has a weight of 1 -->
> </persistenceAdapter>
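
A couple of things worth checking while AMQ-6095 is open. The DEBUG output
above shows four group entries for a three-replica cluster, which suggests a
stale ephemeral registration left over from a broker's previous ZooKeeper
session. As a sanity check you can inspect the group directly with the
zkCli.sh shell that ships with ZooKeeper. A minimal sketch, assuming the
zkPath from the config above; take the exact child names from your own "ls"
output rather than from this example:

    # connect to one of the surviving ZooKeeper nodes
    bin/zkCli.sh -server 192.168.0.11:2181

    # list what is registered under the configured zkPath; with
    # replicas="3" there should be at most three live member entries
    ls /activemq/leveldb-stores

    # dump one of the sequential entries (e.g. the 0000000156 seen in
    # the DEBUG log) to see which broker registered it
    get /activemq/leveldb-stores/0000000156

If a fourth entry lingers after its owning session is gone, that matches the
"Expected at most 3 members but there are 4 connected" warning from the
MasterElector.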
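
Separately, the configuration itself looks sane, but the replicated LevelDB
store uses a very short ZooKeeper session timeout by default (2s). A hard
kill of the ZooKeeper leader forces the surviving ensemble through its own
leader election, and if that outlasts the brokers' sessions they drop and
re-register, which is one way to end up with an extra group entry. You could
try raising the timeout. A hedged sketch (the "10s" value is an arbitrary
example, and be aware that in some releases and in the docs the attribute is
spelled "zkSessionTmeout" due to a long-standing typo):

    <persistenceAdapter>
      <replicatedLevelDB
          directory="activemq-data"
          replicas="3"
          bind="tcp://0.0.0.0:61619"
          zkAddress="192.168.0.11:2181,192.168.0.10:2181,192.168.0.12:2181"
          zkPath="/activemq/leveldb-stores"
          hostname="192.168.0.10"
          weight="5"
          zkSessionTimeout="10s"/>
    </persistenceAdapter>

This only reduces how often the race is hit; the stale-member handling in
MasterElector is what AMQ-6095 tracks.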