Hi folks - just wondering if anyone else has tested this and found similar problems.
I've been testing ActiveMQ in a shared-storage master/slave configuration, using an NFSv4 server for the shared storage. I've tried this both with a standalone NFS server and with Amazon's EFS. My tests look into what happens when the network is unreliable - specifically, when for some reason the master ActiveMQ broker can't communicate with the NFS server.

What I've been seeing, in a nutshell, is the following:

- At startup, the Master gets exclusive access to the NFS lock file, and the Slave doesn't, so it loops waiting for the lock, as expected.

- When I cut the Master off from the NFS server, the NFS server eventually times out the lock, and the Slave acquires it and starts up. It gets a pile of journal errors, but it does eventually sort things out and start, and clients using the failover: protocol start sending messages to the Slave.

- Eventually, the Master notices that it is broken and tries to shut down. This takes a long time - I get a lot of warnings like:

  [KeepAlive Timer] INFO TransportConnection - The connection to 'tcp://10.0.12.209:42150' is taking a long time to shutdown.

  ... I'm guessing it's trying to gracefully shut down a listener or something? Anyway, eventually it gets a DB failure and dies.

The problem, though, is what happens next. The Master restarts itself - as it should - and in the meantime I've repaired the connection to the NFS server. The restarted Master should now try to grab the exclusive lock, fail, and become a slave instead. However, this generally doesn't seem to happen: the Master restarts with no lock errors, and I end up with two brokers that both think they own the same NFS-based database. Not a good situation. (Once, I did see the Master block waiting for the lock, but I haven't been able to reproduce that behaviour.)

Has anyone else seen this? None of this would affect a situation where the master broker crashed or was restarted - that should be fine - but it seems quite unreliable when a network split occurs, at least from our testing so far.

Note that this may be related to a problem with Java and exclusive file locks, which I raised the other day on Stack Overflow:

http://stackoverflow.com/questions/38397559/is-there-any-way-to-tell-if-a-java-exclusive-filelock-on-an-nfs-share-is-really

The TL;DR is that the FileLock.isValid() check used in org.apache.activemq.util.LockFile.keepAlive() is pointless - it doesn't actually check that the lock is still valid, only that no other thread in the same JVM has released it (see the first sketch in the P.S. below). However, the LockFile.keepAlive() code:

public boolean keepAlive() {
    return lock != null && lock.isValid() && file.exists();
}

... should still fail, as file.exists() should return false once the NFS server has gone away. (Though it's possible that call will block rather than fail...)

- Korny

--
Kornelis Sietsma
korny at my surname dot com
http://korny.info
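P.S. Here's a minimal sketch of the isValid() problem (the /mnt/nfs/lock path is just a placeholder for wherever your shared store is mounted). Per the Javadoc, a FileLock stays "valid" until release() is called or its channel is closed - it's purely local bookkeeping, with no round-trip to the NFS server - so this keeps printing isValid=true even after the server has timed the lock out and handed it to another client:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockProbe {
    public static void main(String[] args) throws Exception {
        File file = new File("/mnt/nfs/lock");   // placeholder NFS path
        RandomAccessFile raf = new RandomAccessFile(file, "rw");
        FileChannel channel = raf.getChannel();
        FileLock lock = channel.tryLock();
        if (lock == null) {
            System.out.println("lock held by another process");
            return;
        }
        // isValid() only tracks whether release()/close() has happened in
        // this JVM; it never asks the NFS server whether we still hold the
        // lock. Cut the network and watch it happily keep reporting true.
        while (true) {
            System.out.println("isValid=" + lock.isValid()
                    + " exists=" + file.exists());
            Thread.sleep(5000);
        }
    }
}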
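One workaround I'm considering (untested - this is a sketch of what LockFile would need, not actual ActiveMQ code): make the keep-alive check force a real round-trip to the server by writing into the locked file and syncing it, instead of trusting isValid():

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical stronger keep-alive - 'lock' and 'raf' stand in for the
// FileLock and RandomAccessFile that LockFile already holds.
class LockKeepAlive {
    static boolean keepAlive(FileLock lock, RandomAccessFile raf) {
        if (lock == null || !lock.isValid()) {
            return false;
        }
        try {
            raf.seek(0);
            raf.write('k');                  // scribble a byte in the file we own
            raf.getChannel().force(true);    // force the write out to the server
            return true;
        } catch (IOException e) {
            // an I/O error here means the mount (and so the lock) is suspect
            return false;
        }
    }
}

The same caveat as file.exists() applies, though: on a hard NFS mount the write may block indefinitely rather than fail, so this would probably need to run on a timer thread that treats "no answer within N seconds" as "lock lost".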