Hi folks - just wondering if anyone else has tested this and found similar problems.
I've been testing ActiveMQ in a shared-storage master/slave configuration, using an NFSv4 server for the shared storage. I've tried this both with a standalone NFS server and with Amazon's EFS. My tests look into what happens when the network is unreliable - specifically, when for some reason the master ActiveMQ broker can't communicate with the NFS server.

What I've been seeing, in a nutshell, is the following:

- At startup, the Master gets exclusive access to the NFS lock file, and the Slave doesn't, so it loops waiting for the lock, as expected.

- When I cut the Master off from the NFS server, the NFS server eventually times out the lock, and the Slave acquires it and starts up. It gets a pile of journal errors, but it does eventually sort things out and start, and clients using the failover: protocol start sending messages to the Slave.

- Eventually, the Master notices that it is broken and tries to shut down. This takes a long time - I get a lot of warnings like:

  [KeepAlive Timer] INFO TransportConnection - The connection to 'tcp://10.0.12.209:42150' is taking a long time to shutdown.

  ... I'm guessing it's trying to gracefully shut down a listener or something? Anyway, eventually it gets a DB failure and dies.

The problem, though, is what happens next. The Master restarts itself - as it should - and in the meantime I've repaired the connection to the NFS server. The restarted Master should now try to grab the exclusive lock, fail, and become a slave instead. However, this generally doesn't seem to happen: the Master restarts with no lock errors, and I end up with two brokers that both think they own the same NFS-based database. Not a good situation. (Once, I did see the Master block waiting for the lock, but I haven't been able to reproduce that behaviour.)

Has anyone else seen this? None of this would affect a situation where the master broker crashed or was restarted - that should be fine - but it seems quite unreliable when a network split occurs, at least from our testing so far.

Note that this may be related to a problem with Java and exclusive file locks, which I raised the other day on Stack Overflow:

http://stackoverflow.com/questions/38397559/is-there-any-way-to-tell-if-a-java-exclusive-filelock-on-an-nfs-share-is-really

The TL;DR is that the FileLock.isValid() check used in org.apache.activemq.util.LockFile.keepAlive() is pointless - it doesn't actually check that the lock is still valid, only that no other thread in the same JVM has released it (see the first sketch in the P.S. below). However, the LockFile.keepAlive() code:

public boolean keepAlive() {
    return lock != null && lock.isValid() && file.exists();
}

... should still fail, as file.exists() should return false once the NFS server has gone away. (Though it's possible that call will block rather than fail...)

- Korny

--
Kornelis Sietsma
korny at my surname dot com
http://korny.info
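P.S. Here's a minimal sketch of the isValid() problem (the /mnt/nfs/lock path is just a placeholder for wherever your shared store is mounted). Per the Javadoc, a FileLock stays "valid" until release() is called or its channel is closed - it's purely local bookkeeping, with no round-trip to the NFS server - so this keeps printing isValid=true even after the server has timed the lock out and handed it to another client:

import java.io.File;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

public class LockProbe {
    public static void main(String[] args) throws Exception {
        File file = new File("/mnt/nfs/lock");   // placeholder NFS path
        RandomAccessFile raf = new RandomAccessFile(file, "rw");
        FileChannel channel = raf.getChannel();
        FileLock lock = channel.tryLock();
        if (lock == null) {
            System.out.println("lock held by another process");
            return;
        }
        // isValid() only tracks whether release()/close() has happened in
        // this JVM; it never asks the NFS server whether we still hold the
        // lock. Cut the network and watch it happily keep reporting true.
        while (true) {
            System.out.println("isValid=" + lock.isValid()
                    + " exists=" + file.exists());
            Thread.sleep(5000);
        }
    }
}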
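One workaround I'm considering (untested - this is a sketch of what LockFile would need, not actual ActiveMQ code): make the keep-alive check force a real round-trip to the server by writing into the locked file and syncing it, instead of trusting isValid():

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileLock;

// Hypothetical stronger keep-alive - 'lock' and 'raf' stand in for the
// FileLock and RandomAccessFile that LockFile already holds.
class LockKeepAlive {
    static boolean keepAlive(FileLock lock, RandomAccessFile raf) {
        if (lock == null || !lock.isValid()) {
            return false;
        }
        try {
            raf.seek(0);
            raf.write('k');                  // scribble a byte in the file we own
            raf.getChannel().force(true);    // force the write out to the server
            return true;
        } catch (IOException e) {
            // an I/O error here means the mount (and so the lock) is suspect
            return false;
        }
    }
}

The same caveat as file.exists() applies, though: on a hard NFS mount the write may block indefinitely rather than fail, so this would probably need to run on a timer thread that treats "no answer within N seconds" as "lock lost".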