Thank you for the input. The way I originally set up the Replicated LevelDB servers was to create 2 sets of 3 hosts:
1a 1b 1c
2a 2b 2c

So within the "1" and "2" clusters each had one master and two slaves, and then I networked the "1" and "2" clusters together. My intent was to create highly reliable (by way of having two independent masters), fault-tolerant (by way of having 2 slaves per master) brokers.

Because I no longer trust the ZooKeeper-based logic that Hiram wrote, I started exploring the NFS-based approach. I was originally reluctant to even consider NFS, given my past experience with NFS issues on both Solaris and Linux machines.

When I started testing the NFS-based locking, I decided to run it across a slow network, just so I had an idea of what might happen if something caused the network or the NFS server to slow down (e.g., I've noticed NetApps can have a sharp drop in performance if they have to rebuild a RAID disk).

Because of the way Vagrant works, it's proving difficult to debug the 2-active-broker scenario, since the default Vagrant network configuration uses the same NATed IP address per broker. I had been hoping to just run a packet capture and examine the sequence of LOCK calls between the servers.

I've tabled that for now and have set up a local NFS share, and so far that does appear to be solid. I can randomly start and stop the master, and I've not seen any instance where multiple servers become the master at the same time.

I'll try to find time to get back to testing the original scenario, but for now I'll assume that as long as it's on a good LAN, it's unlikely we'll see what I was seeing over a slow network.

Jim

On Tue, Mar 1, 2016 at 12:25 PM artnaseef <a...@artnaseef.com> wrote:

> So 15 seconds sounds really low, although I'm not sure of all the various
> timeout settings in NFS.
>
> Specifically here, the timeout of concern is the release of a lock held by a
> client.
> The higher the timeout, the lower the likelihood of two clients obtaining
> the same lock, but the slower failover becomes because of the added time to
> detect that the lock-holder is out-of-service. On the other hand, the lower
> the timeout, the greater the chances of two clients getting the same lock,
> but the faster failover becomes.
>
> I tend toward wanting reliability first, and fast recovery time second,
> although that really depends on the use-case. However, for use-cases that
> require very fast recovery, I would look to non-persistent messaging,
> eliminating the issue of storage locking, but then requiring an alternate
> approach to handling potential message loss.
>
> The timeout here is one that must be handled by the server - whether the
> server allows different clients to request different timeout values, I don't
> know - but it's the NFS server that must apply it, because it needs to take
> effect when the client drops off the network unexpectedly.
>
> Hope this helps.
>
> BTW - if the two-active-broker scenario happens very commonly and the
> network appears to be performing well, then I would look to the possibility
> of the lock file getting removed somehow during startup.
>
> --
> View this message in context:
> http://activemq.2283324.n4.nabble.com/question-for-users-of-NFS-master-slave-setups-tp4708204p4708776.html
> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
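
To make the timeout trade-off in the quoted message a bit more concrete, here is a toy model in Python. It is only a sketch under stated assumptions: it does not reflect how the NFS server or ActiveMQ actually implement locking. The names `LeaseLock`, `split_brain`, and `failover_time` are invented for illustration, and the 60-second timeout, 20-second delay, and 10-second poll interval are arbitrary values chosen for the example (only the 15-second figure comes from the thread).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """Toy server-side lock: the holder must renew within `timeout` seconds.

    Hypothetical model for illustration only, not the NFS lock protocol.
    """
    timeout: float
    holder: Optional[str] = None
    last_renewal: float = 0.0

    def _expire(self, now: float) -> None:
        # The server releases the lock once the holder has been silent too long.
        if self.holder is not None and now - self.last_renewal > self.timeout:
            self.holder = None

    def acquire(self, client: str, now: float) -> bool:
        self._expire(now)
        if self.holder is None:
            self.holder, self.last_renewal = client, now
            return True
        return self.holder == client

    def renew(self, client: str, now: float) -> bool:
        self._expire(now)
        if self.holder == client:
            self.last_renewal = now
            return True
        return False  # lease was lost; the client should stop acting as master


def split_brain(timeout: float, renewal_delay: float) -> bool:
    """True if a slave gets the lock while a delayed master still thinks it holds it."""
    lock = LeaseLock(timeout)
    lock.acquire("master", now=0.0)
    # The master's next renewal is stuck on a slow network for `renewal_delay`
    # seconds; the slave tries to take the lock just before it would arrive.
    return lock.acquire("slave", now=renewal_delay)


def failover_time(timeout: float, poll_interval: float) -> float:
    """Seconds from a master crash at t=0 until a polling slave takes over."""
    lock = LeaseLock(timeout)
    lock.acquire("master", now=0.0)  # last sign of life from the master
    t = poll_interval
    while not lock.acquire("slave", now=t):
        t += poll_interval
    return t


if __name__ == "__main__":
    for timeout in (15.0, 60.0):
        print(timeout, split_brain(timeout, 20.0), failover_time(timeout, 10.0))
```

In this model, a 20-second stall is enough to produce two masters with a 15-second timeout but not with a 60-second one, while the slave needs 20 seconds to take over in the first case and 70 seconds in the second. That is the same tension described above: a longer timeout is safer but makes failover slower.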