Hi Tim

First we had hard NFS mounts (which seems to be the common recommendation for NFS). When the NFS connection was interrupted for some minutes (longer than the NFS lock timeout), the former slave broker became master as soon as the NFS connection was restored. In contrast, the former master broker *never* realized that it had lost the NFS lock (perhaps because, as the man pages say, NFS calls on hard mounts have no timeout). As a result we ran into double-master scenarios two or three times.

Because of these problems, we began to experiment with NFS settings. In our test cases we simply disconnected and reconnected the master broker from NFS (with iptables). Switching to soft mounts was the first breakthrough: after the NFS connection was restored, the former master realized after about 10 to 15 minutes that it had lost the lock and shut down. We also reduced "timeo" and "retrans" so that NFS failures are propagated faster.

The next big change was to disable attribute caching (the "noac" option). With this it took only about 20 seconds until the former master broker recognized the failure and shut down. Perfect, BUT you can find a lot of warnings that this setting degrades performance. And even worse, some of my tests resulted in a corrupted persistence store.
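For reference, our test setup looked roughly like this. The server name, export path and timeout values below are placeholders, not the exact values from our environment:

    # Simulate an NFS outage on the master (NFS over TCP uses port 2049):
    iptables -A OUTPUT -p tcp --dport 2049 -j DROP    # cut the connection
    iptables -D OUTPUT -p tcp --dport 2049 -j DROP    # restore it

    # A soft mount with reduced timeouts and no attribute caching, i.e.
    # the combination that gave us the ~20 second failure detection:
    mount -t nfs -o soft,timeo=50,retrans=2,noac nfsserver:/export/amq /mnt/amq

"timeo" is in tenths of a second, so example values like these make a failed NFS call give up after a few retries instead of hanging indefinitely (the exact timing depends on the transport and NFS version), and "noac" keeps the client from answering file attribute checks out of its local cache.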
The crazy thing is that NFS settings are almost never mentioned when the shared-storage failover of AMQ is described (in blogs etc.). Most articles and examples just shut down the master broker, and the slave becomes master. In such simple scenarios our problems obviously never arise and the failover works perfectly. But as soon as the NFS connection is lost for some time, the outcome is quite unpredictable.

And this is NOT a problem of ActiveMQ. The failover of AMQ itself works just fine, but it can only work if the brokers are able to recognize the failure situation.
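To see why recognizing the failure matters: the shared-storage locking boils down conceptually to the loop below. This is only a simplified sketch of the idea in plain java.nio, not ActiveMQ's actual locker code, and the lock file path is made up:

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;

    public class SharedStoreLockSketch {
        public static void main(String[] args) throws Exception {
            // Hypothetical lock file on the shared NFS store
            File lockFile = new File("/mnt/amq/kahadb/lock");
            FileChannel channel = new RandomAccessFile(lockFile, "rw").getChannel();

            FileLock lock = null;
            while (lock == null) {
                lock = channel.tryLock();  // non-blocking; null while another broker holds it
                if (lock == null) {
                    Thread.sleep(10_000);  // still slave: wait and retry
                }
            }
            System.out.println("Lock acquired - this broker is now master");

            // The crux of our problem: on a hard mount, I/O against this
            // channel can block forever instead of failing, so a master
            // that has lost its NFS connection never notices that the
            // lock is gone.
        }
    }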
Stephan

On Fri, Jun 12, 2015 at 11:47 PM, Tim Bain <tb...@alumni.duke.edu> wrote:

> Stephan, can you describe which NFS settings resulted in which behavior?
>
> On Jun 12, 2015 8:34 AM, "Stephan Burkard" <sburk...@gmail.com> wrote:
>
> > Anuj
> >
> > Have a look at https://issues.apache.org/jira/browse/AMQ-5549
> >
> > You cannot avoid having either two or zero master brokers online during
> > a failover. The question is how long this situation lasts (see Arthur
> > Naseef's comment on AMQ-5549).
> >
> > In my failover tests with NFS shared storage I was able to reproduce
> > very different scenarios:
> > - the former master broker *never* shuts down
> > - the former master broker shuts down after 15 minutes
> > - the former master broker shuts down after 20 seconds
> >
> > The only difference between these scenarios was the NFS settings. My
> > overall impression is that the failover only works with a highly
> > available shared storage. As soon as one or more brokers lose the NFS
> > connection, things get crazy, and I even "managed" to corrupt the
> > persistence store during my tests.
> >
> > Also note the both-brokers-down problem
> > (https://issues.apache.org/jira/browse/AMQ-5568) that I discovered
> > during my tests.
> >
> > Cheers
> > Stephan
> >
> > On Thu, Apr 30, 2015 at 2:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
> >
> > > An NFS problem was the first thing I thought of when I saw
> > > out-of-order log lines, especially since you've had that problem
> > > before. And this outage lasted for over two minutes (which doesn't
> > > count as "slow" in my book; that's "unavailable" or "down" to me),
> > > which is pretty crazy; hopefully your ops team has looked into how
> > > that happened and taken steps to ensure it doesn't happen again.
> > >
> > > An NFS outage does justify a failover to the backup broker; to
> > > understand why, think about what prevents failover during normal
> > > operation. The master broker holds a file system lock on a DB lock
> > > file, and the slave broker tries repeatedly to acquire the same
> > > lock. As long as it can't, it knows the master broker is up and it
> > > can't become the master; at the point where the lock disappears
> > > because the master broker can't access NFS, the slave becomes active
> > > (at least, if it can access NFS; if not, then it doesn't know that
> > > it could become active and it can't read the messages from disk
> > > anyway). This is exactly what you would want to happen.
> > >
> > > The real problem here is the one in your last paragraph: when the
> > > slave acquires the lock because the master can't access NFS, the
> > > master isn't detecting that and becoming the slave. I'd suggest you
> > > try to recreate this failure (in a dev environment) by causing the
> > > master broker to be unable to access NFS and confirming that the
> > > master remains active even after the slave becomes the master.
> > > Assuming that happens, submit a JIRA bug report to describe the
> > > problem. Make sure you provide lots of details about your NFS setup
> > > (include version numbers, file system type, etc.) and about the
> > > O/Ses of the machines the brokers run on, since the behavior might
> > > vary based on some of those things and you want to make sure that
> > > whoever investigates this can reproduce it. But make sure you can
> > > reproduce it first.
> > >
> > > Tim
> > >
> > > Hi,
> > >
> > > I got the logs in this order only, and after further checking the
> > > system I found that NFS (where we put kahadb and the broker logs)
> > > was slow during that time.
> > >
> > > I can understand that logs are delayed or I/O operations are slow
> > > during that time, but that does not explain why the failover broker
> > > also opened its transport connector.
> > >
> > > The main concern here is that the (master-slave shared-storage)
> > > topology is broken, which should never happen. If I/O operations are
> > > not happening, the master broker should stop and let the failover
> > > broker serve the clients, but here the master didn't stop and both
> > > opened their connectors.
> > >
> > > Thanks,
> > > Anuj
> > >
> > > --
> > > View this message in context:
> > > http://activemq.2283324.n4.nabble.com/ActiveMQ-master-slave-topology-issue-BUG-tp4695677p4695731.html
> > > Sent from the ActiveMQ - User mailing list archive at Nabble.com.