Anuj,

Have a look at https://issues.apache.org/jira/browse/AMQ-5549

You cannot avoid having either two or zero master brokers online during a
failover. The question is how long this situation lasts (see Arthur Naseef's
comment on AMQ-5549).

In my failover tests with NFS shared storage I was able to reproduce very
different scenarios:

- the former master broker *never* shuts down
- the former master broker shuts down after 15 minutes
- the former master broker shuts down after 20 seconds

The only difference between these scenarios was the NFS settings.

My overall impression is that failover only works with highly available
shared storage. As soon as one or more brokers lose the NFS connection, the
behavior becomes unpredictable, and I even "managed" to corrupt the
persistence store during my tests.

Also note the both-brokers-down problem
(https://issues.apache.org/jira/browse/AMQ-5568) that I discovered during my
tests.

Cheers
Stephan

On Thu, Apr 30, 2015 at 2:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:

> An NFS problem was the first thing I thought of when I saw out-of-order log
> lines, especially since you've had that problem before. And this outage
> lasted for over two minutes (which doesn't count as "slow" in my book;
> that's "unavailable" or "down" to me), which is pretty crazy; hopefully
> your ops team has looked into how that happened and taken steps to ensure
> it doesn't happen again.
>
> An NFS outage does justify a failover to the backup broker; to understand
> why, think about what prevents failover during normal operation. The
> master broker holds a file system lock on a DB lock file, and the slave
> broker tries repeatedly to acquire the same lock. As long as it can't, it
> knows the master broker is up and it can't become the master; at the point
> where the lock disappears because the master broker can't access NFS, the
> slave becomes active (at least, if it can access NFS; if not, then it
> doesn't know that it could become active and it can't read the messages
> from disk anyway).
> This is exactly what you would want to happen.
>
> The real problem here is the one in your last paragraph: when the slave
> acquires the lock because the master can't access NFS, the master isn't
> detecting that and becoming the slave. I'd suggest you try to recreate
> this failure (in a dev environment) by causing the master broker to be
> unable to access NFS and confirming that the master remains active even
> after the slave becomes the master. Assuming that happens, submit a JIRA
> bug report to describe the problem. Make sure you provide lots of details
> about your NFS setup (include version numbers, file system type, etc.) and
> about the O/Ses of the machines the brokers run on, since the behavior
> might vary based on some of those things and you want to make sure that
> whoever investigates this can reproduce it. But make sure you can
> reproduce it first.
>
> Tim
>
> Hi,
>
> I got the logs in this order only, and after further checking the system I
> found out that NFS (where we put KahaDB and the broker logs) was slow
> during that time.
>
> I can understand that logging or I/O operations were delayed during that
> time, but it does not explain why the failover broker also opened its
> transport connector.
>
> The main concern here is that the (master-slave-shared-storage) topology
> is broken, which should not happen in any case. If I/O operations are not
> happening, the master broker should stop and let the failover broker serve
> the clients, but here the master didn't stop and both opened the connector.
>
> Thanks,
> Anuj
>
> --
> View this message in context:
> http://activemq.2283324.n4.nabble.com/ActiveMQ-master-slave-topology-issue-BUG-tp4695677p4695731.html
> Sent from the ActiveMQ - User mailing list archive at Nabble.com.
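
For reference, the lock handoff described in the quoted message (the master
holds a file system lock on a DB lock file while the slave keeps trying to
acquire it) can be sketched in plain Java. This is only an illustration under
assumptions: the class LockProbe, its method names and the temp lock file are
hypothetical, and it uses java.nio file locks directly rather than ActiveMQ's
own locker code. Both "brokers" run in one JVM here, so the second attempt is
rejected by the JVM itself instead of by the file server.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical illustration only; not ActiveMQ's locker implementation. */
class LockProbe {

    /** Tries once to take the exclusive lock; returns null if it is already held. */
    static FileLock tryBecomeMaster(FileChannel channel) throws IOException {
        try {
            return channel.tryLock();   // null when another process holds the lock
        } catch (OverlappingFileLockException e) {
            return null;                // lock is already held inside this JVM
        }
    }

    /** True while the lock object is still valid from this JVM's point of view. */
    static boolean stillMaster(FileLock lock) {
        return lock != null && lock.isValid();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the shared DB lock file; a real setup would use the shared store.
        Path lockFile = Files.createTempFile("demo-db", ".lock");
        try (FileChannel master = FileChannel.open(lockFile, StandardOpenOption.WRITE);
             FileChannel slave = FileChannel.open(lockFile, StandardOpenOption.WRITE)) {
            FileLock held = tryBecomeMaster(master);
            System.out.println("master holds lock: " + stillMaster(held));
            System.out.println("slave got lock: " + (tryBecomeMaster(slave) != null));

            held.release();             // simulates the master's lock disappearing
            System.out.println("master holds lock: " + stillMaster(held));
            System.out.println("slave got lock: " + (tryBecomeMaster(slave) != null));
        } finally {
            Files.deleteIfExists(lockFile);
        }
    }
}
```

Note that FileLock.isValid() only reflects local state (it stays true until
the lock is released or its channel is closed) and does not query the file
server, so a check like stillMaster() would not by itself notice an NFS
outage.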