It seems pretty clear that acquiring a single file lock, with no further checks, is not enough to guarantee that exactly one broker controls the store in all cases.
As I see it, here are the failures of the existing approach:

- Failure #1: when an NFS failure occurs, the master broker never recognizes that the failure has happened and never recognizes that the slave broker has replaced it as the master. The broker needs to detect both of those things even when it has no other reason to write to NFS.

- Failure #2: the slave broker believes that it can become the master immediately. This wouldn't be a problem if the master broker instantaneously recognized that it had been supplanted and immediately ceded control, but assuming anything happens instantaneously (especially across multiple machines) is pretty unrealistic. This means there will be a period of unresponsiveness when a failover occurs.

- Failure #3: once the master recognizes that it is no longer the master, it needs to abort all pending writes, in a way that guarantees the data files will not be corrupted if NFS returns while some writes have been cancelled and others have not.

I've got a proposed solution that I think will address all of these failures, but hopefully others will see ways to improve it. Here's what I've got in mind:

1. It is no longer sufficient to hold an NFS lock on the DB lock file. To maintain master status, a broker must successfully write to the DB lock file within some time period. If it fails to do so within that period, it must close all use of the DB files and relinquish master status to another broker.

2. When the master shuts down under normal NFS conditions, it deletes the DB lock file. If at any point a slave sees that there is no DB lock file, or that the DB lock file is so stale that the master must have shut down (more on that later), it may immediately attempt to create one and begin writing to it. If that write succeeds, it is the master.

3. All brokers should determine whether the current master is still alive by comparing the current content of the DB lock file against the content they read the last time they checked, rather than simply locking the file and assuming that tells them who owns it. This means the current master needs to make the file's content unique on each write; I propose that the content be the broker's host, the broker's PID, the broker's local time at the moment of the write, and a UUID that guarantees the content is unique from write to write even in the face of clock issues. Only the UUID is actually required for the algorithm to work, but I think the other information will make troubleshooting easier.

4. Because time can drift between machines, it is not sufficient to compare the write date on the DB lock file with your host's current time when deciding that the file is stale; you must successfully read the file repeatedly over a time period and receive the same value each time before concluding that the DB lock file is stale.

5. The master should use the same approach as the slaves to determine whether it is still in control, by checking for changes to the content of the DB lock file. This means the master needs to positively confirm that each periodic write to the DB lock file succeeded by reading it back (in a separate thread, with a timeout on the read to catch situations where NFS doesn't respond), rather than simply assuming that its call to write() worked. (There's a rough sketch of this keep-alive after the list.)
6. When a slave determines that the master has failed to write to the DB lock file for longer than the timeout, it attempts to acquire the write lock on the DB lock file and write to it to become the master. If it succeeds (because the master has lost NFS connectivity but the slave has not), it is the master. If it fails, either because another slave acquired the lock first or because this slave has lost NFS connectivity, it goes back to monitoring the DB lock file for staleness. (This check is also sketched after the list.)

7. If the master finds that it has failed to write to the DB lock file for some amount of time (which must be less than the overall timeout), it shuts down its use of the database files in a way that avoids corrupting them no matter how NFS connectivity comes and goes during the shutdown. Once it begins shutting down, the master must not write to the DB lock file even if NFS connectivity returns, because that would tell the slaves that it is alive and retaining ownership, which is no longer the case. Note: this may be difficult or impossible to do, and I don't know KahaDB or NFS well enough to know how feasible it is or what other options we have, so please give feedback on how best to achieve this goal.

8. Since we can't rely on the former master having any NFS/network connectivity, or being able to write any information indicating that its shutdown completed, I think there has to be a time limit ("the master broker must complete its shutdown within X amount of time, after which the slaves can assume they may take ownership of the data files"). That limit should probably be configurable so you can tune the window based on how quickly your particular broker, running under your particular load on your particular hardware, can shut down. To help users set that value effectively, we should log how long shutdown took so they can measure it and choose their timeout accordingly.

9. The system must enforce that the master's inactivity-identification interval plus the time to shut down is less than or equal to the amount of time a slave must see the DB lock file stay stale before becoming the master.

10. Once the master has completed shutting down, the slaves can attempt to become the master by successfully writing to the DB lock file, which allows the winner to take ownership of the data files and begin accepting connections. The new master must confirm that it owns the DB lock file (either with an explicit NFS lock test, if such a thing exists, or by confirming that the current content of the file is the content it wrote most recently) before accepting connections or writing to the data files.
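To make the master-side half of this concrete, here's a rough sketch of what I picture the keep-alive from items 1, 3, and 5 looking like. None of this is existing ActiveMQ or KahaDB code; the class, method, and field names are all invented, and a real implementation would have to hook into the persistence adapter's locker:

    import java.lang.management.ManagementFactory;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Instant;
    import java.util.UUID;
    import java.util.concurrent.*;

    // Sketch only: none of these names exist in ActiveMQ/KahaDB today.
    public class MasterKeepAliveSketch {

        private final Path lockFile;
        // A dedicated thread for NFS I/O, so a hung write/read can't block the broker itself.
        private final ExecutorService nfsIo = Executors.newSingleThreadExecutor();

        public MasterKeepAliveSketch(Path lockFile) {
            this.lockFile = lockFile;
        }

        /**
         * Writes a new, unique record to the DB lock file and confirms the write by
         * reading it back, all bounded by a timeout so an unresponsive NFS mount
         * shows up as a failed keep-alive rather than a hang.
         */
        public boolean writeAndVerify(long timeoutMillis) {
            String contents = String.join("|",
                    ManagementFactory.getRuntimeMXBean().getName(), // "pid@host", for troubleshooting
                    Instant.now().toString(),                        // broker-local time, for troubleshooting
                    UUID.randomUUID().toString());                   // the part that actually guarantees uniqueness
            Future<Boolean> attempt = nfsIo.submit(() -> {
                Files.write(lockFile, contents.getBytes(StandardCharsets.UTF_8));
                String readBack = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
                return contents.equals(readBack); // don't trust write() alone (item 5)
            });
            try {
                return attempt.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException | InterruptedException | ExecutionException e) {
                attempt.cancel(true); // NFS hung or failed: count this as a missed keep-alive
                return false;
            }
        }
    }

The broker would call writeAndVerify() on a schedule and treat a failure (or a few consecutive failures) as the trigger for the shutdown described in item 7.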
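Similarly, here's a rough sketch of the slave-side staleness check from items 4 and 6, with the item 9 constraint expressed as a startup check. Again, every name and number is invented, and the timeouts would need to be configurable as described in item 8:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;

    // Sketch only: the constants and names are invented, not existing configuration.
    public class SlaveStalenessCheckSketch {

        // Item 9: missed-write timeout + shutdown limit must be <= the stale threshold.
        static final long MASTER_MISSED_WRITE_TIMEOUT_MS = 10_000;
        static final long MASTER_SHUTDOWN_LIMIT_MS = 20_000;
        static final long STALE_THRESHOLD_MS = 30_000;
        static final long POLL_INTERVAL_MS = 2_000;

        /** Fail fast at startup if the configured timeouts violate item 9. */
        static void validateTimeouts() {
            if (MASTER_MISSED_WRITE_TIMEOUT_MS + MASTER_SHUTDOWN_LIMIT_MS > STALE_THRESHOLD_MS) {
                throw new IllegalStateException(
                        "A slave could take over before the old master finishes shutting down");
            }
        }

        /**
         * Item 4: a slave may only call the DB lock file stale if it successfully
         * reads the same content every time for the whole threshold. Comparing
         * content rather than timestamps sidesteps clock drift between machines.
         */
        static boolean lockFileIsStale(Path lockFile) throws InterruptedException {
            String first;
            try {
                first = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
            } catch (IOException e) {
                return false; // we can't read it ourselves, so we can't claim it's stale
            }
            long deadline = System.nanoTime() + STALE_THRESHOLD_MS * 1_000_000L;
            while (System.nanoTime() < deadline) {
                Thread.sleep(POLL_INTERVAL_MS);
                try {
                    String current = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
                    if (!current.equals(first)) {
                        return false; // the master wrote again, so it's still alive (item 3)
                    }
                } catch (IOException e) {
                    return false; // our own NFS view is now suspect; keep monitoring instead
                }
            }
            return true; // same content for the whole threshold: the master is presumed gone
        }
    }

A slave that finds the lock file stale (or missing, per item 2) would then write its own record and read it back, the same way the master-side sketch does, before touching the data files (item 10).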
Thoughts? Suggestions? One question I still have is whether it's actually realistic to think we can always safely cancel the master broker's threads without risking corruption of the data files. Should we make a copy of the data files (which has performance and storage implications) and have the new master work from that copy instead of the files the old master may still be writing to, so they can't step on each other?

Tim

> Anuj
>
> Have a look at https://issues.apache.org/jira/browse/AMQ-5549
>
> You cannot avoid having either two or zero master brokers online during a
> failover. The question is how long this situation lasts (see Arthur
> Naseef's comment on AMQ-5549).
>
> In my failover tests with NFS shared storage I was able to reproduce very
> different scenarios:
> - the former master broker *never* shuts down
> - the former master broker shuts down after 15 minutes
> - the former master broker shuts down after 20 seconds
>
> The only difference between these scenarios was the NFS settings. My
> overall impression is that the failover only works with a highly available
> shared storage. As soon as one or more brokers lose the NFS connection,
> the situation gets crazy, and I even "managed" to corrupt the persistence
> store during my tests.
>
> Also notice the both-brokers-down problem
> (https://issues.apache.org/jira/browse/AMQ-5568) that I discovered during
> my tests.
>
> Cheers
> Stephan
>
> On Thu, Apr 30, 2015 at 2:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
>
> > An NFS problem was the first thing I thought of when I saw out-of-order
> > log lines, especially since you've had that problem before. And this
> > outage lasted for over two minutes (which doesn't count as "slow" in my
> > book; that's "unavailable" or "down" to me), which is pretty crazy;
> > hopefully your ops team has looked into how that happened and taken
> > steps to ensure it doesn't happen again.
> >
> > An NFS outage does justify a failover to the backup broker; to
> > understand why, think about what prevents failover during normal
> > operation. The master broker holds a file system lock on a DB lock
> > file, and the slave broker tries repeatedly to acquire the same lock.
> > As long as it can't, it knows the master broker is up and it can't
> > become the master; at the point where the lock disappears because the
> > master broker can't access NFS, the slave becomes active (at least, if
> > it can access NFS; if not, then it doesn't know that it could become
> > active and it can't read the messages from disk anyway). This is
> > exactly what you would want to happen.
> >
> > The real problem here is the one in your last paragraph: when the slave
> > acquires the lock because the master can't access NFS, the master isn't
> > detecting that and becoming the slave. I'd suggest you try to recreate
> > this failure (in a dev environment) by causing the master broker to be
> > unable to access NFS and confirming that the master remains active even
> > after the slave becomes the master. Assuming that happens, submit a
> > JIRA bug report to describe the problem. Make sure you provide lots of
> > details about your NFS setup (include version numbers, file system
> > type, etc.) and about the O/Ses of the machines the brokers run on,
> > since the behavior might vary based on some of those things and you
> > want to make sure that whoever investigates this can reproduce it. But
> > make sure you can reproduce it first.
> >
> > Tim
> >
> > Hi,
> >
> > I got the logs in this order only, and after further checking the
> > system I found that NFS (where we keep KahaDB and the broker logs) was
> > slow during that time.
> >
> > I can understand that logging and I/O operations were slow during that
> > time, but that does not explain why the failover broker also opened its
> > transport connector.
> >
> > The main concern here is that the master-slave shared-storage topology
> > is broken, which should not happen in any case. If I/O operations are
> > not happening, the master broker should stop and let the failover
> > broker serve the clients, but here the master didn't stop and both
> > brokers opened their connectors.
> >
> > Thanks,
> > Anuj