It seems pretty clear that acquiring a single file lock, without doing any
further checks, cannot be assumed to guarantee that only one broker acts as
the master in all cases.

As I see it, here are the failures of the existing approach:

   - Failure #1 is that when an NFS failure occurs, the master broker never
   recognizes that an NFS failure has occurred and never recognizes that the
   slave broker has replaced it as the master.  The broker needs to identify
   those things even when it has no other reason to write to NFS.
   - Failure #2 is that the slave broker believes that it can immediately
   become the master.  This wouldn't be a problem if the master broker
   instantaneously recognized that it had been supplanted and immediately
   ceded control, but assuming anything happens instantaneously (especially
   across multiple machines) is pretty unrealistic.  This means there will be
   a window during a failover when both brokers believe they are the master.
   - Failure #3 is that once the master recognizes that it is no longer the
   master, it needs to abort all pending writes (in a way that guarantees that
   the data files will not be corrupted if NFS returns when some have been
   cancelled and others have not).

I've got a proposed solution that I think will address all of these
failures, but hopefully others will see ways to improve it. Here's what
I've got in mind:

   1. It is no longer sufficient to hold an NFS lock on the DB lock file.
   In order to maintain master status, you must successfully write to the DB
   lock file within some time period. If you fail to do so within that time
   period, you must close all use of the DB files and relinquish master status
   to another broker.
   2. When the master shuts down during normal NFS circumstances, it will
   delete the DB lock file.  If at any point a slave broker sees that there is
   no DB lock file or that the DB lock file is so stale that the master must
   have shut down (more on that later), it may immediately attempt to create
   one and begin writing to it.  If that write succeeds, it is the master.
   3. All brokers should determine whether the current master is still
   alive by checking the current content of the DB lock file against the
   content read the last time they checked, rather than simply locking the
   file and assuming that tells you who's got ownership.  This means the
   current master needs to update the content of the DB lock file so that it
   is unique on each write; I propose that the content be the broker's host,
   the broker's PID, the broker's local time at the moment of the write, and
   a UUID that guarantees the content is unique from write to write even in
   the face of time issues (there's a rough sketch of such a record after
   this list).  Note that only the UUID is actually required for this
   algorithm to work, but I think having the other information will make
   troubleshooting easier.
   4. Because time can drift between machines, it is not sufficient to
   compare the write date on the DB lock file with your host's current time
   when determining that the file is stale; you must successfully read the
   file repeatedly over a time period and receive the same value each time in
   order to decide that the DB lock file is stale (the second sketch after
   this list shows what that check might look like).
   5. The master should use the same approach as the slaves to determine if
   it's still in control, by checking for changes to the content of the DB
   lock file. This means the master needs to positively confirm that each
   periodic write to the DB lock file succeeded by reading it back (in a
   separate thread, using a timeout on the read operation to identify
   situations where NFS doesn't respond), rather than simply assuming that its
   call to write() worked successfully.
   6. When a slave determines that the master has failed to write to the DB
   lock file for longer than the timeout, it attempts to acquire the write
   lock on the DB lock file and write to it to become the master.  If it
   succeeds (because the master has lost NFS connectivity but the slave has
   not), it is the master.  If it fails because another slave acquired the
   lock first or because the slave has lost NFS connectivity, it goes back to
   monitoring for the master to fail to write to the DB lock file.
   7. If the master finds that it has failed to write to the DB lock file
   for some amount of time (which must be less than the overall timeout), it
   shuts down its use of the database files in a way that will avoid
   corrupting the data files no matter how NFS connectivity comes and goes
   during the shutdown process.  The master must not write to the DB lock file
   once it begins shutting down, even if NFS connectivity returns during the
   shutdown, because that would indicate to the slaves that it is alive and
   retaining ownership, which is not the case when shutting down.  Note: this
   may be difficult or impossible to do, and I don't know either KahaDB or NFS
   well enough to know how feasible this is or what other options we could
   use, so please give feedback on how best to achieve this goal.
   8. Since we can't rely on the former master having any NFS/network
   connectivity and being able to write any information to indicate that
   shutdown completed, I think there has to be a time limit ("the master
   broker must complete its shutdown within X amount of time, after which the
   slaves can assume they may take ownership of the data files"), which should
   probably be configurable so you can tune that window based on how quickly
   your particular broker, running under your particular load on your
   particular hardware, can shut down.  To facilitate setting that value
   effectively, we should probably log how long shutdown took, so that users
   can test that and set their timeout appropriately.
   9. The system must enforce that the master's inactivity-detection
   interval plus its shutdown time is less than or equal to the amount of
   time a slave must see the DB lock file stay stale before becoming the
   master.
   10. Once the master has completed shutting down, the slaves can attempt
   to become the master by successfully writing to the DB lock file, which
   will allow them to take ownership of the data files and begin accepting
   connections.  The new master must confirm that it owns the DB lock file
   (either with an explicit NFS lock test, if such things exist, or by
   confirming that the current content of the file is the content that it
   wrote most recently) before accepting connections or beginning to write to
   the data files.
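
To make steps 3 and 5 a bit more concrete, here's a rough sketch (in Java) of
what I have in mind for the lock-file record and the master's write-and-verify
heartbeat.  None of these classes exist in ActiveMQ or KahaDB today; the names
are invented purely for illustration, and the read-back is only as meaningful
as the NFS client's caching behavior allows, which is one of the things we'd
need to verify.

    import java.io.IOException;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.time.Instant;
    import java.util.UUID;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;

    // Hypothetical record written to the DB lock file on each heartbeat (step 3):
    // host, PID, local write time, and a UUID that changes on every write.
    final class LockRecord {
        final String host;
        final long pid;
        final Instant writtenAt;
        final UUID token;

        LockRecord(String host, long pid, Instant writtenAt, UUID token) {
            this.host = host;
            this.pid = pid;
            this.writtenAt = writtenAt;
            this.token = token;
        }

        static LockRecord fresh() throws IOException {
            return new LockRecord(InetAddress.getLocalHost().getHostName(),
                    ProcessHandle.current().pid(), Instant.now(), UUID.randomUUID());
        }

        String serialize() {
            return host + "|" + pid + "|" + writtenAt + "|" + token;
        }
    }

    // Hypothetical master-side heartbeat (step 5): write a fresh record, then read
    // it back with a timeout instead of trusting that the call to write() worked.
    final class MasterHeartbeat {
        private final Path lockFile;
        private final ExecutorService io = Executors.newSingleThreadExecutor();

        MasterHeartbeat(Path lockFile) {
            this.lockFile = lockFile;
        }

        // Returns true if the write was confirmed; false means the master should
        // begin the shutdown sequence described in step 7.
        boolean writeAndVerify(long timeoutMillis) {
            try {
                String expected = LockRecord.fresh().serialize();
                Future<Boolean> confirmed = io.submit(() -> {
                    Files.write(lockFile, expected.getBytes(StandardCharsets.UTF_8));
                    byte[] readBack = Files.readAllBytes(lockFile);
                    return expected.equals(new String(readBack, StandardCharsets.UTF_8));
                });
                // The timeout is what catches NFS hanging rather than failing fast.
                return confirmed.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (Exception e) {
                return false;
            }
        }
    }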
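
And here's an equally rough sketch of the slave-side staleness check from
steps 4 and 6 (again, invented names rather than existing code):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.NoSuchFileException;
    import java.nio.file.Path;

    // Hypothetical slave-side monitor (steps 4 and 6): the lock file only counts as
    // stale once its content has stayed identical for the whole staleness window,
    // so no comparison between the two machines' clocks is ever needed.
    final class SlaveMonitor {
        private final Path lockFile;
        private final long pollIntervalMillis;
        private final long stalenessWindowMillis;

        SlaveMonitor(Path lockFile, long pollIntervalMillis, long stalenessWindowMillis) {
            this.lockFile = lockFile;
            this.pollIntervalMillis = pollIntervalMillis;
            this.stalenessWindowMillis = stalenessWindowMillis;
        }

        // Blocks until the lock file is missing (clean master shutdown, step 2) or
        // has gone unchanged for the full window; the caller then attempts takeover.
        void awaitStaleMaster() throws InterruptedException {
            String lastContent = null;
            long unchangedSince = System.currentTimeMillis();
            while (true) {
                Thread.sleep(pollIntervalMillis);
                String current;
                try {
                    current = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
                } catch (NoSuchFileException e) {
                    return;    // no lock file at all: step 2 says we may try to take over now
                } catch (IOException e) {
                    continue;  // we can't reach NFS ourselves; no new information, keep waiting
                }
                if (!current.equals(lastContent)) {
                    lastContent = current;                       // master is still writing
                    unchangedSince = System.currentTimeMillis(); // restart the staleness window
                } else if (System.currentTimeMillis() - unchangedSince >= stalenessWindowMillis) {
                    return;    // identical content for the whole window: treat the master as dead
                }
            }
        }
    }

When awaitStaleMaster() returns, the slave would make the takeover attempt
from step 6 (acquire the lock and write its own record) and then confirm its
ownership the same way the heartbeat above does, per step 10, before accepting
any connections.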

Thoughts?  Suggestions?

One question I still have is whether it's actually realistic to think that
we can always safely cancel the master broker's threads without risking
corruption of the data files.  Should we make a copy of the data files
(which has performance and storage implications) and have the new master
work from those files instead of the ones the old master will be writing
to, so they can't step on each other?

Tim

> Anuj
>
> Have a look at https://issues.apache.org/jira/browse/AMQ-5549
>
> You cannot avoid having either two or zero master brokers online during a
> failover. The question is how long this situation lasts (see Arthur
> Naseef's comment on AMQ-5549).
>
> In my failover-tests with NFS shared storage I was able to reproduce very
> different scenarios:
> - the former master broker *never* shuts down
> - the former master broker shuts down after 15 minutes
> - the former master broker shuts down after 20 seconds
>
> The only difference between these scenarios was the NFS settings. My
> overall impression is that the failover only works with highly available
> shared storage. As soon as one or more brokers lose the NFS connection, the
> situation gets crazy, and I even "managed" to corrupt the persistence store
> during my tests.
>
> Also note the both-brokers-down problem (
> https://issues.apache.org/jira/browse/AMQ-5568) that I discovered during my
> tests.
>
> Cheers
> Stephan
>
> On Thu, Apr 30, 2015 at 2:40 PM, Tim Bain <tb...@alumni.duke.edu> wrote:
>
> > An NFS problem was the first thing I thought of when I saw out-of-order log
> > lines, especially since you've had that problem before.  And this outage
> > lasted for over two minutes (which doesn't count as "slow" in my book;
> > that's "unavailable" or "down" to me), which is pretty crazy; hopefully
> > your ops team has looked into how that happened and taken steps to ensure
> > it doesn't happen again.
> >
> > An NFS outage does justify a failover to the backup broker; to understand
> > why, think about what prevents failover during normal operation.  The
> > master broker holds a file system lock on a DB lock file, and the slave
> > broker tries repeatedly to acquire the same lock.  As long as it can't, it
> > knows the master broker is up and it can't become the master; at the point
> > where the lock disappears because the master broker can't access NFS, the
> > slave becomes active (at least, if it can access NFS; if not, then it
> > doesn't know that it could become active and it can't read the messages
> > from disk anyway).  This is exactly what you would want to happen.
> >
> > The real problem here is the one in your last paragraph: when the slave
> > acquires the lock because the master can't access NFS, the master isn't
> > detecting that and becoming the slave.  I'd suggest you try to recreate
> > this failure (in a dev environment) by causing the master broker to be
> > unable to access NFS and confirming that the master remains active even
> > after the slave becomes the master.  Assuming that happens, submit a JIRA
> > bug report to describe the problem.  Make sure you provide lots of details
> > about your NFS setup (include version numbers, file system type, etc.) and
> > about the O/Ses of the machines the brokers run on, since the behavior
> > might vary based on some of those things and you want to make sure that
> > whoever investigates this can reproduce it.  But make sure you can
> > reproduce it first.
> >
> > Tim
> > Hi,
> >
> > I got the logs in this order only, and after further checking the system
> > I found that NFS (where we put KahaDB and the broker logs) was slow during
> > that time.
> >
> > I can understand that logs are delayed and I/O operations are slow during
> > that time, but it does not justify why the failover broker also opened its
> > transport connector.
> >
> > The main concern here is that the (master-slave-shared-storage) topology
> > is broken, which should not happen in any case. If I/O operations are not
> > happening, the master broker should stop and let the failover broker serve
> > the clients, but here the master didn't stop and both opened their
> > connectors.
> >
> > Thanks,
> > Anuj
> >
> >
> >
> >
> >
>
