Thanks for the feedback and questions.

I hadn't considered the buffering/flushing/synchronization aspects of the
underlying NFS configuration, but you're absolutely right that the
documentation for this solution needs to acknowledge the interplay between
the two sets of settings and explain how to choose values for each so the
overall goals are met.  I have no experience configuring NFS, so although I
understand in theory that writes can be deferred, it would be great to have
people who understand those aspects involved when that documentation gets
written.
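
To make the client-side half of that interplay concrete, here's a rough
sketch (in Java; the path is made up) of a heartbeat write that forces the
data out of the NFS client's cache with FileChannel.force().  Even then, an
export configured for asynchronous writes can acknowledge before anything
reaches disk, which is exactly the kind of thing the documentation would
need to call out:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    // Sketch only: forces each heartbeat write out of the NFS client's cache.
    // force(true) flushes data and metadata (including mtime) to the server,
    // but an async export can still acknowledge before the data is on disk.
    public class HeartbeatWrite {
        public static void writeHeartbeat(Path lockFile, String content) throws IOException {
            try (FileChannel channel = FileChannel.open(lockFile,
                    StandardOpenOption.WRITE, StandardOpenOption.CREATE)) {
                channel.truncate(0);
                channel.write(ByteBuffer.wrap(content.getBytes(StandardCharsets.UTF_8)));
                channel.force(true);  // fsync through to the NFS server
            }
        }

        public static void main(String[] args) throws IOException {
            writeHeartbeat(Paths.get("/mnt/nfs/activemq/lock"), "heartbeat");
        }
    }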

Re: the type of lock to be used in the solution I'm proposing, I have no
idea (my NFS experience is only as a user) and would appreciate suggestions
from those who are more knowledgeable about NFS.  I had assumed we'd use the
same type of lock as is currently being used (with the current behavior in
the face of network failures), but if another type would be more
appropriate, that would be great to know about.
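
For context, my assumption is that the existing lock is a plain java.nio
FileLock on the lock file, which over NFSv3 would be serviced by the
NLM/lockd side protocol and over NFSv4 by the protocol's built-in locking.
A minimal sketch of that kind of acquisition (path made up):

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.channels.FileLock;

    // Sketch only: the kind of exclusive lock I'm assuming is in use today.
    // tryLock() returns null when another process already holds the lock.
    public class DbLockProbe {
        public static void main(String[] args) throws IOException {
            RandomAccessFile file = new RandomAccessFile("/mnt/nfs/activemq/lock", "rw");
            FileLock lock = file.getChannel().tryLock();
            if (lock == null) {
                System.out.println("lock held by another process");
                file.close();
            } else {
                System.out.println("acquired exclusive lock");
            }
        }
    }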

Re: the master re-reading the data file, I wasn't planning on it, but it
could be done.  What I proposed would have a separate thread checking the
content of the file on the same periodicity as the writing thread (though
not necessarily synchronized with it, so the checks wouldn't necessarily
happen at the same time).  If locking is working correctly and network
access is reasonably stable, then when another process has written to the
lock file (such that the write thread would see a difference just before it
wrote), the master's write would fail if the other process still held the
lock, and it would succeed (as you'd want it to) if the other process had
lost the lock by losing network access.  So under normal conditions there
wouldn't be a need to check the content before writing.  The only scenario
I can think of where it might gain you something is if network access is
winking in and out rapidly for the different processes: one process gets a
write in and then immediately loses the lock, another process gets a write
in and immediately loses the lock, and so on.  If that's happening, and if
a process's reading thread happens to read the file after its own writing
thread writes but before the other processes write during that cycle, then
multiple processes could each think they've held the lock long enough to
become master.  That seems like a very unlikely scenario, but it's easy
enough to guard against by doing the read-before-write you asked about, so
I think it's worth doing.
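
To make that concrete, here's a rough sketch of the read-before-write guard
in the heartbeat path (the file layout, class names, and path are
placeholders, not a final design):

    import java.io.IOException;
    import java.lang.management.ManagementFactory;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.UUID;

    // Sketch only: periodic heartbeat with a read-before-write guard.  If the
    // lock file no longer contains what we last wrote, another broker has
    // claimed mastership and we step down instead of writing over its content.
    public class MasterHeartbeat {
        private final Path lockFile;
        private String lastWritten;

        public MasterHeartbeat(Path lockFile) {
            this.lockFile = lockFile;
        }

        // Returns true if this broker is still master after the heartbeat cycle.
        public boolean heartbeat() throws IOException {
            String current = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
            if (lastWritten != null && !current.equals(lastWritten)) {
                return false;  // someone else wrote the file; relinquish mastership
            }
            // Host, PID, local time, and a UUID so the content is unique on
            // every write even if clocks drift or repeat.
            String content = String.join("\n",
                    InetAddress.getLocalHost().getHostName(),
                    ManagementFactory.getRuntimeMXBean().getName(),  // typically "pid@host"
                    Long.toString(System.currentTimeMillis()),
                    UUID.randomUUID().toString());
            Files.write(lockFile, content.getBytes(StandardCharsets.UTF_8));
            lastWritten = content;
            return true;
        }

        public static void main(String[] args) throws IOException {
            MasterHeartbeat hb = new MasterHeartbeat(Paths.get("/mnt/nfs/activemq/lock"));
            System.out.println(hb.heartbeat() ? "still master" : "lost mastership");
        }
    }

In the actual implementation the read would presumably happen on the
separate checking thread with a timeout on the read operation, per item 5 of
the proposal quoted below, rather than inline like this.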

Tim

On Fri, Jun 19, 2015 at 10:19 AM, James A. Robinson <j...@highwire.org>
wrote:

> On Mon, Jun 15, 2015 at 7:08 AM Tim Bain <tb...@alumni.duke.edu> wrote:
> >
> > It seems pretty clear that the assumption that acquiring a single file lock
> > without doing any further checks will provide thread-safety in all cases is
> > not an accurate one.
> >
> > As I see it, here are the failures of the existing approach:
> >
> >    - Failure #1 is that when an NFS failure occurs, the master broker never
> >    recognizes that an NFS failure has occurred and never recognizes that the
> >    slave broker has replaced it as the master.  The broker needs to identify
> >    those things even when it has no other reason to write to NFS.
> >
> >    - Failure #2 is that the slave broker believes that it can immediately
> >    become the master.  This wouldn't be a problem if the master broker
> >    instantaneously recognized that it has been supplanted and immediately
> >    ceded control, but assuming anything happens instantaneously (especially
> >    across multiple machines) is pretty unrealistic.  This means there will be
> >    a period of unresponsiveness when a failover occurs.
> >
> >    - Failure #3 is that once the master recognizes that it no longer is the
> >    master, it needs to abort all pending writes (in a way that guarantees that
> >    the data files will not be corrupted if NFS returns when some have been
> >    cancelled and others have not).
>
> I don't know if this has been called out already, but it will be
> important for users to coordinate how their NFS is configured.  For
> example, activating asynchronous writes on the NFS side would
> probably make a mess of any assumptions we can make from the client
> side.  I also wonder how buffering might affect how a heartbeat
> would have to work, and whether mtime would fail to get propagated
> until enough data has been written to cause a flush to disk.
>
> > I've got a proposed solution that I think will address all of these
> > failures, but hopefully others will see ways to improve it. Here's what
> > I've got in mind:
> >
> >    1. It is no longer sufficient to hold an NFS lock on the DB lock file.
> >    In order to maintain master status, you must successfully write to the DB
> >    lock file within some time period. If you fail to do so within that time
> >    period, you must close all use of the DB files and relinquish master status
> >    to another broker.
> >
> >    2. When the master shuts down during normal NFS circumstances, it will
> >    delete the DB lock file.  If at any point a slave broker sees that there is
> >    no DB lock file or that the DB lock file is so stale that the master must
> >    have shut down (more on that later), it may immediately attempt to create
> >    one and begin writing to it.  If that write succeeds, it is the master.
>
> Random curiosity, is this a network lock manager (NLM) based lock?
>
> >    3. All brokers should determine whether the current master is still
> >    alive by checking the current content of the DB lock file against the
> >    content read the last time you checked, rather than simply locking the file
> >    and assuming that tells you who's got ownership. This means that the
> >    current master needs to update some content in the DB lock file to make it
> >    unique on each write; I propose that the content of the file be the
> >    broker's host, the broker's PID, the current local time on the broker at
> >    the time it did its write, and a UUID that will guarantee uniqueness of the
> >    content from write to write even in the face of time issues.  Note that
> >    only the UUID is actually required for this algorithm to work, but I think
> >    that having the other information will make it easier to troubleshoot.
>
> That sounds good.
>
> >    4. Because time can drift between machines, it is not sufficient to
> >    compare the write date on the DB lock file with your host's current time
> >    when determining that a file is stale; you must successfully read the file
> >    repeatedly over a time period and receive the same value each time in order
> >    to decide that the DB lock file is stale.
> >
> >    5. The master should use the same approach as the slaves to determine if
> >    it's still in control, by checking for changes to the content of the DB
> >    lock file. This means the master needs to positively confirm that each
> >    periodic write to the DB lock file succeeded by reading it back (in a
> >    separate thread, using a timeout on the read operation to identify
> >    situations where NFS doesn't respond), rather than simply assuming that its
> >    call to write() worked successfully.
> >
> >    6. When a slave determines that the master has failed to write to the DB
> >    lock file for longer than the timeout, it attempts to acquire the write
> >    lock on the DB lock file and write to it to become the master.  If it
> >    succeeds (because the master has lost NFS connectivity but the slave has
> >    not), it is the master.  If it fails because another slave acquired the
> >    lock first or because the slave has lost NFS connectivity, it goes back to
> >    monitoring for the master to fail to write to the DB lock file.
>
>
> Would the master re-read the data file just before the next
> write as a guard against some other process somehow snagging the
> lock out from under it?
>
> Jim
>
