Chatted with Maxim and Bill, I think we figured it out

I think the confusion stems from the fact that there are two types of
pauses in this system, explicit, persisted pauses generated by the
pauseJobUpdate RPC and implicit, volatile pauses caused due to the absence
of a sufficiently fresh heartbeat (such as in the case of a network
partition).

In case a monitoring service detects a problem it should call the explicit
pauseJobUpdate RPC, which will cause a state change that requires an
explicit resumeJobUpdate RPC to resume. That feature already exists.

But, we need one more thing to make this reliable - heartbeats to protect
against network partitions between the scheduler and the monitoring
service. These can be volatile and lightweight - the scheduler just checks
for a sufficiently fresh heartbeat before it performs an update action, and
if none is present it simply refuses to perform the action. If the
partition heals a new heartbeat will arrive (if the update being monitored
should still be allowed to proceed) and the scheduler will allow the update
to proceed.


On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfar...@apache.org> wrote:

> I think we should assess that after building the rest of the feature.  IIUC
> the rest of the code doesn't care if the update is initially paused.
>
> -=Bill
>
> On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > Can we get a consensus here? Looks like the only sticky point left is
> > around starting an update in paused vs. non-paused state. I can argue
> > either way as it's easy to add later if needed.
> >
> > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfar...@apache.org> wrote:
> > > I'm not arguing against the merits of the approach.  Just feeling out
> > > whether that should be done _after_ the rest of the heartbeat support.
> > > Seems like it can be cleanly added at the end to get something usable
> > > earlier.
> > >
> > > -=Bill
> > >
> > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevi...@apache.org>
> > wrote:
> > >
> > >> I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
> > >> signal, and punting on a more complex NACK signal (which we'd have to
> > >> reliably persist).
> > >>
> > >> I think the only disagreement in this thread is whether the default
> > state
> > >> for a new update should be running or waiting-for-heartbeat. I think
> > >> waiting for a heartbeat is not only a more correct implementation (no
> > risk
> > >> of acting after a failover but before the heartbeat timeout) but
> > simpler to
> > >> implement (initialize the PulseMonitor data structure as empty rather
> > than
> > >> with a synthetic heartbeat).
> > >>
> > >> From an API consumer perspective the sequence is:
> > >>
> > >> 1. API client sends a startUpdate RPC to the scheduler
> > >> 2. API client receives an OK response, then arranges for something to
> > call
> > >> heartbeat with that updateId on some interval
> > >> 3. Whatever is supposed to send heartbeats sends one immediately, then
> > >> starts sending them on some smaller interval
> > >>
> > >> Waiting for the first heartbeat ensures that this sequence has been
> > >> completed successfully, while not waiting for it only ensure that
> step 1
> > >> has happened.
> > >>
> > >>
> > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfar...@apache.org>
> > wrote:
> > >>
> > >> > Wait - simpler solution than what?  We're talking about not doing
> > either.
> > >> >
> > >> > -=Bill
> > >> >
> > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevi...@apache.org
> >
> > >> > wrote:
> > >> >
> > >> > > I think waiting for the first heartbeat before taking any action
> is
> > the
> > >> > > simpler solution here as it allows the implementation to be
> entirely
> > >> > > soft-state and still catches the bugs I described.
> > >> > >
> > >> > > The implementation is just PulseMonitorImpl<UpdateId> - heartbeat
> > calls
> > >> > > pulse and mutation operations check isAlive. I think the code
> might
> > >> > > actually work as-is.
> > >> > >
> > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <
> > ma...@apache.org>
> > >> > > wrote:
> > >> > >
> > >> > > > Pausing update on creation seems like a logical approach when
> > dealing
> > >> > > > with inverted dependency model. I.e. updater is happy to act as
> > long
> > >> > > > as it's greenlighted by the external signal. It's also aligned
> > with a
> > >> > > > failover experience where coordinated updates are rehydrated in
> > >> paused
> > >> > > > state waiting for HB awakening. That said, I am OK punting it
> for
> > the
> > >> > > > sake of simplicity for now.
> > >> > > >
> > >> > > > Kevin?
> > >> > > >
> > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <
> wfar...@apache.org
> > >
> > >> > > wrote:
> > >> > > > > If the goal is to reduce complexity now and add features
> later,
> > why
> > >> > not
> > >> > > > > nuke both for now - kick off the update right away, and let
> > lack of
> > >> > > > > heartbeats serve as a uniform "unknown or unhealthy" signal?
> > >> > > > >
> > >> > > > > -=Bill
> > >> > > > >
> > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <
> > >> ma...@apache.org
> > >> > >
> > >> > > > wrote:
> > >> > > > >
> > >> > > > >> I am still +1 on the idea to have default paused state on
> > >> creation.
> > >> > I
> > >> > > > >> think we could still differentiate between initially paused
> and
> > >> > timed
> > >> > > > >> out states internally by looking at pause reason. It's quite
> > >> > different
> > >> > > > >> if we want to store explicit NACK reasons from the external
> > >> service
> > >> > > > >> though. That would require persistence and a bit more
> > complicated
> > >> > > > >> logic.
> > >> > > > >>
> > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <
> > >> kevi...@apache.org>
> > >> > > > wrote:
> > >> > > > >> > I like the idea of implementing this scheduler-side purely
> > >> through
> > >> > > > >> volatile
> > >> > > > >> > state, but the lack of feedback (generic vs specific error
> > >> > messages
> > >> > > > when
> > >> > > > >> an
> > >> > > > >> > update is paused) leaves something to be desired. Maybe we
> > can
> > >> > > address
> > >> > > > >> that
> > >> > > > >> > with a metadata field in the initial call to startUpdate
> > (with
> > >> an
> > >> > > > >> optional
> > >> > > > >> > link to a page where one can get more rich information
> about
> > the
> > >> > > > state of
> > >> > > > >> > the monitor sending/not sending heartbeats).
> > >> > > > >> >
> > >> > > > >> > The main drawback is that we may have to wait a maximum of
> > one
> > >> > > > heartbeat
> > >> > > > >> > interval to find out that an update should be paused.
> > >> > > > >> >
> > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <
> > >> > > ma...@apache.org>
> > >> > > > >> wrote:
> > >> > > > >> >
> > >> > > > >> >> The main reason I preferred the lack-of-ACK approach over
> an
> > >> > > explicit
> > >> > > > >> >> NACK one is simplicity. As Joshua pointed out there is
> more
> > >> state
> > >> > > to
> > >> > > > >> >> handle in that case. The lack-of-ACK model can be
> completely
> > >> > > > >> >> implemented in volatile memory sidestepping the persistent
> > >> > storage
> > >> > > > >> >> entirely. With the NACK we would need to reliably persist
> > >> > external
> > >> > > > >> >> service call reasons to survive scheduler failovers. Not a
> > huge
> > >> > > > >> >> challenge but something to keep in mind.
> > >> > > > >> >>
> > >> > > > >> >> I still think the simplicity/reliability tradeoff is
> > acceptable
> > >> > > here
> > >> > > > >> >> if we rely on external service to abort heartbeats in case
> > of a
> > >> > > > health
> > >> > > > >> >> alert fired. This can be explicitly documented as an
> > external
> > >> > > > >> >> integration requirement. However, If the consensus is to
> go
> > a
> > >> > more
> > >> > > > >> >> reliable (though more complicated) NACK route I am happy
> to
> > >> > > > reconsider
> > >> > > > >> >> the current proposal.
> > >> > > > >> >>
> > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <
> > >> > > > jco...@twopensource.com>
> > >> > > > >> >> wrote:
> > >> > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we
> don't
> > >> > have a
> > >> > > > >> NACK.
> > >> > > > >> >> If
> > >> > > > >> >> > we are going to let lack-of-ACK serve as the NACK, i
> don't
> > >> > think
> > >> > > > it's
> > >> > > > >> >> safe
> > >> > > > >> >> > to resume when we receive another ACK.  In other words,
> a
> > >> > service
> > >> > > > >> >> toggling
> > >> > > > >> >> > unhealthy might not be deemed safe to proceed."
> > >> > > > >> >> >
> > >> > > > >> >> > Lack-of-ACK is the scenario where connectivity between
> the
> > >> > > monitor
> > >> > > > and
> > >> > > > >> >> the
> > >> > > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario
> > >> > (everything
> > >> > > > is
> > >> > > > >> not
> > >> > > > >> >> > ok!) be handled by the monitoring service triggering an
> > >> > explicit
> > >> > > > >> pause?
> > >> > > > >> >> > I.e. section 2 should be updated to say "External
> service
> > >> > detects
> > >> > > > >> service
> > >> > > > >> >> > health problems and pauses the update" and section 4
> > becomes
> > >> > the
> > >> > > > >> current
> > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not be received the
> > >> > scheduler
> > >> > > > >> pauses
> > >> > > > >> >> > the update.").
> > >> > > > >> >> >
> > >> > > > >> >> > I agree that it's unsafe to to resume updates after
> > >> receiving a
> > >> > > > >> heartbeat
> > >> > > > >> >> > after previously pausing due to a missed heartbeat. In
> > that
> > >> > > > scenario
> > >> > > > >> I'd
> > >> > > > >> >> > think we'd want an explicit resumeJobUpdate. If the
> > scenario
> > >> > > we're
> > >> > > > >> trying
> > >> > > > >> >> > to handle is *never* received a heartbeat, that's a
> > separate
> > >> > > > matter,
> > >> > > > >> in
> > >> > > > >> >> > that case unpausing upon receiving the first heartbeat
> > would
> > >> > make
> > >> > > > >> sense,
> > >> > > > >> >> > but it feels like that complicates things quite a bit
> > (now we
> > >> > > need
> > >> > > > to
> > >> > > > >> >> > differentiate between heartbeat #1 and hearbeat #N).
> > >> > > > >> >> >
> > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <
> > >> > wfar...@apache.org
> > >> > > >
> > >> > > > >> wrote:
> > >> > > > >> >> >
> > >> > > > >> >> >> What is the guidance for deploying while the heartbeat
> > >> service
> > >> > > is
> > >> > > > >> >> broken?
> > >> > > > >> >> >> I think i know the answer, but it's important to spell
> > out.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> > Create a new coordinated job update in a paused
> > >> > > > >> (ROLL_FORWARD_PAUSED)
> > >> > > > >> >> >> > state to avoid any progress until the first heartbeat
> > call
> > >> > > > arrives.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> I'm not sold on this being ultimately beneficial.  In
> the
> > >> > worst
> > >> > > > case,
> > >> > > > >> >> >> impact is still limited by the health check threshold.
> > >> Seems
> > >> > > like
> > >> > > > >> >> >> premature optimization at best, and an odd one if we
> > proceed
> > >> > > > without
> > >> > > > >> a
> > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
> > >> > > > >> >> >>
> > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update
> > via
> > >> a
> > >> > > > >> >> >> > resumeJobUpdate call.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> Are heartbeats required while rolling back?  If so,
> that
> > >> might
> > >> > > > impact
> > >> > > > >> >> the
> > >> > > > >> >> >> design here and in other places.
> > >> > > > >> >> >>
> > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update
> > via
> > >> a
> > >> > > > fresh
> > >> > > > >> >> >> > heartbeatJobUpdate call.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but we
> don't
> > >> > have a
> > >> > > > >> NACK.
> > >> > > > >> >> If
> > >> > > > >> >> >> we are going to let lack-of-ACK serve as the NACK, i
> > don't
> > >> > think
> > >> > > > it's
> > >> > > > >> >> safe
> > >> > > > >> >> >> to resume when we receive another ACK.  In other
> words, a
> > >> > > service
> > >> > > > >> >> toggling
> > >> > > > >> >> >> unhealthy might not be deemed safe to proceed.
> > >> > > > >> >> >>
> > >> > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in case
> > of a
> > >> > > > >> user-paused
> > >> > > > >> >> job
> > >> > > > >> >> >> > update would make more sense as there is nothing
> > >> monitoring
> > >> > > > service
> > >> > > > >> >> could
> > >> > > > >> >> >> > do in that case. This should work fine with
> > pause/resume
> > >> > > > >> >> -aware/-agnostic
> > >> > > > >> >> >> > monitoring service implementation.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> This seems reasonable to me - heartbeats for a paused
> > update
> > >> > > > should
> > >> > > > >> not
> > >> > > > >> >> >> pose a risk, but can be safely ignored.
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >>
> > >> > > > >> >> >> -=Bill
> > >> > > > >> >> >>
> > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <
> > >> > > > >> ma...@apache.org>
> > >> > > > >> >> >> wrote:
> > >> > > > >> >> >>
> > >> > > > >> >> >> > Agreed. That would be a logical generalization of the
> > post
> > >> > > > failover
> > >> > > > >> >> >> > behavior.
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > I have updated the above document with the following
> > >> > changes:
> > >> > > > >> >> >> > - Reply with PAUSED any time a job was paused by
> user;
> > >> > > > >> >> >> > - Start in paused state by default.
> > >> > > > >> >> >> >
> > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <
> > >> > > > >> kevi...@apache.org>
> > >> > > > >> >> >> > wrote:
> > >> > > > >> >> >> > > The doc mentioned that the scheduler will start an
> > >> update
> > >> > > > >> subject to
> > >> > > > >> >> >> the
> > >> > > > >> >> >> > > heartbeat countdown, and if it doesn't receive a
> > >> heartbeat
> > >> > > it
> > >> > > > >> will
> > >> > > > >> >> >> pause
> > >> > > > >> >> >> > > the update. Why not start with the update
> > >> > > > >> >> paused-due-to-no-heartbeat to
> > >> > > > >> >> >> > > fail-fast any connectivity issues between the
> service
> > >> > > > providing
> > >> > > > >> the
> > >> > > > >> >> >> > > heartbeats and the scheduler?
> > >> > > > >> >> >> > >
> > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim
> Khutornenko <
> > >> > > > >> >> ma...@apache.org>
> > >> > > > >> >> >> > > wrote:
> > >> > > > >> >> >> > >
> > >> > > > >> >> >> > >> Hi all,
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> We are proposing a new feature for the scheduler
> > >> updater,
> > >> > > > which
> > >> > > > >> you
> > >> > > > >> >> >> > >> may find helpful.
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> I have posed a brief feature summary here:
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> Please, reply with your
> feedback/concerns/comments.
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> > >> Thanks,
> > >> > > > >> >> >> > >> Maxim
> > >> > > > >> >> >> > >>
> > >> > > > >> >> >> >
> > >> > > > >> >> >>
> > >> > > > >> >>
> > >> > > > >>
> > >> > > >
> > >> > >
> > >> >
> > >>
> >
>



-- 
Kevin Sweeney
@kts

Reply via email to