Re: Proposal: External Update Coordination

Bill Farner Tue, 14 Oct 2014 12:20:11 -0700

Wait - simpler solution than what?  We're talking about not doing either.

-=Bill


On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevi...@apache.org> wrote:

> I think waiting for the first heartbeat before taking any action is the
> simpler solution here as it allows the implementation to be entirely
> soft-state and still catches the bugs I described.
>
> The implementation is just PulseMonitorImpl<UpdateId> - heartbeat calls
> pulse and mutation operations check isAlive. I think the code might
> actually work as-is.
>
> On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > Pausing update on creation seems like a logical approach when dealing
> > with inverted dependency model. I.e. updater is happy to act as long
> > as it's greenlighted by the external signal. It's also aligned with a
> > failover experience where coordinated updates are rehydrated in paused
> > state waiting for HB awakening. That said, I am OK punting it for the
> > sake of simplicity for now.
> >
> > Kevin?
> >
> > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfar...@apache.org>
> wrote:
> > > If the goal is to reduce complexity now and add features later, why not
> > > nuke both for now - kick off the update right away, and let lack of
> > > heartbeats serve as a uniform "unknown or unhealthy" signal?
> > >
> > > -=Bill
> > >
> > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <ma...@apache.org>
> > wrote:
> > >
> > >> I am still +1 on the idea to have default paused state on creation. I
> > >> think we could still differentiate between initially paused and timed
> > >> out states internally by looking at pause reason. It's quite different
> > >> if we want to store explicit NACK reasons from the external service
> > >> though. That would require persistence and a bit more complicated
> > >> logic.
> > >>
> > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <kevi...@apache.org>
> > wrote:
> > >> > I like the idea of implementing this scheduler-side purely through
> > >> volatile
> > >> > state, but the lack of feedback (generic vs specific error messages
> > when
> > >> an
> > >> > update is paused) leaves something to be desired. Maybe we can
> address
> > >> that
> > >> > with a metadata field in the initial call to startUpdate (with an
> > >> optional
> > >> > link to a page where one can get more rich information about the
> > state of
> > >> > the monitor sending/not sending heartbeats).
> > >> >
> > >> > The main drawback is that we may have to wait a maximum of one
> > heartbeat
> > >> > interval to find out that an update should be paused.
> > >> >
> > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <
> ma...@apache.org>
> > >> wrote:
> > >> >
> > >> >> The main reason I preferred the lack-of-ACK approach over an
> explicit
> > >> >> NACK one is simplicity. As Joshua pointed out there is more state
> to
> > >> >> handle in that case. The lack-of-ACK model can be completely
> > >> >> implemented in volatile memory sidestepping the persistent storage
> > >> >> entirely. With the NACK we would need to reliably persist external
> > >> >> service call reasons to survive scheduler failovers. Not a huge
> > >> >> challenge but something to keep in mind.
> > >> >>
> > >> >> I still think the simplicity/reliability tradeoff is acceptable
> here
> > >> >> if we rely on external service to abort heartbeats in case of a
> > health
> > >> >> alert fired. This can be explicitly documented as an external
> > >> >> integration requirement. However, If the consensus is to go a more
> > >> >> reliable (though more complicated) NACK route I am happy to
> > reconsider
> > >> >> the current proposal.
> > >> >>
> > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <
> > jco...@twopensource.com>
> > >> >> wrote:
> > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we don't have a
> > >> NACK.
> > >> >> If
> > >> >> > we are going to let lack-of-ACK serve as the NACK, i don't think
> > it's
> > >> >> safe
> > >> >> > to resume when we receive another ACK.  In other words, a service
> > >> >> toggling
> > >> >> > unhealthy might not be deemed safe to proceed."
> > >> >> >
> > >> >> > Lack-of-ACK is the scenario where connectivity between the
> monitor
> > and
> > >> >> the
> > >> >> > scheduler is unavailable. Shouldn't the NACK scenario (everything
> > is
> > >> not
> > >> >> > ok!) be handled by the monitoring service triggering an explicit
> > >> pause?
> > >> >> > I.e. section 2 should be updated to say "External service detects
> > >> service
> > >> >> > health problems and pauses the update" and section 4 becomes the
> > >> current
> > >> >> > section 2 (i.e. "Should a heartbeat not be received the scheduler
> > >> pauses
> > >> >> > the update.").
> > >> >> >
> > >> >> > I agree that it's unsafe to to resume updates after receiving a
> > >> heartbeat
> > >> >> > after previously pausing due to a missed heartbeat. In that
> > scenario
> > >> I'd
> > >> >> > think we'd want an explicit resumeJobUpdate. If the scenario
> we're
> > >> trying
> > >> >> > to handle is *never* received a heartbeat, that's a separate
> > matter,
> > >> in
> > >> >> > that case unpausing upon receiving the first heartbeat would make
> > >> sense,
> > >> >> > but it feels like that complicates things quite a bit (now we
> need
> > to
> > >> >> > differentiate between heartbeat #1 and hearbeat #N).
> > >> >> >
> > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org
> >
> > >> wrote:
> > >> >> >
> > >> >> >> What is the guidance for deploying while the heartbeat service
> is
> > >> >> broken?
> > >> >> >> I think i know the answer, but it's important to spell out.
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> > Create a new coordinated job update in a paused
> > >> (ROLL_FORWARD_PAUSED)
> > >> >> >> > state to avoid any progress until the first heartbeat call
> > arrives.
> > >> >> >>
> > >> >> >>
> > >> >> >> I'm not sold on this being ultimately beneficial.  In the worst
> > case,
> > >> >> >> impact is still limited by the health check threshold.  Seems
> like
> > >> >> >> premature optimization at best, and an odd one if we proceed
> > without
> > >> a
> > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
> > >> >> >>
> > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update via a
> > >> >> >> > resumeJobUpdate call.
> > >> >> >>
> > >> >> >>
> > >> >> >> Are heartbeats required while rolling back?  If so, that might
> > impact
> > >> >> the
> > >> >> >> design here and in other places.
> > >> >> >>
> > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update via a
> > fresh
> > >> >> >> > heartbeatJobUpdate call.
> > >> >> >>
> > >> >> >>
> > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but we don't have a
> > >> NACK.
> > >> >> If
> > >> >> >> we are going to let lack-of-ACK serve as the NACK, i don't think
> > it's
> > >> >> safe
> > >> >> >> to resume when we receive another ACK.  In other words, a
> service
> > >> >> toggling
> > >> >> >> unhealthy might not be deemed safe to proceed.
> > >> >> >>
> > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in case of a
> > >> user-paused
> > >> >> job
> > >> >> >> > update would make more sense as there is nothing monitoring
> > service
> > >> >> could
> > >> >> >> > do in that case. This should work fine with pause/resume
> > >> >> -aware/-agnostic
> > >> >> >> > monitoring service implementation.
> > >> >> >>
> > >> >> >>
> > >> >> >> This seems reasonable to me - heartbeats for a paused update
> > should
> > >> not
> > >> >> >> pose a risk, but can be safely ignored.
> > >> >> >>
> > >> >> >>
> > >> >> >>
> > >> >> >> -=Bill
> > >> >> >>
> > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <
> > >> ma...@apache.org>
> > >> >> >> wrote:
> > >> >> >>
> > >> >> >> > Agreed. That would be a logical generalization of the post
> > failover
> > >> >> >> > behavior.
> > >> >> >> >
> > >> >> >> > I have updated the above document with the following changes:
> > >> >> >> > - Reply with PAUSED any time a job was paused by user;
> > >> >> >> > - Start in paused state by default.
> > >> >> >> >
> > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <
> > >> kevi...@apache.org>
> > >> >> >> > wrote:
> > >> >> >> > > The doc mentioned that the scheduler will start an update
> > >> subject to
> > >> >> >> the
> > >> >> >> > > heartbeat countdown, and if it doesn't receive a heartbeat
> it
> > >> will
> > >> >> >> pause
> > >> >> >> > > the update. Why not start with the update
> > >> >> paused-due-to-no-heartbeat to
> > >> >> >> > > fail-fast any connectivity issues between the service
> > providing
> > >> the
> > >> >> >> > > heartbeats and the scheduler?
> > >> >> >> > >
> > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <
> > >> >> ma...@apache.org>
> > >> >> >> > > wrote:
> > >> >> >> > >
> > >> >> >> > >> Hi all,
> > >> >> >> > >>
> > >> >> >> > >> We are proposing a new feature for the scheduler updater,
> > which
> > >> you
> > >> >> >> > >> may find helpful.
> > >> >> >> > >>
> > >> >> >> > >> I have posed a brief feature summary here:
> > >> >> >> > >>
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >> >> >> > >>
> > >> >> >> > >> Please, reply with your feedback/concerns/comments.
> > >> >> >> > >>
> > >> >> >> > >> Thanks,
> > >> >> >> > >> Maxim
> > >> >> >> > >>
> > >> >> >> >
> > >> >> >>
> > >> >>
> > >>
> >
>

Re: Proposal: External Update Coordination

Reply via email to