Re: Proposal: External Update Coordination

Maxim Khutornenko Tue, 14 Oct 2014 12:12:24 -0700

Pausing the update on creation seems like a logical approach when
dealing with an inverted dependency model, i.e. the updater is happy to
act as long as it's greenlighted by the external signal. It's also
aligned with the failover experience, where coordinated updates are
rehydrated in a paused state waiting for a heartbeat (HB) to wake them.
That said, I am OK punting it for the sake of simplicity for now.
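
For illustration only, here is a minimal Python sketch of the
lack-of-ACK model discussed in this thread: an update is created paused,
the first heartbeat wakes it, and a missed heartbeat pauses it until an
explicit resume. It is kept purely in volatile memory. All names here
(PauseReason, CoordinatedUpdate, tick, the "OK"/"PAUSED" reply strings as
return values) are hypothetical and are not the Aurora scheduler's code
or API.

```python
import enum


class PauseReason(enum.Enum):
    # Hypothetical reasons; the thread only says "look at the pause reason".
    AWAITING_FIRST_HEARTBEAT = "awaiting_first_heartbeat"
    HEARTBEAT_TIMEOUT = "heartbeat_timeout"
    USER = "user"


class CoordinatedUpdate:
    """In-memory model of one coordinated update (illustrative only)."""

    def __init__(self, heartbeat_interval_secs, now):
        self.interval = heartbeat_interval_secs
        self.last_heartbeat = now
        # Created paused: no progress until the first heartbeat arrives.
        self.pause_reason = PauseReason.AWAITING_FIRST_HEARTBEAT

    @property
    def paused(self):
        return self.pause_reason is not None

    def heartbeat(self, now):
        """Record an ACK from the external service and return a reply."""
        self.last_heartbeat = now
        if self.pause_reason is PauseReason.AWAITING_FIRST_HEARTBEAT:
            # Heartbeat #1 wakes the update.
            self.pause_reason = None
            return "OK"
        # A later heartbeat never unpauses a timed-out or user-paused update.
        return "PAUSED" if self.paused else "OK"

    def tick(self, now):
        """Periodic check: pause if no heartbeat arrived within the interval."""
        if not self.paused and now - self.last_heartbeat > self.interval:
            self.pause_reason = PauseReason.HEARTBEAT_TIMEOUT

    def pause(self):
        """Explicit pause requested by a user."""
        self.pause_reason = PauseReason.USER

    def resume(self, now):
        """Explicit resume: clear the pause and restart the countdown."""
        self.pause_reason = None
        self.last_heartbeat = now


update = CoordinatedUpdate(heartbeat_interval_secs=10, now=0)
first_reply = update.heartbeat(now=5)   # first heartbeat unpauses
update.tick(now=16)                     # 11s without a heartbeat: pause
late_reply = update.heartbeat(now=17)   # late heartbeat does not resume
update.resume(now=18)                   # only an explicit resume does
```

Whether an explicit resume should also be allowed before the first
heartbeat is left open in the thread; the sketch permits it only to stay
short.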


Kevin?

On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfar...@apache.org> wrote:
> If the goal is to reduce complexity now and add features later, why not
> nuke both for now - kick off the update right away, and let lack of
> heartbeats serve as a uniform "unknown or unhealthy" signal?
>
> -=Bill
>
> On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> I am still +1 on the idea to have default paused state on creation. I
>> think we could still differentiate between initially paused and timed
>> out states internally by looking at pause reason. It's quite different
>> if we want to store explicit NACK reasons from the external service
>> though. That would require persistence and a bit more complicated
>> logic.
>>
>> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <kevi...@apache.org> wrote:
>> > I like the idea of implementing this scheduler-side purely through
>> volatile
>> > state, but the lack of feedback (generic vs specific error messages when
>> an
>> > update is paused) leaves something to be desired. Maybe we can address
>> that
>> > with a metadata field in the initial call to startUpdate (with an
>> optional
>> > link to a page where one can get more rich information about the state of
>> > the monitor sending/not sending heartbeats).
>> >
>> > The main drawback is that we may have to wait a maximum of one heartbeat
>> > interval to find out that an update should be paused.
>> >
>> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <ma...@apache.org>
>> wrote:
>> >
>> >> The main reason I preferred the lack-of-ACK approach over an explicit
>> >> NACK one is simplicity. As Joshua pointed out there is more state to
>> >> handle in that case. The lack-of-ACK model can be completely
>> >> implemented in volatile memory sidestepping the persistent storage
>> >> entirely. With the NACK we would need to reliably persist external
>> >> service call reasons to survive scheduler failovers. Not a huge
>> >> challenge but something to keep in mind.
>> >>
>> >> I still think the simplicity/reliability tradeoff is acceptable here
>> >> if we rely on the external service to stop sending heartbeats when a
>> >> health alert fires. This can be explicitly documented as an external
>> >> integration requirement. However, if the consensus is to go a more
>> >> reliable (though more complicated) NACK route I am happy to reconsider
>> >> the current proposal.
>> >>
>> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <jco...@twopensource.com>
>> >> wrote:
>> >> > "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a
>> NACK.
>> >> If
>> >> > we are going to let lack-of-ACK serve as the NACK, i don't think it's
>> >> safe
>> >> > to resume when we receive another ACK.  In other words, a service
>> >> toggling
>> >> > unhealthy might not be deemed safe to proceed."
>> >> >
>> >> > Lack-of-ACK is the scenario where connectivity between the monitor and
>> >> the
>> >> > scheduler is unavailable. Shouldn't the NACK scenario (everything is
>> not
>> >> > ok!) be handled by the monitoring service triggering an explicit
>> pause?
>> >> > I.e. section 2 should be updated to say "External service detects
>> service
>> >> > health problems and pauses the update" and section 4 becomes the
>> current
>> >> > section 2 (i.e. "Should a heartbeat not be received the scheduler
>> pauses
>> >> > the update.").
>> >> >
>> >> > I agree that it's unsafe to resume updates after receiving a
>> heartbeat
>> >> > after previously pausing due to a missed heartbeat. In that scenario
>> I'd
>> >> > think we'd want an explicit resumeJobUpdate. If the scenario we're
>> trying
>> >> > to handle is *never* received a heartbeat, that's a separate matter,
>> in
>> >> > that case unpausing upon receiving the first heartbeat would make
>> sense,
>> >> > but it feels like that complicates things quite a bit (now we need to
>> >> > differentiate between heartbeat #1 and heartbeat #N).
>> >> >
>> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org>
>> wrote:
>> >> >
>> >> >> What is the guidance for deploying while the heartbeat service is
>> >> broken?
>> >> >> I think i know the answer, but it's important to spell out.
>> >> >>
>> >> >>
>> >> >>
>> >> >> > Create a new coordinated job update in a paused
>> (ROLL_FORWARD_PAUSED)
>> >> >> > state to avoid any progress until the first heartbeat call arrives.
>> >> >>
>> >> >>
>> >> >> I'm not sold on this being ultimately beneficial.  In the worst case,
>> >> >> impact is still limited by the health check threshold.  Seems like
>> >> >> premature optimization at best, and an odd one if we proceed without
>> a
>> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
>> >> >>
>> >> >> Allow resuming of the paused-due-to-no-heartbeat update via a
>> >> >> > resumeJobUpdate call.
>> >> >>
>> >> >>
>> >> >> Are heartbeats required while rolling back?  If so, that might impact
>> >> the
>> >> >> design here and in other places.
>> >> >>
>> >> >> Allow resuming of the paused-due-to-no-heartbeat update via a fresh
>> >> >> > heartbeatJobUpdate call.
>> >> >>
>> >> >>
>> >> >> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a
>> NACK.
>> >> If
>> >> >> we are going to let lack-of-ACK serve as the NACK, i don't think it's
>> >> safe
>> >> >> to resume when we receive another ACK.  In other words, a service
>> >> toggling
>> >> >> unhealthy might not be deemed safe to proceed.
>> >> >>
>> >> >> Perhaps just sending OK (or a NOOP equivalent) in case of a
>> user-paused
>> >> job
>> >> >> > update would make more sense as there is nothing monitoring service
>> >> could
>> >> >> > do in that case. This should work fine with pause/resume
>> >> -aware/-agnostic
>> >> >> > monitoring service implementation.
>> >> >>
>> >> >>
>> >> >> This seems reasonable to me - heartbeats for a paused update should
>> not
>> >> >> pose a risk, but can be safely ignored.
>> >> >>
>> >> >>
>> >> >>
>> >> >> -=Bill
>> >> >>
>> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <
>> ma...@apache.org>
>> >> >> wrote:
>> >> >>
>> >> >> > Agreed. That would be a logical generalization of the post failover
>> >> >> > behavior.
>> >> >> >
>> >> >> > I have updated the above document with the following changes:
>> >> >> > - Reply with PAUSED any time a job was paused by user;
>> >> >> > - Start in paused state by default.
>> >> >> >
>> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <
>> kevi...@apache.org>
>> >> >> > wrote:
>> >> >> > > The doc mentioned that the scheduler will start an update
>> subject to
>> >> >> the
>> >> >> > > heartbeat countdown, and if it doesn't receive a heartbeat it
>> will
>> >> >> pause
>> >> >> > > the update. Why not start with the update
>> >> paused-due-to-no-heartbeat to
>> >> >> > > fail-fast any connectivity issues between the service providing
>> the
>> >> >> > > heartbeats and the scheduler?
>> >> >> > >
>> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <
>> >> ma...@apache.org>
>> >> >> > > wrote:
>> >> >> > >
>> >> >> > >> Hi all,
>> >> >> > >>
>> >> >> > >> We are proposing a new feature for the scheduler updater, which
>> you
>> >> >> > >> may find helpful.
>> >> >> > >>
>> >> >> > >> I have posted a brief feature summary here:
>> >> >> > >>
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
>> >> >> > >>
>> >> >> > >> Please, reply with your feedback/concerns/comments.
>> >> >> > >>
>> >> >> > >> Thanks,
>> >> >> > >> Maxim
>> >> >> > >>
>> >> >> >
>> >> >>
>> >>
>>
