Re: Proposal: External Update Coordination

Bill Farner Thu, 16 Oct 2014 12:30:39 -0700

+1

-=Bill


On Thu, Oct 16, 2014 at 11:21 AM, Maxim Khutornenko <ma...@apache.org>
wrote:

> Correct. The presence of the SessionKey does indeed mean that
> heartbeats are going to be authenticated. Given that the external service
> has to solve the authentication story to use pauseJobUpdate anyway, having
> heartbeats authenticated seems like a natural progression. Also, given
> our current admin thrift interface it's easier to have an
> authenticated RPC in it rather than not (scheduler_client.py
> automatically injects SessionKey arg into all methods that are not
> part of ReadOnlyScheduler interface).
>
> On Thu, Oct 16, 2014 at 10:52 AM, Kevin Sweeney
> <kswee...@twitter.com.invalid> wrote:
> > I inferred that authentication was required due to the presence of a
> > SessionKey in the RPC. Of course any authentication mechanism here could
> > have serious scaling issues (barring something like HTTP basic auth in
> > memory)
> >
> > On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen <jco...@twopensource.com>
> > wrote:
> >
> >> What are our thoughts about authentication with regards to heartbeats?
> It
> >> seems like they should be authenticated since there does exist the
> >> potential for a malicious actor to send its own heartbeats even if the
> real
> >> monitoring service has detected a problem and ceased sending heartbeats.
> >> I'm not sure exactly how large the attack surface is (if the service is
> >> truly down the scheduler would detect that and roll back the update
> >> regardless), but I think it's worth discussing as we work on the initial
> >> design.
> >>
> >> On Wed, Oct 15, 2014 at 7:49 PM, Bill Farner <wfar...@apache.org>
> wrote:
> >>
> >> > David - the plan is to synthesize the waiting state.  Exactly how is
> not
> >> > yet certain.
> >> >
> >> > On Wednesday, October 15, 2014, Maxim Khutornenko <ma...@apache.org>
> >> > wrote:
> >> >
> >> > > It is certainly possible to add new state or a status message but I
> >> > > don't think it's a blocker for the first iteration. Provided there
> is
> >> > > enough demand a state/message could be synthesized during the 'get'
> >> > > call based on the volatile state.
> >> > >
> >> > > On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin <
> >> da...@dmclaughlin.com
> >> > > > wrote:
> >> > > > +1 for pause being explicit RPC pauses, but does it really add
> >> > complexity
> >> > > > to just add a new state (WAITING?) when no heartbeat is sent? Not
> >> being
> >> > > > able to see that an update was blocked because of a lack of
> heartbeat
> >> > > seems
> >> > > > like a missing feature.
> >> > > >
> >> > > > On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <
> ma...@apache.org
> >> > > > wrote:
> >> > > >
> >> > > >> +1. Updated the doc:
> >> > > >>
> >> > > >>
> >> > >
> >> >
> >>
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > > >>
> >> > > >> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org
> >> > > > wrote:
> >> > > >> > +1 to the scheduler not proceeding on an update when heartbeats
> >> are
> >> > > >> absent,
> >> > > >> > and requiring the heartbeat service to explicitly call
> >> > pauseJobUpdate
> >> > > >> when
> >> > > >> > it detects problems.
> >> > > >> >
> >> > > >> > -=Bill
> >> > > >> >
> >> > > >> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney
> >> > > >> <kswee...@twitter.com.invalid
> >> > > >> >> wrote:
> >> > > >> >
> >> > > >> >> Chatted with Maxim and Bill, I think we figured it out
> >> > > >> >>
> >> > > >> >> I think the confusion stems from the fact that there are two
> >> types
> >> > of
> >> > > >> >> pauses in this system, explicit, persisted pauses generated by
> >> the
> >> > > >> >> pauseJobUpdate RPC and implicit, volatile pauses caused by
> >> the
> >> > > >> absence
> >> > > >> >> of a sufficiently fresh heartbeat (such as in the case of a
> >> network
> >> > > >> >> partition).
> >> > > >> >>
> >> > > >> >> In case a monitoring service detects a problem it should call
> the
> >> > > >> explicit
> >> > > >> >> pauseJobUpdate RPC, which will cause a state change that
> requires
> >> > an
> >> > > >> >> explicit resumeJobUpdate RPC to resume. That feature already
> >> > exists.
> >> > > >> >>
> >> > > >> >> But, we need one more thing to make this reliable -
> heartbeats to
> >> > > >> protect
> >> > > >> >> against network partitions between the scheduler and the
> >> monitoring
> >> > > >> >> service. These can be volatile and lightweight - the scheduler
> >> just
> >> > > >> checks
> >> > > >> >> for a sufficiently fresh heartbeat before it performs an
> update
> >> > > action,
> >> > > >> and
> >> > > >> >> if none is present it simply refuses to perform the action. If
> >> the
> >> > > >> >> partition heals a new heartbeat will arrive (if the update
> being
> >> > > >> monitored
> >> > > >> >> should still be allowed to proceed) and the scheduler will
> allow
> >> > the
> >> > > >> update
> >> > > >> >> to proceed.
> >> > > >> >>
> >> > > >> >>
> >> > > >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <
> >> wfar...@apache.org
> >> > > >
> >> > > >> wrote:
> >> > > >> >>
> >> > > >> >> > I think we should assess that after building the rest of the
> >> > > feature.
> >> > > >> >> IIUC
> >> > > >> >> > the rest of the code doesn't care if the update is initially
> >> > > paused.
> >> > > >> >> >
> >> > > >> >> > -=Bill
> >> > > >> >> >
> >> > > >> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <
> >> > > ma...@apache.org
> >> > > >> >
> >> > > >> >> > wrote:
> >> > > >> >> >
> >> > > >> >> > > Can we get a consensus here? Looks like the only sticky
> point
> >> > > left
> >> > > >> is
> >> > > >> >> > > around starting an update in paused vs. non-paused state.
> I
> >> can
> >> > > >> argue
> >> > > >> >> > > either way as it's easy to add later if needed.
> >> > > >> >> > >
> >> > > >> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <
> >> > wfar...@apache.org
> >> > > >
> >> > > >> >> wrote:
> >> > > >> >> > > > I'm not arguing against the merits of the approach.
> Just
> >> > > feeling
> >> > > >> out
> >> > > >> >> > > > whether that should be done _after_ the rest of the
> >> heartbeat
> >> > > >> >> support.
> >> > > >> >> > > > Seems like it can be cleanly added at the end to get
> >> > something
> >> > > >> usable
> >> > > >> >> > > > earlier.
> >> > > >> >> > > >
> >> > > >> >> > > > -=Bill
> >> > > >> >> > > >
> >> > > >> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <
> >> > > >> kevi...@apache.org>
> >> > > >> >> > > wrote:
> >> > > >> >> > > >
> >> > > >> >> > > >> I'm +1 for using lack of heartbeats as a uniform
> >> > > >> >> unknown-or-unhealthy
> >> > > >> >> > > >> signal, and punting on a more complex NACK signal
> (which
> >> > we'd
> >> > > >> have
> >> > > >> >> to
> >> > > >> >> > > >> reliably persist).
> >> > > >> >> > > >>
> >> > > >> >> > > >> I think the only disagreement in this thread is whether
> >> the
> >> > > >> default
> >> > > >> >> > > state
> >> > > >> >> > > >> for a new update should be running or
> >> > waiting-for-heartbeat. I
> >> > > >> think
> >> > > >> >> > > >> waiting for a heartbeat is not only a more correct
> >> > > implementation
> >> > > >> >> (no
> >> > > >> >> > > risk
> >> > > >> >> > > >> of acting after a failover but before the heartbeat
> >> timeout)
> >> > > but
> >> > > >> >> > > simpler to
> >> > > >> >> > > >> implement (initialize the PulseMonitor data structure
> as
> >> > empty
> >> > > >> >> rather
> >> > > >> >> > > than
> >> > > >> >> > > >> with a synthetic heartbeat).
> >> > > >> >> > > >>
> >> > > >> >> > > >> From an API consumer perspective the sequence is:
> >> > > >> >> > > >>
> >> > > >> >> > > >> 1. API client sends a startUpdate RPC to the scheduler
> >> > > >> >> > > >> 2. API client receives an OK response, then arranges
> for
> >> > > >> something
> >> > > >> >> to
> >> > > >> >> > > call
> >> > > >> >> > > >> heartbeat with that updateId on some interval
> >> > > >> >> > > >> 3. Whatever is supposed to send heartbeats sends one
> >> > > immediately,
> >> > > >> >> then
> >> > > >> >> > > >> starts sending them on some smaller interval
> >> > > >> >> > > >>
> >> > > >> >> > > >> Waiting for the first heartbeat ensures that this
> sequence
> >> > has
> >> > > >> been
> >> > > >> >> > > >> completed successfully, while not waiting for it only
> >> ensures
> >> > > that
> >> > > >> >> > step 1
> >> > > >> >> > > >> has happened.
> >> > > >> >> > > >>
> >> > > >> >> > > >>
> >> > > >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <
> >> > > >> wfar...@apache.org>
> >> > > >> >> > > wrote:
> >> > > >> >> > > >>
> >> > > >> >> > > >> > Wait - simpler solution than what?  We're talking
> about
> >> > not
> >> > > >> doing
> >> > > >> >> > > either.
> >> > > >> >> > > >> >
> >> > > >> >> > > >> > -=Bill
> >> > > >> >> > > >> >
> >> > > >> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <
> >> > > >> >> kevi...@apache.org
> >> > > >> >> > >
> >> > > >> >> > > >> > wrote:
> >> > > >> >> > > >> >
> >> > > >> >> > > >> > > I think waiting for the first heartbeat before
> taking
> >> > any
> >> > > >> action
> >> > > >> >> > is
> >> > > >> >> > > the
> >> > > >> >> > > >> > > simpler solution here as it allows the
> implementation
> >> to
> >> > > be
> >> > > >> >> > entirely
> >> > > >> >> > > >> > > soft-state and still catches the bugs I described.
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > The implementation is just
> PulseMonitorImpl<UpdateId>
> >> -
> >> > > >> >> heartbeat
> >> > > >> >> > > calls
> >> > > >> >> > > >> > > pulse and mutation operations check isAlive. I
> think
> >> the
> >> > > code
> >> > > >> >> > might
> >> > > >> >> > > >> > > actually work as-is.
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim
> Khutornenko <
> >> > > >> >> > > ma...@apache.org>
> >> > > >> >> > > >> > > wrote:
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > > Pausing update on creation seems like a logical
> >> > approach
> >> > > >> when
> >> > > >> >> > > dealing
> >> > > >> >> > > >> > > > with inverted dependency model. I.e. updater is
> >> happy
> >> > to
> >> > > >> act
> >> > > >> >> as
> >> > > >> >> > > long
> >> > > >> >> > > >> > > > as it's greenlighted by the external signal. It's
> >> also
> >> > > >> aligned
> >> > > >> >> > > with a
> >> > > >> >> > > >> > > > failover experience where coordinated updates are
> >> > > >> rehydrated
> >> > > >> >> in
> >> > > >> >> > > >> paused
> >> > > >> >> > > >> > > > state waiting for HB awakening. That said, I am
> OK
> >> > > punting
> >> > > >> it
> >> > > >> >> > for
> >> > > >> >> > > the
> >> > > >> >> > > >> > > > sake of simplicity for now.
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > > > Kevin?
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <
> >> > > >> >> > wfar...@apache.org
> >> > > >> >> > > >
> >> > > >> >> > > >> > > wrote:
> >> > > >> >> > > >> > > > > If the goal is to reduce complexity now and add
> >> > > features
> >> > > >> >> > later,
> >> > > >> >> > > why
> >> > > >> >> > > >> > not
> >> > > >> >> > > >> > > > > nuke both for now - kick off the update right
> >> away,
> >> > > and
> >> > > >> let
> >> > > >> >> > > lack of
> >> > > >> >> > > >> > > > > heartbeats serve as a uniform "unknown or
> >> unhealthy"
> >> > > >> signal?
> >> > > >> >> > > >> > > > >
> >> > > >> >> > > >> > > > > -=Bill
> >> > > >> >> > > >> > > > >
> >> > > >> >> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim
> >> Khutornenko <
> >> > > >> >> > > >> ma...@apache.org
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > > wrote:
> >> > > >> >> > > >> > > > >
> >> > > >> >> > > >> > > > >> I am still +1 on the idea to have default
> paused
> >> > > state
> >> > > >> on
> >> > > >> >> > > >> creation.
> >> > > >> >> > > >> > I
> >> > > >> >> > > >> > > > >> think we could still differentiate between
> >> > initially
> >> > > >> paused
> >> > > >> >> > and
> >> > > >> >> > > >> > timed
> >> > > >> >> > > >> > > > >> out states internally by looking at pause
> reason.
> >> > > It's
> >> > > >> >> quite
> >> > > >> >> > > >> > different
> >> > > >> >> > > >> > > > >> if we want to store explicit NACK reasons from
> >> the
> >> > > >> external
> >> > > >> >> > > >> service
> >> > > >> >> > > >> > > > >> though. That would require persistence and a
> bit
> >> > more
> >> > > >> >> > > complicated
> >> > > >> >> > > >> > > > >> logic.
> >> > > >> >> > > >> > > > >>
> >> > > >> >> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin
> Sweeney <
> >> > > >> >> > > >> kevi...@apache.org>
> >> > > >> >> > > >> > > > wrote:
> >> > > >> >> > > >> > > > >> > I like the idea of implementing this
> >> > scheduler-side
> >> > > >> >> purely
> >> > > >> >> > > >> through
> >> > > >> >> > > >> > > > >> volatile
> >> > > >> >> > > >> > > > >> > state, but the lack of feedback (generic vs
> >> > > specific
> >> > > >> >> error
> >> > > >> >> > > >> > messages
> >> > > >> >> > > >> > > > when
> >> > > >> >> > > >> > > > >> an
> >> > > >> >> > > >> > > > >> > update is paused) leaves something to be
> >> desired.
> >> > > >> Maybe
> >> > > >> >> we
> >> > > >> >> > > can
> >> > > >> >> > > >> > > address
> >> > > >> >> > > >> > > > >> that
> >> > > >> >> > > >> > > > >> > with a metadata field in the initial call to
> >> > > >> startUpdate
> >> > > >> >> > > (with
> >> > > >> >> > > >> an
> >> > > >> >> > > >> > > > >> optional
> >> > > >> >> > > >> > > > >> > link to a page where one can get more rich
> >> > > information
> >> > > >> >> > about
> >> > > >> >> > > the
> >> > > >> >> > > >> > > > state of
> >> > > >> >> > > >> > > > >> > the monitor sending/not sending heartbeats).
> >> > > >> >> > > >> > > > >> >
> >> > > >> >> > > >> > > > >> > The main drawback is that we may have to
> wait a
> >> > > >> maximum
> >> > > >> >> of
> >> > > >> >> > > one
> >> > > >> >> > > >> > > > heartbeat
> >> > > >> >> > > >> > > > >> > interval to find out that an update should
> be
> >> > > paused.
> >> > > >> >> > > >> > > > >> >
> >> > > >> >> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim
> >> > Khutornenko
> >> > > <
> >> > > >> >> > > >> > > ma...@apache.org>
> >> > > >> >> > > >> > > > >> wrote:
> >> > > >> >> > > >> > > > >> >
> >> > > >> >> > > >> > > > >> >> The main reason I preferred the lack-of-ACK
> >> > > approach
> >> > > >> >> over
> >> > > >> >> > an
> >> > > >> >> > > >> > > explicit
> >> > > >> >> > > >> > > > >> >> NACK one is simplicity. As Joshua pointed
> out
> >> > > there
> >> > > >> is
> >> > > >> >> > more
> >> > > >> >> > > >> state
> >> > > >> >> > > >> > > to
> >> > > >> >> > > >> > > > >> >> handle in that case. The lack-of-ACK model
> can
> >> > be
> >> > > >> >> > completely
> >> > > >> >> > > >> > > > >> >> implemented in volatile memory sidestepping
> >> the
> >> > > >> >> persistent
> >> > > >> >> > > >> > storage
> >> > > >> >> > > >> > > > >> >> entirely. With the NACK we would need to
> >> > reliably
> >> > > >> >> persist
> >> > > >> >> > > >> > external
> >> > > >> >> > > >> > > > >> >> service call reasons to survive scheduler
> >> > > failovers.
> >> > > >> >> Not a
> >> > > >> >> > > huge
> >> > > >> >> > > >> > > > >> >> challenge but something to keep in mind.
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >> >> I still think the simplicity/reliability
> >> > tradeoff
> >> > > is
> >> > > >> >> > > acceptable
> >> > > >> >> > > >> > > here
> >> > > >> >> > > >> > > > >> >> if we rely on external service to abort
> >> > > heartbeats in
> >> > > >> >> case
> >> > > >> >> > > of a
> >> > > >> >> > > >> > > > health
> >> > > >> >> > > >> > > > >> >> alert fired. This can be explicitly
> documented
> >> > as
> >> > > an
> >> > > >> >> > > external
> >> > > >> >> > > >> > > > >> >> integration requirement. However, if the
> >> > consensus
> >> > > >> is to
> >> > > >> >> > go
> >> > > >> >> > > a
> >> > > >> >> > > >> > more
> >> > > >> >> > > >> > > > >> >> reliable (though more complicated) NACK
> route
> >> I
> >> > am
> >> > > >> happy
> >> > > >> >> > to
> >> > > >> >> > > >> > > > reconsider
> >> > > >> >> > > >> > > > >> >> the current proposal.
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua
> Cohen
> >> <
> >> > > >> >> > > >> > > > jco...@twopensource.com>
> >> > > >> >> > > >> > > > >> >> wrote:
> >> > > >> >> > > >> > > > >> >> > "The heartbeatJobUpdate RPC serves as an
> >> ACK,
> >> > > but
> >> > > >> we
> >> > > >> >> > don't
> >> > > >> >> > > >> > have a
> >> > > >> >> > > >> > > > >> NACK.
> >> > > >> >> > > >> > > > >> >> If
> >> > > >> >> > > >> > > > >> >> > we are going to let lack-of-ACK serve as
> the
> >> > > NACK,
> >> > > >> i
> >> > > >> >> > don't
> >> > > >> >> > > >> > think
> >> > > >> >> > > >> > > > it's
> >> > > >> >> > > >> > > > >> >> safe
> >> > > >> >> > > >> > > > >> >> > to resume when we receive another ACK.
> In
> >> > other
> >> > > >> >> words,
> >> > > >> >> > a
> >> > > >> >> > > >> > service
> >> > > >> >> > > >> > > > >> >> toggling
> >> > > >> >> > > >> > > > >> >> > unhealthy might not be deemed safe to
> >> > proceed."
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> > Lack-of-ACK is the scenario where
> >> connectivity
> >> > > >> between
> >> > > >> >> > the
> >> > > >> >> > > >> > > monitor
> >> > > >> >> > > >> > > > and
> >> > > >> >> > > >> > > > >> >> the
> >> > > >> >> > > >> > > > >> >> > scheduler is unavailable. Shouldn't the
> NACK
> >> > > >> scenario
> >> > > >> >> > > >> > (everything
> >> > > >> >> > > >> > > > is
> >> > > >> >> > > >> > > > >> not
> >> > > >> >> > > >> > > > >> >> > ok!) be handled by the monitoring service
> >> > > >> triggering
> >> > > >> >> an
> >> > > >> >> > > >> > explicit
> >> > > >> >> > > >> > > > >> pause?
> >> > > >> >> > > >> > > > >> >> > I.e. section 2 should be updated to say
> >> > > "External
> >> > > >> >> > service
> >> > > >> >> > > >> > detects
> >> > > >> >> > > >> > > > >> service
> >> > > >> >> > > >> > > > >> >> > health problems and pauses the update"
> and
> >> > > section
> >> > > >> 4
> >> > > >> >> > > becomes
> >> > > >> >> > > >> > the
> >> > > >> >> > > >> > > > >> current
> >> > > >> >> > > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not
> be
> >> > > received
> >> > > >> >> the
> >> > > >> >> > > >> > scheduler
> >> > > >> >> > > >> > > > >> pauses
> >> > > >> >> > > >> > > > >> >> > the update.").
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> > I agree that it's unsafe to resume
> >> updates
> >> > > after
> >> > > >> >> > > >> receiving a
> >> > > >> >> > > >> > > > >> heartbeat
> >> > > >> >> > > >> > > > >> >> > after previously pausing due to a missed
> >> > > >> heartbeat. In
> >> > > >> >> > > that
> >> > > >> >> > > >> > > > scenario
> >> > > >> >> > > >> > > > >> I'd
> >> > > >> >> > > >> > > > >> >> > think we'd want an explicit
> resumeJobUpdate.
> >> > If
> >> > > the
> >> > > >> >> > > scenario
> >> > > >> >> > > >> > > we're
> >> > > >> >> > > >> > > > >> trying
> >> > > >> >> > > >> > > > >> >> > to handle is *never* received a
> heartbeat,
> >> > > that's a
> >> > > >> >> > > separate
> >> > > >> >> > > >> > > > matter,
> >> > > >> >> > > >> > > > >> in
> >> > > >> >> > > >> > > > >> >> > that case unpausing upon receiving the
> first
> >> > > >> heartbeat
> >> > > >> >> > > would
> >> > > >> >> > > >> > make
> >> > > >> >> > > >> > > > >> sense,
> >> > > >> >> > > >> > > > >> >> > but it feels like that complicates things
> >> > quite
> >> > > a
> >> > > >> bit
> >> > > >> >> > > (now we
> >> > > >> >> > > >> > > need
> >> > > >> >> > > >> > > > to
> >> > > >> >> > > >> > > > >> >> > differentiate between heartbeat #1 and
> >> > heartbeat
> >> > > >> #N).
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill
> >> Farner <
> >> > > >> >> > > >> > wfar...@apache.org
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > > > >> wrote:
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> >> What is the guidance for deploying while
> >> the
> >> > > >> >> heartbeat
> >> > > >> >> > > >> service
> >> > > >> >> > > >> > > is
> >> > > >> >> > > >> > > > >> >> broken?
> >> > > >> >> > > >> > > > >> >> >> I think I know the answer, but it's
> >> important
> >> > > to
> >> > > >> >> spell
> >> > > >> >> > > out.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Create a new coordinated job update
> in a
> >> > > paused
> >> > > >> >> > > >> > > > >> (ROLL_FORWARD_PAUSED)
> >> > > >> >> > > >> > > > >> >> >> > state to avoid any progress until the
> >> first
> >> > > >> >> heartbeat
> >> > > >> >> > > call
> >> > > >> >> > > >> > > > arrives.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> I'm not sold on this being ultimately
> >> > > >> beneficial.  In
> >> > > >> >> > the
> >> > > >> >> > > >> > worst
> >> > > >> >> > > >> > > > case,
> >> > > >> >> > > >> > > > >> >> >> impact is still limited by the health
> check
> >> > > >> >> threshold.
> >> > > >> >> > > >> Seems
> >> > > >> >> > > >> > > like
> >> > > >> >> > > >> > > > >> >> >> premature optimization at best, and an
> odd
> >> > one
> >> > > if
> >> > > >> we
> >> > > >> >> > > proceed
> >> > > >> >> > > >> > > > without
> >> > > >> >> > > >> > > > >> a
> >> > > >> >> > > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate
> >> RPC.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> Allow resuming of the
> >> > > paused-due-to-no-heartbeat
> >> > > >> >> update
> >> > > >> >> > > via
> >> > > >> >> > > >> a
> >> > > >> >> > > >> > > > >> >> >> > resumeJobUpdate call.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> Are heartbeats required while rolling
> back?
> >> > If
> >> > > >> so,
> >> > > >> >> > that
> >> > > >> >> > > >> might
> >> > > >> >> > > >> > > > impact
> >> > > >> >> > > >> > > > >> >> the
> >> > > >> >> > > >> > > > >> >> >> design here and in other places.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> Allow resuming of the
> >> > > paused-due-to-no-heartbeat
> >> > > >> >> update
> >> > > >> >> > > via
> >> > > >> >> > > >> a
> >> > > >> >> > > >> > > > fresh
> >> > > >> >> > > >> > > > >> >> >> > heartbeatJobUpdate call.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> The heartbeatJobUpdate RPC serves as an
> >> ACK,
> >> > > but
> >> > > >> we
> >> > > >> >> > don't
> >> > > >> >> > > >> > have a
> >> > > >> >> > > >> > > > >> NACK.
> >> > > >> >> > > >> > > > >> >> If
> >> > > >> >> > > >> > > > >> >> >> we are going to let lack-of-ACK serve as
> >> the
> >> > > >> NACK, i
> >> > > >> >> > > don't
> >> > > >> >> > > >> > think
> >> > > >> >> > > >> > > > it's
> >> > > >> >> > > >> > > > >> >> safe
> >> > > >> >> > > >> > > > >> >> >> to resume when we receive another ACK.
> In
> >> > > other
> >> > > >> >> > words, a
> >> > > >> >> > > >> > > service
> >> > > >> >> > > >> > > > >> >> toggling
> >> > > >> >> > > >> > > > >> >> >> unhealthy might not be deemed safe to
> >> > proceed.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> Perhaps just sending OK (or a NOOP
> >> > equivalent)
> >> > > in
> >> > > >> >> case
> >> > > >> >> > > of a
> >> > > >> >> > > >> > > > >> user-paused
> >> > > >> >> > > >> > > > >> >> job
> >> > > >> >> > > >> > > > >> >> >> > update would make more sense as there
> is
> >> > > nothing
> >> > > >> >> > > >> monitoring
> >> > > >> >> > > >> > > > service
> >> > > >> >> > > >> > > > >> >> could
> >> > > >> >> > > >> > > > >> >> >> > do in that case. This should work fine
> >> with
> >> > > >> >> > > pause/resume
> >> > > >> >> > > >> > > > >> >> -aware/-agnostic
> >> > > >> >> > > >> > > > >> >> >> > monitoring service implementation.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> This seems reasonable to me - heartbeats
> >> for
> >> > a
> >> > > >> paused
> >> > > >> >> > > update
> >> > > >> >> > > >> > > > should
> >> > > >> >> > > >> > > > >> not
> >> > > >> >> > > >> > > > >> >> >> pose a risk, but can be safely ignored.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> -=Bill
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim
> >> > > >> Khutornenko <
> >> > > >> >> > > >> > > > >> ma...@apache.org>
> >> > > >> >> > > >> > > > >> >> >> wrote:
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Agreed. That would be a logical
> >> > > generalization
> >> > > >> of
> >> > > >> >> the
> >> > > >> >> > > post
> >> > > >> >> > > >> > > > failover
> >> > > >> >> > > >> > > > >> >> >> > behavior.
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >> > I have updated the above document with
> >> the
> >> > > >> >> following
> >> > > >> >> > > >> > changes:
> >> > > >> >> > > >> > > > >> >> >> > - Reply with PAUSED any time a job was
> >> > > paused by
> >> > > >> >> > user;
> >> > > >> >> > > >> > > > >> >> >> > - Start in paused state by default.
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM,
> Kevin
> >> > > Sweeney
> >> > > >> <
> >> > > >> >> > > >> > > > >> kevi...@apache.org>
> >> > > >> >> > > >> > > > >> >> >> > wrote:
> >> > > >> >> > > >> > > > >> >> >> > > The doc mentioned that the scheduler
> >> will
> >> > > >> start
> >> > > >> >> an
> >> > > >> >> > > >> update
> >> > > >> >> > > >> > > > >> subject to
> >> > > >> >> > > >> > > > >> >> >> the
> >> > > >> >> > > >> > > > >> >> >> > > heartbeat countdown, and if it
> doesn't
> >> > > >> receive a
> >> > > >> >> > > >> heartbeat
> >> > > >> >> > > >> > > it
> >> > > >> >> > > >> > > > >> will
> >> > > >> >> > > >> > > > >> >> >> pause
> >> > > >> >> > > >> > > > >> >> >> > > the update. Why not start with the
> >> update
> >> > > >> >> > > >> > > > >> >> paused-due-to-no-heartbeat to
> >> > > >> >> > > >> > > > >> >> >> > > fail-fast any connectivity issues
> >> between
> >> > > the
> >> > > >> >> > service
> >> > > >> >> > > >> > > > providing
> >> > > >> >> > > >> > > > >> the
> >> > > >> >> > > >> > > > >> >> >> > > heartbeats and the scheduler?
> >> > > >> >> > > >> > > > >> >> >> > >
> >> > > >> >> > > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM,
> Maxim
> >> > > >> >> > Khutornenko <
> >> > > >> >> > > >> > > > >> >> ma...@apache.org>
> >> > > >> >> > > >> > > > >> >> >> > > wrote:
> >> > > >> >> > > >> > > > >> >> >> > >
> >> > > >> >> > > >> > > > >> >> >> > >> Hi all,
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> We are proposing a new feature for
> the
> >> > > >> scheduler
> >> > > >> >> > > >> updater,
> >> > > >> >> > > >> > > > which
> >> > > >> >> > > >> > > > >> you
> >> > > >> >> > > >> > > > >> >> >> > >> may find helpful.
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> I have posted a brief feature
> summary
> >> > here:
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >>
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> >
> >> > > >> >> > > >>
> >> > > >> >> > >
> >> > > >> >> >
> >> > > >> >>
> >> > > >>
> >> > >
> >> >
> >>
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> Please, reply with your
> >> > > >> >> > feedback/concerns/comments.
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> Thanks,
> >> > > >> >> > > >> > > > >> >> >> > >> Maxim
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >>
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> >
> >> > > >> >> > > >>
> >> > > >> >> > >
> >> > > >> >> >
> >> > > >> >>
> >> > > >> >>
> >> > > >> >>
> >> > > >> >> --
> >> > > >> >> Kevin Sweeney
> >> > > >> >> @kts
> >> > > >> >>
> >> > > >>
> >> > >
> >> >
> >> >
> >> > --
> >> > -=Bill
> >> >
> >>
> >
> >
> >
> > --
> > Kevin Sweeney
> > @kts
>
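
The two kinds of pause settled on in the thread above (explicit pauses via pauseJobUpdate/resumeJobUpdate, and implicit, volatile pauses when no sufficiently fresh heartbeat is present) can be illustrated with a minimal Python sketch. This is not the Aurora implementation: the class names, method names, timeout value, and injectable clock are assumptions made for illustration, loosely modelled on the "heartbeat calls pulse and mutation operations check isAlive" description. The monitor starts empty, so a new update waits for its first heartbeat.

```python
import time
from typing import Callable, Dict, List, Optional, Set


class PulseMonitor:
    """Volatile, in-memory record of the latest heartbeat per update ID (hypothetical)."""

    def __init__(self, timeout_secs: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_secs = timeout_secs
        self._clock = clock
        # Starts empty (no synthetic heartbeat), so a new update is blocked
        # until its first real heartbeat arrives.
        self._last_pulse: Dict[str, float] = {}

    def pulse(self, update_id: str) -> None:
        """Record a heartbeat for the given update."""
        self._last_pulse[update_id] = self._clock()

    def is_alive(self, update_id: str) -> bool:
        """True only if a sufficiently fresh heartbeat exists."""
        last: Optional[float] = self._last_pulse.get(update_id)
        return last is not None and self._clock() - last <= self._timeout_secs


class Updater:
    """Toy updater that gates every action on pause state and heartbeats."""

    def __init__(self, monitor: PulseMonitor) -> None:
        self._monitor = monitor
        # Explicit pauses; a real scheduler would persist these (assumption:
        # kept in memory here only to keep the sketch short).
        self._paused: Set[str] = set()
        self.performed: List[str] = []

    def heartbeat_job_update(self, update_id: str) -> None:
        self._monitor.pulse(update_id)

    def pause_job_update(self, update_id: str) -> None:
        self._paused.add(update_id)

    def resume_job_update(self, update_id: str) -> None:
        self._paused.discard(update_id)

    def try_update_action(self, update_id: str, action: str) -> bool:
        """Perform the action only if not explicitly paused and heartbeats are fresh."""
        if update_id in self._paused:
            return False  # explicit pause: needs resume_job_update
        if not self._monitor.is_alive(update_id):
            return False  # implicit pause: clears when a heartbeat arrives
        self.performed.append(action)
        return True
```

In this sketch a stale heartbeat blocks progress without changing any stored state, so a fresh heartbeat alone lets the update proceed, whereas an explicit pause holds until resume_job_update is called regardless of heartbeats.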
