Re: Proposal: External Update Coordination

Bill Farner Mon, 13 Oct 2014 15:32:20 -0700

>
> the generic paused-waiting-for-heartbeat message will be quickly replaced
> by "high 502 rate"



From the doc Maxim linked, i don't believe that's the plan:

> External service detects service health problems and stops heartbeats
> Heartbeat timeout occurs. Scheduler pauses the update.
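
To make the difference concrete, here is a minimal sketch of the two pause paths discussed in this thread. It is an illustration only, not the Aurora scheduler's code: the class, the method names, the explicit `nack` call, and the `ROLLING_FORWARD` label are all assumptions, and the 60-second timeout is simply the figure used in the example message below.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed timeout; 60s is only the figure quoted in this thread.
HEARTBEAT_TIMEOUT_S = 60.0


@dataclass
class CoordinatedUpdate:
    """Toy model of an update gated on external heartbeats (hypothetical)."""

    last_heartbeat_at: float
    paused: bool = False
    pause_reason: Optional[str] = None

    def heartbeat(self, now: float) -> None:
        """ACK: the external service vouches for the update at time `now`."""
        self.last_heartbeat_at = now

    def nack(self, reason: str) -> None:
        """Hypothetical explicit NACK: pause at once and keep the caller's reason."""
        self.paused = True
        self.pause_reason = f"Heartbeat failed: {reason}"

    def check_timeout(self, now: float) -> None:
        """NACK-via-timeout: the scheduler only knows that heartbeats stopped."""
        if not self.paused and now - self.last_heartbeat_at > HEARTBEAT_TIMEOUT_S:
            self.paused = True
            self.pause_reason = f"Heartbeat not received in {HEARTBEAT_TIMEOUT_S:.0f}s"

    def status(self) -> str:
        # "ROLLING_FORWARD" is a placeholder label for an active update.
        return f"PAUSED - {self.pause_reason}" if self.paused else "ROLLING_FORWARD"


# Path 1: the external service goes silent; the reason is generic.
silent = CoordinatedUpdate(last_heartbeat_at=0.0)
silent.check_timeout(now=61.0)
print(silent.status())

# Path 2: the external service says why it wants the update paused.
explicit = CoordinatedUpdate(last_heartbeat_at=0.0)
explicit.nack("high 502 rate")
print(explicit.status())
```

With timeout-only semantics the scheduler never learns the reason, so the first status is all it can ever show.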


-=Bill

On Mon, Oct 13, 2014 at 3:13 PM, Kevin Sweeney <kevi...@apache.org> wrote:

> If the service sending the heartbeat RPC is working, the generic
> paused-waiting-for-heartbeat message will be quickly replaced by "high 502
> rate". If it's not working (or has connectivity issues) we at least won't
> give a false sense of progress.
>
> On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner <wfar...@apache.org> wrote:
>
> > Re: user experience, NACK-via-timeout fails here as well.
> >
> > "PAUSED - Heartbeat not received in 60s" is objectively worse than
> > "PAUSED - Heartbeat failed: high 502 rate".
> >
> > This is part of the impedance mismatch i'm calling out.
> >
> > -=Bill
> >
> > On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney <kevi...@apache.org> wrote:
> >
> > > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org> wrote:
> > >
> > > > What is the guidance for deploying while the heartbeat service is
> > > > broken? I think i know the answer, but it's important to spell out.
> > > >
> > > >
> > > >
> > > > > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > > > > state to avoid any progress until the first heartbeat call arrives.
> > > >
> > > >
> > > > I'm not sold on this being ultimately beneficial.  In the worst case,
> > > > impact is still limited by the health check threshold.  Seems like
> > > > premature optimization at best, and an odd one if we proceed without a
> > > > 'NACK' signal via the heartbeatJobUpdate RPC.
> > >
> > > The benefit is huge IMO for quickly detecting connectivity issues between
> > > the scheduler and the heartbeat service. There's a lot more information
> > > contained in the first successful heartbeat than the second, plus we can
> > > show the user a message like "PAUSED - Waiting for heartbeat". This is a
> > > better user experience than waiting for a timeout before revealing that
> > > progress will never be made.
> > >
> > >
> > > >
> > > >
> > > > > Allow resuming of the paused-due-to-no-heartbeat update via a
> > > > > resumeJobUpdate call.
> > > >
> > > >
> > > > Are heartbeats required while rolling back?  If so, that might impact the
> > > > design here and in other places.
> > > >
> > > > > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > > > > heartbeatJobUpdate call.
> > > >
> > > >
> > > > The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
> > > > we are going to let lack-of-ACK serve as the NACK, i don't think it's safe
> > > > to resume when we receive another ACK.  In other words, a service toggling
> > > > unhealthy might not be deemed safe to proceed.
> > > >
> > > > > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused job
> > > > > update would make more sense as there is nothing monitoring service could
> > > > > do in that case. This should work fine with pause/resume -aware/-agnostic
> > > > > monitoring service implementation.
> > > >
> > > >
> > > > This seems reasonable to me - heartbeats for a paused update should not
> > > > pose a risk, but can be safely ignored.
> > > >
> > > >
> > > >
> > > > -=Bill
> > > >
> > > > On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> > > >
> > > > > Agreed. That would be a logical generalization of the post failover
> > > > > behavior.
> > > > >
> > > > > I have updated the above document with the following changes:
> > > > > - Reply with PAUSED any time a job was paused by user;
> > > > > - Start in paused state by default.
> > > > >
> > > > > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org> wrote:
> > > > > > The doc mentioned that the scheduler will start an update subject to the
> > > > > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
> > > > > > the update. Why not start with the update paused-due-to-no-heartbeat to
> > > > > > fail-fast any connectivity issues between the service providing the
> > > > > > heartbeats and the scheduler?
> > > > > >
> > > > > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > > >> We are proposing a new feature for the scheduler updater, which you
> > > > > > >> may find helpful.
> > > > > >>
> > > > > > >> I have posted a brief feature summary here:
> > > > > >>
> > > > > >>
> > > > >
> > > >
> > >
> >
> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > > > >>
> > > > > > >> Please reply with your feedback/concerns/comments.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Maxim
> > > > > >>
> > > > >
> > > >
> > >
> >
>
