Re: Proposal: External Update Coordination

Kevin Sweeney Mon, 13 Oct 2014 15:04:51 -0700

On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org> wrote:


> What is the guidance for deploying while the heartbeat service is broken?
> I think I know the answer, but it's important to spell out.
>
>
>
> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > state to avoid any progress until the first heartbeat call arrives.
>
>
> I'm not sold on this being ultimately beneficial.  In the worst case,
> impact is still limited by the health check threshold.  Seems like
> premature optimization at best, and an odd one if we proceed without a
> 'NACK' signal via the heartbeatJobUpdate RPC.

The benefit is huge IMO for quickly detecting connectivity issues between
the scheduler and the heartbeat service. There's a lot more information
contained in the first successful heartbeat than the second, plus we can
show the user a message like "PAUSED - Waiting for heartbeat". This is a
better user experience than waiting for a timeout before revealing that
progress will never be made.
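
To make the start-paused idea concrete, here is a rough sketch of the behavior I have in mind. It is not scheduler code: the class, the method names, the ROLLING_FORWARD state name and the timeout handling are all made up for illustration, and only ROLL_FORWARD_PAUSED and the status message come from this thread. It deliberately does not resume on a later heartbeat after a timeout, since that question is still open.

```python
from enum import Enum


class UpdateState(Enum):
    ROLL_FORWARD_PAUSED = "ROLL_FORWARD_PAUSED"  # state name from this thread
    ROLLING_FORWARD = "ROLLING_FORWARD"          # assumed name for the active state


class CoordinatedUpdate:
    """Hypothetical model of a heartbeat-coordinated update (illustration only)."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        # Start paused so no progress is made until the first heartbeat arrives.
        self.state = UpdateState.ROLL_FORWARD_PAUSED
        self.last_heartbeat = None  # None means no heartbeat has been seen yet

    def heartbeat(self, now):
        """Record a heartbeat; only the very first one unblocks the update."""
        first = self.last_heartbeat is None
        self.last_heartbeat = now
        if first:
            self.state = UpdateState.ROLLING_FORWARD
        return self.state

    def tick(self, now):
        """Pause an active update whose last heartbeat is older than the timeout."""
        if (self.state is UpdateState.ROLLING_FORWARD
                and now - self.last_heartbeat > self.timeout_secs):
            self.state = UpdateState.ROLL_FORWARD_PAUSED
        return self.state

    def status_message(self):
        """Distinguish 'never heard from the service' from other pauses."""
        if (self.state is UpdateState.ROLL_FORWARD_PAUSED
                and self.last_heartbeat is None):
            return "PAUSED - Waiting for heartbeat"
        return self.state.value
```

With this shape, a broken link between the scheduler and the heartbeat service shows up immediately as "PAUSED - Waiting for heartbeat" instead of only after the first timeout expires.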


>
>
> > Allow resuming of the paused-due-to-no-heartbeat update via a
> > resumeJobUpdate call.
>
>
> Are heartbeats required while rolling back?  If so, that might impact the
> design here and in other places.
>
> > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > heartbeatJobUpdate call.
>
>
> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.  If
> we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> to resume when we receive another ACK.  In other words, a service toggling
> unhealthy might not be deemed safe to proceed.
>
> > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused job
> > update would make more sense, as there is nothing the monitoring service
> > could do in that case. This should work fine with both pause/resume-aware
> > and pause/resume-agnostic monitoring service implementations.
>
>
> This seems reasonable to me - heartbeats for a paused update should not
> pose a risk, but can be safely ignored.
>
>
>
> -=Bill
>
> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > Agreed. That would be a logical generalization of the post failover
> > behavior.
> >
> > I have updated the above document with the following changes:
> > - Reply with PAUSED any time a job was paused by user;
> > - Start in paused state by default.
> >
> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org>
> > wrote:
> > > The doc mentioned that the scheduler will start an update subject to the
> > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
> > > the update. Why not start with the update paused-due-to-no-heartbeat to
> > > fail-fast any connectivity issues between the service providing the
> > > heartbeats and the scheduler?
> > >
> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <ma...@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> We are proposing a new feature for the scheduler updater, which you
> > >> may find helpful.
> > >>
> > >> I have posted a brief feature summary here:
> > >>
> > >>
> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >>
> > >> Please, reply with your feedback/concerns/comments.
> > >>
> > >> Thanks,
> > >> Maxim
> > >>
> >
>
