Re: Heartbeat mechanism auditing

David McLaughlin Thu, 29 Jan 2015 16:22:09 -0800

On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> To add a bit of history to the topic, the current design has been
> debated heavily here [1] and an active/lazy consensus was reached
> around implementing the first iteration as lightweight as possible
> without persisting any durable state.
>

My understanding at least around not persisting durable state referred to
not having the last pulse time persisted that would need to be written for
n updates every x seconds. This is a very different concern to have a
persistent state for blocked, which would only be written n times and only
in (hopefully) rare failure cases.

>
> My take on this - we should proceed as originally proposed given the
> following:
>
> - History of heartbeats is the only feature that requires state
> persistence. Nothing else in the current design benefits from
> persisting the state across restarts. I consider pulse history as a
> nice to have rather than a requirement (unlike the current state
> reporting, which is a must for troubleshooting and is racked by
> AURORA-1049).
>

The feature is definitely not a blocker, but for me this implementation
will make it unlikely that we will add it in the future - even now there is
a ticket for adding current state but no ticket for implementing the
historical states?

>
> - State persistence will come with additional complexity of handling
> corner cases (restart, abort, resume, etc.) that is not well justified
> at this point given our total lack of experience with heartbeats.
>

I actually think (purely from code implementation point of view) that
having blocked represented by a state incurs way less complexity than
having to get your head round this special 'state that isn't part of our
state machine' blocked flag. AURORA-1049 also suggests we would modify the
API to give a signal that UPDATE_ROLLING_FORWARD is actually not rolling
forward at all, which seems more complex than just reading and returning
state from storage.

If the state is part of the state machine we already have a way of modeling
the valid transitions. I would be interested in knowing these corner cases
because as far as I can tell, it's just treated exactly like an explicit
paused state. That way things like abort or resume can be initiated by a
user (if they just want the job to continue and not wait for external
monitoring service to be revived). The reason to have the separate state is
so the pulse RPC can differentiate between explicit and non-explicit
pauses, since pulse can initiate blocked -> rolling_forward transition.

The corner cases that concern me are still not solved in the review
implementation, as those mainly revolve around scheduler failover and the
non-deterministic leader election time (do you give a grace period on
leader election, or do you move all updates to blocked?). Again though, I
would argue that moving all updates to blocked because the scheduler failed
over would be a useful event to have in the update UI.

> - Adding pulse history tracking can be done at later stages (as the
> feature evolves and we gain more insight) without the adverse user
> impact or technical debt. On the contrary, if attempted early the
> overlooked details may hurt down the road by requiring Thrift schema
> migration.

> Thanks,
> Maxim
>
> [1] -
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
>
> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
> <dmclaugh...@apache.org> wrote:
> > Hi all,
> >
> > There is a little bit of a stalemate with regards to the implementation
> of
> > the pulse RPC in the scheduler.
> >
> > As a brief overview of this feature - the pulse RPC is designed so that
> an
> > external service can monitor the new in-scheduler updates reliably. This
> > external service could be doing something like keeping an eye on
> > application level alerts and pausing the update if things slip into a bad
> > state. The purpose of the pulse is to make sure the update does not
> > continue if it's not being monitored (i.e. the external service might
> have
> > failed) by requiring positive acknowledgement at a given time interval.
> >
> > The implementation is in this review:
> https://reviews.apache.org/r/30225/
> >
> > The contention is around whether or not the "blocked" state deserves its
> > own explicit state in the update state machine, and whether this is
> > important enough to block the review. Currently any blocked updates are
> > only known to the scheduler and the update will show as
> > UPDATING/ROLLING_FORWARD in the UI and any history that the update was
> > blocked will be lost - we only track current state.
> >
> > If you have any opinions on this feature, please feel free to chime in to
> > the RB!
> >
> > Thanks,
> > David
>

Re: Heartbeat mechanism auditing

Reply via email to