Re: Heartbeat mechanism auditing

Bill Farner Thu, 29 Jan 2015 15:53:07 -0800

Here's the permalink to the thread in question:
http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A%40mail.gmail.com%3E


-=Bill

On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> To add a bit of history to the topic, the current design has been
> debated heavily here [1] and an active/lazy consensus was reached
> around implementing the first iteration as lightweight as possible
> without persisting any durable state.
>
> My take on this - we should proceed as originally proposed given the
> following:
>
> - History of heartbeats is the only feature that requires state
> persistence. Nothing else in the current design benefits from
> persisting the state across restarts. I consider pulse history a
> nice-to-have rather than a requirement (unlike the current state
> reporting, which is a must for troubleshooting and is tracked by
> AURORA-1049).
>
> - State persistence will come with additional complexity of handling
> corner cases (restart, abort, resume, etc.) that is not well justified
> at this point given our total lack of experience with heartbeats.
>
> - Adding pulse history tracking can be done at later stages (as the
> feature evolves and we gain more insight) without the adverse user
> impact or technical debt. On the contrary, if attempted early the
> overlooked details may hurt down the road by requiring Thrift schema
> migration.
>
> Thanks,
> Maxim
>
> [1] -
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
>
> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
> <dmclaugh...@apache.org> wrote:
> > Hi all,
> >
> > There is a little bit of a stalemate with regards to the implementation
> > of the pulse RPC in the scheduler.
> >
> > As a brief overview of this feature - the pulse RPC is designed so that
> > an external service can monitor the new in-scheduler updates reliably.
> > This external service could be doing something like keeping an eye on
> > application level alerts and pausing the update if things slip into a bad
> > state. The purpose of the pulse is to make sure the update does not
> > continue if it's not being monitored (i.e. the external service might
> > have failed) by requiring positive acknowledgement at a given time
> > interval.
> >
> > The implementation is in this review: https://reviews.apache.org/r/30225/
> >
> > The contention is around whether or not the "blocked" state deserves its
> > own explicit state in the update state machine, and whether this is
> > important enough to block the review. Currently any blocked updates are
> > only known to the scheduler and the update will show as
> > UPDATING/ROLLING_FORWARD in the UI and any history that the update was
> > blocked will be lost - we only track current state.
> >
> > If you have any opinions on this feature, please feel free to chime in to
> > the RB!
> >
> > Thanks,
> > David
>
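
For readers new to the thread, the pulse idea described above can be sketched in a few lines. This is only an illustration of the "block unless acknowledged within an interval" behaviour with no persisted state; it is not the code under review. The class name PulseMonitor, its methods, the injectable clock, and the choice to start blocked until the first pulse are all assumptions made for the example.

```python
import time
from typing import Callable, Optional


class PulseMonitor:
    """In-memory pulse tracker for one update (hypothetical name and API)."""

    def __init__(self, interval_secs: float,
                 clock: Callable[[], float] = time.monotonic):
        if interval_secs <= 0:
            raise ValueError("interval_secs must be positive")
        self._interval = interval_secs
        self._clock = clock
        # Assumption: nothing is persisted, so this resets on restart.
        self._last_pulse: Optional[float] = None

    def pulse(self) -> None:
        """Record a positive acknowledgement from the external service."""
        self._last_pulse = self._clock()

    def is_blocked(self) -> bool:
        """True if no pulse has arrived within the configured interval."""
        if self._last_pulse is None:
            # Assumption: an update stays blocked until its first pulse.
            return True
        return self._clock() - self._last_pulse > self._interval
```

An updater loop would check is_blocked() before acting on the next batch of instances and simply wait while it returns True. Because only the latest pulse timestamp is kept, no history of blocked periods survives, which is the trade-off being debated in this thread.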
