Re: Heartbeat mechanism auditing

Bill Farner Thu, 29 Jan 2015 16:13:43 -0800

I'm actually beginning to think that an explicit state for "waiting for a
heartbeat" might be easier to implement than volatile state.  In a world
where job updates are fully automated, I could see a bunch of users asking
why a job update made no progress for a period of time, so it's really nice
if the administrator doesn't need to dig into logs to piece history back
together.


Maxim - can you elaborate on an example of the complexity you are concerned
about?


-=Bill

On Thu, Jan 29, 2015 at 3:52 PM, Bill Farner <wfar...@apache.org> wrote:

> Here's the permalink to the thread in question:
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A%40mail.gmail.com%3E
>
> -=Bill
>
> On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
>> To add a bit of history to the topic, the current design has been
>> debated heavily here [1] and an active/lazy consensus was reached
>> around implementing the first iteration as lightweight as possible
>> without persisting any durable state.
>>
>> My take on this - we should proceed as originally proposed given the
>> following:
>>
>> - History of heartbeats is the only feature that requires state
>> persistence. Nothing else in the current design benefits from
>> persisting the state across restarts. I consider pulse history as a
>> nice to have rather than a requirement (unlike the current state
>> reporting, which is a must for troubleshooting and is tracked by
>> AURORA-1049).
>>
>> - State persistence will come with additional complexity of handling
>> corner cases (restart, abort, resume, etc.) that is not well justified
>> at this point given our total lack of experience with heartbeats.
>>
>> - Adding pulse history tracking can be done at later stages (as the
>> feature evolves and we gain more insight) without the adverse user
>> impact or technical debt. On the contrary, if attempted early the
>> overlooked details may hurt down the road by requiring Thrift schema
>> migration.
>>
>> Thanks,
>> Maxim
>>
>> [1] -
>> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
>>
>> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
>> <dmclaugh...@apache.org> wrote:
>> > Hi all,
>> >
>> > There is a little bit of a stalemate with regards to the
>> > implementation of the pulse RPC in the scheduler.
>> >
>> > As a brief overview of this feature - the pulse RPC is designed so that
>> > an external service can monitor the new in-scheduler updates reliably.
>> > This external service could be doing something like keeping an eye on
>> > application level alerts and pausing the update if things slip into a
>> > bad state. The purpose of the pulse is to make sure the update does not
>> > continue if it's not being monitored (i.e. the external service might
>> > have failed) by requiring positive acknowledgement at a given time
>> > interval.
>> >
>> > The implementation is in this review: https://reviews.apache.org/r/30225/
>> >
>> > The contention is around whether or not the "blocked" state deserves its
>> > own explicit state in the update state machine, and whether this is
>> > important enough to block the review. Currently any blocked updates are
>> > only known to the scheduler and the update will show as
>> > UPDATING/ROLLING_FORWARD in the UI and any history that the update was
>> > blocked will be lost - we only track current state.
>> >
>> > If you have any opinions on this feature, please feel free to chime in
>> > to the RB!
>> >
>> > Thanks,
>> > David
>>
>
>
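
For anyone following along, here is a rough sketch of the "explicit state"
option discussed above. It is a toy model only, not the code in review 30225:
the class name, the ROLL_FORWARD_AWAITING_PULSE state name, and the timeout
handling are all assumptions made up for illustration. The point it tries to
show is that once "blocked" is a real state, every transition into and out of
it lands in the update history, whereas a volatile flag would leave no trace.

```python
import enum


class UpdateState(enum.Enum):
    ROLLING_FORWARD = "ROLLING_FORWARD"
    # Hypothetical explicit "blocked" state; the name is an assumption.
    ROLL_FORWARD_AWAITING_PULSE = "ROLL_FORWARD_AWAITING_PULSE"


class PulsedUpdate:
    """Toy model of an update that needs a pulse every `timeout_s` seconds."""

    def __init__(self, timeout_s, now):
        self.timeout_s = timeout_s
        self.last_pulse = now
        self.state = UpdateState.ROLLING_FORWARD
        # Recorded (timestamp, state) transitions, i.e. the audit trail.
        self.history = [(now, self.state)]

    def _transition(self, state, now):
        # Only record real state changes, not repeated ticks or pulses.
        if state is not self.state:
            self.state = state
            self.history.append((now, state))

    def pulse(self, now):
        """Positive acknowledgement from the external monitoring service."""
        self.last_pulse = now
        self._transition(UpdateState.ROLLING_FORWARD, now)

    def tick(self, now):
        """Called periodically; returns True if the update may proceed."""
        if now - self.last_pulse > self.timeout_s:
            self._transition(UpdateState.ROLL_FORWARD_AWAITING_PULSE, now)
            return False
        return True
```

With a 60 second timeout, a tick at t=61 with no pulse moves the update into
the awaiting-pulse state, and a later pulse moves it back; both transitions
stay in `history`, so nobody has to dig through logs to explain the stall.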
