Re: Heartbeat mechanism auditing

Maxim Khutornenko Mon, 02 Feb 2015 12:47:48 -0800

Chatted with davmclau, wfarner, kevints and jcohen. The consensus is
to move forward with the state-based approach to ease up
troubleshooting from day one. Will update the RB unless there are
objections to this approach.


Brief design update summary:
- there will be 2 new job update states: ROLL_FORWARD_BLOCKED and
ROLL_BACK_BLOCKED to be applied on a missing pulse
- pulse timestamps are still stored in in-memory only
- on scheduler restart OR jobUpdateResume RPC call a coordinated
active update will enter a relevant _BLOCKED state IFF the pulse data
is missing. Otherwise -> the current behavior: ROLLING_[FORWARD |
BACK].

On Thu, Jan 29, 2015 at 4:19 PM, David McLaughlin <da...@dmclaughlin.com> wrote:
> On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>
>> To add a bit of history to the topic, the current design has been
>> debated heavily here [1] and an active/lazy consensus was reached
>> around implementing the first iteration as lightweight as possible
>> without persisting any durable state.
>>
>
> My understanding at least around not persisting durable state referred to
> not having the last pulse time persisted that would need to be written for
> n updates every x seconds. This is a very different concern to have a
> persistent state for blocked, which would only be written n times and only
> in (hopefully) rare failure cases.
>
>
>>
>> My take on this - we should proceed as originally proposed given the
>> following:
>>
>> - History of heartbeats is the only feature that requires state
>> persistence. Nothing else in the current design benefits from
>> persisting the state across restarts. I consider pulse history as a
>> nice to have rather than a requirement (unlike the current state
>> reporting, which is a must for troubleshooting and is racked by
>> AURORA-1049).
>>
>
> The feature is definitely not a blocker, but for me this implementation
> will make it unlikely that we will add it in the future - even now there is
> a ticket for adding current state but no ticket for implementing the
> historical states?
>
>
>>
>> - State persistence will come with additional complexity of handling
>> corner cases (restart, abort, resume, etc.) that is not well justified
>> at this point given our total lack of experience with heartbeats.
>>
>
> I actually think (purely from code implementation point of view) that
> having blocked represented by a state incurs way less complexity than
> having to get your head round this special 'state that isn't part of our
> state machine' blocked flag. AURORA-1049 also suggests we would modify the
> API to give a signal that UPDATE_ROLLING_FORWARD is actually not rolling
> forward at all, which seems more complex than just reading and returning
> state from storage.
>
> If the state is part of the state machine we already have a way of modeling
> the valid transitions. I would be interested in knowing these corner cases
> because as far as I can tell, it's just treated exactly like an explicit
> paused state. That way things like abort or resume can be initiated by a
> user (if they just want the job to continue and not wait for external
> monitoring service to be revived). The reason to have the separate state is
> so the pulse RPC can differentiate between explicit and non-explicit
> pauses, since pulse can initiate blocked -> rolling_forward transition.
>
> The corner cases that concern me are still not solved in the review
> implementation, as those mainly revolve around scheduler failover and the
> non-deterministic leader election time (do you give a grace period on
> leader election, or do you move all updates to blocked?). Again though, I
> would argue that moving all updates to blocked because the scheduler failed
> over would be a useful event to have in the update UI.
>
>
>> - Adding pulse history tracking can be done at later stages (as the
>> feature evolves and we gain more insight) without the adverse user
>> impact or technical debt. On the contrary, if attempted early the
>> overlooked details may hurt down the road by requiring Thrift schema
>> migration.
>
>
>> Thanks,
>> Maxim
>>
>> [1] -
>> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
>>
>> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
>> <dmclaugh...@apache.org> wrote:
>> > Hi all,
>> >
>> > There is a little bit of a stalemate with regards to the implementation
>> of
>> > the pulse RPC in the scheduler.
>> >
>> > As a brief overview of this feature - the pulse RPC is designed so that
>> an
>> > external service can monitor the new in-scheduler updates reliably. This
>> > external service could be doing something like keeping an eye on
>> > application level alerts and pausing the update if things slip into a bad
>> > state. The purpose of the pulse is to make sure the update does not
>> > continue if it's not being monitored (i.e. the external service might
>> have
>> > failed) by requiring positive acknowledgement at a given time interval.
>> >
>> > The implementation is in this review:
>> https://reviews.apache.org/r/30225/
>> >
>> > The contention is around whether or not the "blocked" state deserves its
>> > own explicit state in the update state machine, and whether this is
>> > important enough to block the review. Currently any blocked updates are
>> > only known to the scheduler and the update will show as
>> > UPDATING/ROLLING_FORWARD in the UI and any history that the update was
>> > blocked will be lost - we only track current state.
>> >
>> > If you have any opinions on this feature, please feel free to chime in to
>> > the RB!
>> >
>> > Thanks,
>> > David
>>

Re: Heartbeat mechanism auditing

Reply via email to