Chatted with davmclau, wfarner, kevints and jcohen. The consensus is to move forward with the state-based approach to ease up troubleshooting from day one. Will update the RB unless there are objections to this approach.
Brief design update summary: - there will be 2 new job update states: ROLL_FORWARD_BLOCKED and ROLL_BACK_BLOCKED to be applied on a missing pulse - pulse timestamps are still stored in in-memory only - on scheduler restart OR jobUpdateResume RPC call a coordinated active update will enter a relevant _BLOCKED state IFF the pulse data is missing. Otherwise -> the current behavior: ROLLING_[FORWARD | BACK]. On Thu, Jan 29, 2015 at 4:19 PM, David McLaughlin <da...@dmclaughlin.com> wrote: > On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org> wrote: > >> To add a bit of history to the topic, the current design has been >> debated heavily here [1] and an active/lazy consensus was reached >> around implementing the first iteration as lightweight as possible >> without persisting any durable state. >> > > My understanding at least around not persisting durable state referred to > not having the last pulse time persisted that would need to be written for > n updates every x seconds. This is a very different concern to have a > persistent state for blocked, which would only be written n times and only > in (hopefully) rare failure cases. > > >> >> My take on this - we should proceed as originally proposed given the >> following: >> >> - History of heartbeats is the only feature that requires state >> persistence. Nothing else in the current design benefits from >> persisting the state across restarts. I consider pulse history as a >> nice to have rather than a requirement (unlike the current state >> reporting, which is a must for troubleshooting and is racked by >> AURORA-1049). >> > > The feature is definitely not a blocker, but for me this implementation > will make it unlikely that we will add it in the future - even now there is > a ticket for adding current state but no ticket for implementing the > historical states? > > >> >> - State persistence will come with additional complexity of handling >> corner cases (restart, abort, resume, etc.) that is not well justified >> at this point given our total lack of experience with heartbeats. >> > > I actually think (purely from code implementation point of view) that > having blocked represented by a state incurs way less complexity than > having to get your head round this special 'state that isn't part of our > state machine' blocked flag. AURORA-1049 also suggests we would modify the > API to give a signal that UPDATE_ROLLING_FORWARD is actually not rolling > forward at all, which seems more complex than just reading and returning > state from storage. > > If the state is part of the state machine we already have a way of modeling > the valid transitions. I would be interested in knowing these corner cases > because as far as I can tell, it's just treated exactly like an explicit > paused state. That way things like abort or resume can be initiated by a > user (if they just want the job to continue and not wait for external > monitoring service to be revived). The reason to have the separate state is > so the pulse RPC can differentiate between explicit and non-explicit > pauses, since pulse can initiate blocked -> rolling_forward transition. > > The corner cases that concern me are still not solved in the review > implementation, as those mainly revolve around scheduler failover and the > non-deterministic leader election time (do you give a grace period on > leader election, or do you move all updates to blocked?). Again though, I > would argue that moving all updates to blocked because the scheduler failed > over would be a useful event to have in the update UI. > > >> - Adding pulse history tracking can be done at later stages (as the >> feature evolves and we gain more insight) without the adverse user >> impact or technical debt. On the contrary, if attempted early the >> overlooked details may hurt down the road by requiring Thrift schema >> migration. > > >> Thanks, >> Maxim >> >> [1] - >> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser >> >> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin >> <dmclaugh...@apache.org> wrote: >> > Hi all, >> > >> > There is a little bit of a stalemate with regards to the implementation >> of >> > the pulse RPC in the scheduler. >> > >> > As a brief overview of this feature - the pulse RPC is designed so that >> an >> > external service can monitor the new in-scheduler updates reliably. This >> > external service could be doing something like keeping an eye on >> > application level alerts and pausing the update if things slip into a bad >> > state. The purpose of the pulse is to make sure the update does not >> > continue if it's not being monitored (i.e. the external service might >> have >> > failed) by requiring positive acknowledgement at a given time interval. >> > >> > The implementation is in this review: >> https://reviews.apache.org/r/30225/ >> > >> > The contention is around whether or not the "blocked" state deserves its >> > own explicit state in the update state machine, and whether this is >> > important enough to block the review. Currently any blocked updates are >> > only known to the scheduler and the update will show as >> > UPDATING/ROLLING_FORWARD in the UI and any history that the update was >> > blocked will be lost - we only track current state. >> > >> > If you have any opinions on this feature, please feel free to chime in to >> > the RB! >> > >> > Thanks, >> > David >>