To add a bit of history to the topic: the current design was debated heavily here [1], and an active/lazy consensus was reached to keep the first iteration as lightweight as possible, without persisting any durable state.

My take on this: we should proceed as originally proposed, given the following:

- History of heartbeats is the only feature that requires state persistence. Nothing else in the current design benefits from persisting state across restarts. I consider pulse history a nice-to-have rather than a requirement (unlike current state reporting, which is a must for troubleshooting and is tracked by AURORA-1049).
- State persistence will come with the additional complexity of handling corner cases (restart, abort, resume, etc.), which is not well justified at this point given our total lack of experience with heartbeats.
- Adding pulse history tracking can be done at later stages (as the feature evolves and we gain more insight) without adverse user impact or technical debt. On the contrary, if attempted early, overlooked details may hurt down the road by requiring a Thrift schema migration.

Thanks,
Maxim

[1] - http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser

On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin <dmclaugh...@apache.org> wrote:

> Hi all,
>
> There is a little bit of a stalemate with regards to the implementation of
> the pulse RPC in the scheduler.
>
> As a brief overview of this feature - the pulse RPC is designed so that an
> external service can monitor the new in-scheduler updates reliably. This
> external service could be doing something like keeping an eye on
> application level alerts and pausing the update if things slip into a bad
> state. The purpose of the pulse is to make sure the update does not
> continue if it's not being monitored (i.e. the external service might have
> failed) by requiring positive acknowledgement at a given time interval.
>
> The implementation is in this review: https://reviews.apache.org/r/30225/
>
> The contention is around whether or not the "blocked" state deserves its
> own explicit state in the update state machine, and whether this is
> important enough to block the review. Currently any blocked updates are
> only known to the scheduler and the update will show as
> UPDATING/ROLLING_FORWARD in the UI and any history that the update was
> blocked will be lost - we only track current state.
>
> If you have any opinions on this feature, please feel free to chime in to
> the RB!
>
> Thanks,
> David
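
To make the trade-off in this thread concrete, here is a minimal sketch of the stateless approach: the "blocked" condition is derived from the time of the last pulse, held in memory only, so nothing is persisted and no history is kept. This is not the code in the review. The class name `PulseMonitor`, its methods, and the choice to start out blocked until the first pulse arrives are all assumptions made for illustration.

```python
import time


class PulseMonitor:
    """Hypothetical in-memory pulse tracker; nothing here survives a restart."""

    def __init__(self, interval_secs, clock=time.monotonic):
        self.interval_secs = interval_secs
        self._clock = clock
        # Assumption: until the first pulse arrives, the update counts as blocked.
        self._last_pulse = None

    def pulse(self):
        """Record a positive acknowledgement from the external service."""
        self._last_pulse = self._clock()

    def is_blocked(self):
        """Return True if no pulse arrived within the last interval."""
        if self._last_pulse is None:
            return True
        return self._clock() - self._last_pulse > self.interval_secs
```

Under these assumptions, an updater loop would check `is_blocked()` before each step and simply wait while it returns True. Only the current state is observable, which matches the limitation described in the quoted message: once a pulse arrives, the fact that the update was ever blocked is gone.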