Here's the permalink to the thread in question: http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A%40mail.gmail.com%3E
-=Bill

On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> To add a bit of history to the topic, the current design has been
> debated heavily here [1] and an active/lazy consensus was reached
> around implementing the first iteration as lightweight as possible
> without persisting any durable state.
>
> My take on this - we should proceed as originally proposed given the
> following:
>
> - History of heartbeats is the only feature that requires state
> persistence. Nothing else in the current design benefits from
> persisting the state across restarts. I consider pulse history as a
> nice to have rather than a requirement (unlike the current state
> reporting, which is a must for troubleshooting and is tracked by
> AURORA-1049).
>
> - State persistence will come with additional complexity of handling
> corner cases (restart, abort, resume, etc.) that is not well justified
> at this point given our total lack of experience with heartbeats.
>
> - Adding pulse history tracking can be done at later stages (as the
> feature evolves and we gain more insight) without the adverse user
> impact or technical debt. On the contrary, if attempted early the
> overlooked details may hurt down the road by requiring Thrift schema
> migration.
>
> Thanks,
> Maxim
>
> [1] -
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
>
> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
> <dmclaugh...@apache.org> wrote:
> > Hi all,
> >
> > There is a little bit of a stalemate with regard to the implementation
> > of the pulse RPC in the scheduler.
> >
> > As a brief overview of this feature - the pulse RPC is designed so that
> > an external service can monitor the new in-scheduler updates reliably.
> > This external service could be doing something like keeping an eye on
> > application level alerts and pausing the update if things slip into a
> > bad state. The purpose of the pulse is to make sure the update does not
> > continue if it's not being monitored (i.e. the external service might
> > have failed) by requiring positive acknowledgement at a given time
> > interval.
> >
> > The implementation is in this review:
> > https://reviews.apache.org/r/30225/
> >
> > The contention is around whether or not the "blocked" state deserves its
> > own explicit state in the update state machine, and whether this is
> > important enough to block the review. Currently any blocked updates are
> > only known to the scheduler and the update will show as
> > UPDATING/ROLLING_FORWARD in the UI and any history that the update was
> > blocked will be lost - we only track current state.
> >
> > If you have any opinions on this feature, please feel free to chime in
> > on the RB!
> >
> > Thanks,
> > David
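
For readers unfamiliar with the idea, the pulse mechanism David describes can be sketched roughly as below. This is an illustration only, not the code in the review: the class name PulseMonitor, the methods pulse() and is_blocked(), and the timeout_secs parameter are all hypothetical, and starting in the blocked state before the first pulse is an assumption of this sketch. The state lives in memory only, in line with the lightweight, non-persistent first iteration Maxim argues for.

```python
import time
from typing import Callable, Optional


class PulseMonitor:
    """Hypothetical sketch: an update may proceed only while pulses keep arriving."""

    def __init__(self, timeout_secs: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        if timeout_secs <= 0:
            raise ValueError("timeout_secs must be positive")
        self._timeout_secs = timeout_secs
        self._clock = clock  # injectable so the timeout can be tested without sleeping
        # Assumption of this sketch: no pulse received yet means blocked.
        self._last_pulse: Optional[float] = None

    def pulse(self) -> None:
        """Record a positive acknowledgement from the external monitoring service."""
        self._last_pulse = self._clock()

    def is_blocked(self) -> bool:
        """True when no pulse has arrived within the timeout window."""
        if self._last_pulse is None:
            return True
        return self._clock() - self._last_pulse > self._timeout_secs
```

In this sketch the blocked condition is derived on demand from the last pulse time rather than stored as a separate state, so nothing survives a restart and no history of past blocks is kept. That mirrors the trade-off under discussion: only the current state is observable.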