I'm actually beginning to think that an explicit state for "waiting for a heartbeat" might be easier to implement than volatile state. In a world where job updates are fully automated, I could see a bunch of users asking why a job update made no progress for a period of time, so it's really nice if the administrator doesn't need to dig into logs to piece history back together.
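
To make the idea concrete, here is a rough sketch in Python of what an explicit state could look like. The AWAITING_PULSE state name, the PulsedUpdate class, and the in-memory history list are all hypothetical and are not taken from the review; a real implementation would persist the transitions rather than keep them in memory.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class UpdateState(Enum):
    ROLLING_FORWARD = "ROLLING_FORWARD"
    # Hypothetical name for the explicit "waiting for a heartbeat" state.
    AWAITING_PULSE = "AWAITING_PULSE"


@dataclass
class PulsedUpdate:
    """Illustrative model of an update gated on periodic pulses."""

    pulse_timeout_secs: float
    last_pulse_secs: float = 0.0
    state: UpdateState = UpdateState.ROLLING_FORWARD
    # Every state change is recorded, so "why did this update stall?"
    # can be answered from history instead of from logs.
    history: List[Tuple[float, UpdateState]] = field(default_factory=list)

    def _transition(self, now: float, new_state: UpdateState) -> None:
        # Record only real changes, not repeated observations of one state.
        if new_state is not self.state:
            self.state = new_state
            self.history.append((now, new_state))

    def pulse(self, now: float) -> None:
        """Positive acknowledgement from the external monitoring service."""
        self.last_pulse_secs = now
        self._transition(now, UpdateState.ROLLING_FORWARD)

    def tick(self, now: float) -> bool:
        """Return True if the update may make progress at time `now`."""
        if now - self.last_pulse_secs > self.pulse_timeout_secs:
            self._transition(now, UpdateState.AWAITING_PULSE)
        return self.state is UpdateState.ROLLING_FORWARD
```

With a 10 second timeout, a pulse at t=0, and no further pulse until t=15, the history would hold one AWAITING_PULSE entry at the first tick past t=10 and one ROLLING_FORWARD entry at t=15, which is exactly the information that is lost if the blocked condition is only volatile.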

Maxim - can you elaborate on an example of the complexity you are concerned about?

-=Bill

On Thu, Jan 29, 2015 at 3:52 PM, Bill Farner <wfar...@apache.org> wrote:

> Here's the permalink to the thread in question:
> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCAOTkfX7x2oipk4ZFysoS0uWZRizOnKJA3y15pvEW5K4YnUHw-A%40mail.gmail.com%3E
>
> -=Bill
>
> On Thu, Jan 29, 2015 at 2:45 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
>> To add a bit of history to the topic, the current design has been
>> debated heavily here [1] and an active/lazy consensus was reached
>> around implementing the first iteration as lightweight as possible
>> without persisting any durable state.
>>
>> My take on this - we should proceed as originally proposed given the
>> following:
>>
>> - History of heartbeats is the only feature that requires state
>> persistence. Nothing else in the current design benefits from
>> persisting the state across restarts. I consider pulse history as a
>> nice to have rather than a requirement (unlike the current state
>> reporting, which is a must for troubleshooting and is tracked by
>> AURORA-1049).
>>
>> - State persistence will come with additional complexity of handling
>> corner cases (restart, abort, resume, etc.) that is not well justified
>> at this point given our total lack of experience with heartbeats.
>>
>> - Adding pulse history tracking can be done at later stages (as the
>> feature evolves and we gain more insight) without the adverse user
>> impact or technical debt. On the contrary, if attempted early the
>> overlooked details may hurt down the road by requiring Thrift schema
>> migration.
>>
>> Thanks,
>> Maxim
>>
>> [1] -
>> http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/browser
>>
>> On Thu, Jan 29, 2015 at 2:07 PM, David McLaughlin
>> <dmclaugh...@apache.org> wrote:
>> > Hi all,
>> >
>> > There is a little bit of a stalemate with regards to the
>> > implementation of the pulse RPC in the scheduler.
>> >
>> > As a brief overview of this feature - the pulse RPC is designed so
>> > that an external service can monitor the new in-scheduler updates
>> > reliably. This external service could be doing something like
>> > keeping an eye on application level alerts and pausing the update
>> > if things slip into a bad state. The purpose of the pulse is to
>> > make sure the update does not continue if it's not being monitored
>> > (i.e. the external service might have failed) by requiring positive
>> > acknowledgement at a given time interval.
>> >
>> > The implementation is in this review:
>> > https://reviews.apache.org/r/30225/
>> >
>> > The contention is around whether or not the "blocked" state
>> > deserves its own explicit state in the update state machine, and
>> > whether this is important enough to block the review. Currently any
>> > blocked updates are only known to the scheduler and the update will
>> > show as UPDATING/ROLLING_FORWARD in the UI and any history that the
>> > update was blocked will be lost - we only track current state.
>> >
>> > If you have any opinions on this feature, please feel free to chime
>> > in to the RB!
>> >
>> > Thanks,
>> > David