If the goal is to reduce complexity now and add features later, why not nuke both for now - kick off the update right away, and let lack of heartbeats serve as a uniform "unknown or unhealthy" signal?
-=Bill

On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <ma...@apache.org> wrote:

> I am still +1 on the idea to have default paused state on creation. I
> think we could still differentiate between initially paused and timed
> out states internally by looking at pause reason. It's quite different
> if we want to store explicit NACK reasons from the external service
> though. That would require persistence and a bit more complicated
> logic.
>
> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <kevi...@apache.org> wrote:
>
> > I like the idea of implementing this scheduler-side purely through
> > volatile state, but the lack of feedback (generic vs specific error
> > messages when an update is paused) leaves something to be desired.
> > Maybe we can address that with a metadata field in the initial call
> > to startUpdate (with an optional link to a page where one can get
> > richer information about the state of the monitor sending/not
> > sending heartbeats).
> >
> > The main drawback is that we may have to wait a maximum of one
> > heartbeat interval to find out that an update should be paused.
> >
> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <ma...@apache.org>
> > wrote:
> >
> >> The main reason I preferred the lack-of-ACK approach over an explicit
> >> NACK one is simplicity. As Joshua pointed out there is more state to
> >> handle in that case. The lack-of-ACK model can be completely
> >> implemented in volatile memory sidestepping the persistent storage
> >> entirely. With the NACK we would need to reliably persist external
> >> service call reasons to survive scheduler failovers. Not a huge
> >> challenge but something to keep in mind.
> >>
> >> I still think the simplicity/reliability tradeoff is acceptable here
> >> if we rely on the external service to stop sending heartbeats when a
> >> health alert fires. This can be explicitly documented as an external
> >> integration requirement. However, if the consensus is to go a more
> >> reliable (though more complicated) NACK route I am happy to reconsider
> >> the current proposal.
> >>
> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <jco...@twopensource.com>
> >> wrote:
> >>
> >> > "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a
> >> > NACK. If we are going to let lack-of-ACK serve as the NACK, I don't
> >> > think it's safe to resume when we receive another ACK. In other
> >> > words, a service toggling unhealthy might not be deemed safe to
> >> > proceed."
> >> >
> >> > Lack-of-ACK is the scenario where connectivity between the monitor
> >> > and the scheduler is unavailable. Shouldn't the NACK scenario
> >> > (everything is not ok!) be handled by the monitoring service
> >> > triggering an explicit pause? I.e. section 2 should be updated to
> >> > say "External service detects service health problems and pauses
> >> > the update" and section 4 becomes the current section 2 (i.e.
> >> > "Should a heartbeat not be received the scheduler pauses the
> >> > update.").
> >> >
> >> > I agree that it's unsafe to resume updates after receiving a
> >> > heartbeat after previously pausing due to a missed heartbeat. In
> >> > that scenario I'd think we'd want an explicit resumeJobUpdate. If
> >> > the scenario we're trying to handle is *never* received a
> >> > heartbeat, that's a separate matter, in that case unpausing upon
> >> > receiving the first heartbeat would make sense, but it feels like
> >> > that complicates things quite a bit (now we need to differentiate
> >> > between heartbeat #1 and heartbeat #N).
> >> >
> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org>
> >> > wrote:
> >> >
> >> >> What is the guidance for deploying while the heartbeat service is
> >> >> broken? I think I know the answer, but it's important to spell out.
> >> >>
> >> >> > Create a new coordinated job update in a paused
> >> >> > (ROLL_FORWARD_PAUSED) state to avoid any progress until the
> >> >> > first heartbeat call arrives.
> >> >>
> >> >> I'm not sold on this being ultimately beneficial. In the worst
> >> >> case, impact is still limited by the health check threshold. Seems
> >> >> like premature optimization at best, and an odd one if we proceed
> >> >> without a 'NACK' signal via the heartbeatJobUpdate RPC.
> >> >>
> >> >> > Allow resuming of the paused-due-to-no-heartbeat update via a
> >> >> > resumeJobUpdate call.
> >> >>
> >> >> Are heartbeats required while rolling back? If so, that might
> >> >> impact the design here and in other places.
> >> >>
> >> >> > Allow resuming of the paused-due-to-no-heartbeat update via a
> >> >> > fresh heartbeatJobUpdate call.
> >> >>
> >> >> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a
> >> >> NACK. If we are going to let lack-of-ACK serve as the NACK, I
> >> >> don't think it's safe to resume when we receive another ACK. In
> >> >> other words, a service toggling unhealthy might not be deemed safe
> >> >> to proceed.
> >> >>
> >> >> > Perhaps just sending OK (or a NOOP equivalent) in case of a
> >> >> > user-paused job update would make more sense as there is nothing
> >> >> > monitoring service could do in that case. This should work fine
> >> >> > with pause/resume -aware/-agnostic monitoring service
> >> >> > implementation.
> >> >>
> >> >> This seems reasonable to me - heartbeats for a paused update
> >> >> should not pose a risk, but can be safely ignored.
> >> >>
> >> >> -=Bill
> >> >>
> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko
> >> >> <ma...@apache.org> wrote:
> >> >>
> >> >> > Agreed. That would be a logical generalization of the post
> >> >> > failover behavior.
> >> >> >
> >> >> > I have updated the above document with the following changes:
> >> >> > - Reply with PAUSED any time a job was paused by user;
> >> >> > - Start in paused state by default.
> >> >> >
> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney
> >> >> > <kevi...@apache.org> wrote:
> >> >> >
> >> >> > > The doc mentioned that the scheduler will start an update
> >> >> > > subject to the heartbeat countdown, and if it doesn't receive
> >> >> > > a heartbeat it will pause the update. Why not start with the
> >> >> > > update paused-due-to-no-heartbeat to fail-fast any
> >> >> > > connectivity issues between the service providing the
> >> >> > > heartbeats and the scheduler?
> >> >> > >
> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko
> >> >> > > <ma...@apache.org> wrote:
> >> >> > >
> >> >> > >> Hi all,
> >> >> > >>
> >> >> > >> We are proposing a new feature for the scheduler updater,
> >> >> > >> which you may find helpful.
> >> >> > >>
> >> >> > >> I have posted a brief feature summary here:
> >> >> > >>
> >> >> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> >> > >>
> >> >> > >> Please reply with your feedback/concerns/comments.
> >> >> > >>
> >> >> > >> Thanks,
> >> >> > >> Maxim
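
For readers following the thread, the lack-of-ACK model discussed above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the scheduler's actual implementation: the class name HeartbeatTracker, its methods, and the PAUSED_NO_HEARTBEAT state are all hypothetical, and time is passed in explicitly to keep the example deterministic. It keeps everything in volatile memory, pauses an update once a heartbeat interval elapses without an ACK, and requires an explicit resume rather than resuming on the next heartbeat.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    # Hypothetical state names, chosen for this sketch only.
    ROLLING_FORWARD = "ROLLING_FORWARD"
    PAUSED_NO_HEARTBEAT = "PAUSED_NO_HEARTBEAT"


@dataclass
class HeartbeatTracker:
    """Tracks heartbeats purely in memory (nothing is persisted)."""

    interval_secs: float
    last_seen: dict = field(default_factory=dict)
    state: dict = field(default_factory=dict)

    def start_update(self, update_id, now):
        # The update starts right away; the countdown begins immediately.
        self.last_seen[update_id] = now
        self.state[update_id] = State.ROLLING_FORWARD

    def heartbeat(self, update_id, now):
        # An ACK refreshes the timestamp but never resumes a paused update.
        self.last_seen[update_id] = now
        return self.state[update_id]

    def check(self, now):
        # Lack of ACK within the interval is treated as "unknown or unhealthy".
        for update_id, seen in self.last_seen.items():
            if now - seen > self.interval_secs:
                self.state[update_id] = State.PAUSED_NO_HEARTBEAT

    def resume(self, update_id, now):
        # Stands in for an explicit resumeJobUpdate call.
        self.last_seen[update_id] = now
        self.state[update_id] = State.ROLLING_FORWARD


tracker = HeartbeatTracker(interval_secs=10)
tracker.start_update("update-1", now=0)
tracker.check(now=11)                          # interval missed: paused
status = tracker.heartbeat("update-1", now=12) # still paused after an ACK
tracker.resume("update-1", now=13)             # explicit resume required
```

Because the state lives only in memory, a tracker like this would restart its countdown after a scheduler failover, which is the simplicity/reliability tradeoff raised in the thread.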