I'm not arguing against the merits of the approach. Just feeling out whether that should be done _after_ the rest of the heartbeat support. Seems like it can be cleanly added at the end to get something usable earlier.
-=Bill On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevi...@apache.org> wrote: > I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy > signal, and punting on a more complex NACK signal (which we'd have to > reliably persist). > > I think the only disagreement in this thread is whether the default state > for a new update should be running or waiting-for-heartbeat. I think > waiting for a heartbeat is not only a more correct implementation (no risk > of acting after a failover but before the heartbeat timeout) but simpler to > implement (initialize the PulseMonitor data structure as empty rather than > with a synthetic heartbeat). > > From an API consumer perspective the sequence is: > > 1. API client sends a startUpdate RPC to the scheduler > 2. API client receives an OK response, then arranges for something to call > heartbeat with that updateId on some interval > 3. Whatever is supposed to send heartbeats sends one immediately, then > starts sending them on some smaller interval > > Waiting for the first heartbeat ensures that this sequence has been > completed successfully, while not waiting for it only ensure that step 1 > has happened. > > > On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfar...@apache.org> wrote: > > > Wait - simpler solution than what? We're talking about not doing either. > > > > -=Bill > > > > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevi...@apache.org> > > wrote: > > > > > I think waiting for the first heartbeat before taking any action is the > > > simpler solution here as it allows the implementation to be entirely > > > soft-state and still catches the bugs I described. > > > > > > The implementation is just PulseMonitorImpl<UpdateId> - heartbeat calls > > > pulse and mutation operations check isAlive. I think the code might > > > actually work as-is. > > > > > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <ma...@apache.org> > > > wrote: > > > > > > > Pausing update on creation seems like a logical approach when dealing > > > > with inverted dependency model. I.e. updater is happy to act as long > > > > as it's greenlighted by the external signal. It's also aligned with a > > > > failover experience where coordinated updates are rehydrated in > paused > > > > state waiting for HB awakening. That said, I am OK punting it for the > > > > sake of simplicity for now. > > > > > > > > Kevin? > > > > > > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfar...@apache.org> > > > wrote: > > > > > If the goal is to reduce complexity now and add features later, why > > not > > > > > nuke both for now - kick off the update right away, and let lack of > > > > > heartbeats serve as a uniform "unknown or unhealthy" signal? > > > > > > > > > > -=Bill > > > > > > > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko < > ma...@apache.org > > > > > > > wrote: > > > > > > > > > >> I am still +1 on the idea to have default paused state on > creation. > > I > > > > >> think we could still differentiate between initially paused and > > timed > > > > >> out states internally by looking at pause reason. It's quite > > different > > > > >> if we want to store explicit NACK reasons from the external > service > > > > >> though. That would require persistence and a bit more complicated > > > > >> logic. > > > > >> > > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney < > kevi...@apache.org> > > > > wrote: > > > > >> > I like the idea of implementing this scheduler-side purely > through > > > > >> volatile > > > > >> > state, but the lack of feedback (generic vs specific error > > messages > > > > when > > > > >> an > > > > >> > update is paused) leaves something to be desired. Maybe we can > > > address > > > > >> that > > > > >> > with a metadata field in the initial call to startUpdate (with > an > > > > >> optional > > > > >> > link to a page where one can get more rich information about the > > > > state of > > > > >> > the monitor sending/not sending heartbeats). > > > > >> > > > > > >> > The main drawback is that we may have to wait a maximum of one > > > > heartbeat > > > > >> > interval to find out that an update should be paused. > > > > >> > > > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko < > > > ma...@apache.org> > > > > >> wrote: > > > > >> > > > > > >> >> The main reason I preferred the lack-of-ACK approach over an > > > explicit > > > > >> >> NACK one is simplicity. As Joshua pointed out there is more > state > > > to > > > > >> >> handle in that case. The lack-of-ACK model can be completely > > > > >> >> implemented in volatile memory sidestepping the persistent > > storage > > > > >> >> entirely. With the NACK we would need to reliably persist > > external > > > > >> >> service call reasons to survive scheduler failovers. Not a huge > > > > >> >> challenge but something to keep in mind. > > > > >> >> > > > > >> >> I still think the simplicity/reliability tradeoff is acceptable > > > here > > > > >> >> if we rely on external service to abort heartbeats in case of a > > > > health > > > > >> >> alert fired. This can be explicitly documented as an external > > > > >> >> integration requirement. However, If the consensus is to go a > > more > > > > >> >> reliable (though more complicated) NACK route I am happy to > > > > reconsider > > > > >> >> the current proposal. > > > > >> >> > > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen < > > > > jco...@twopensource.com> > > > > >> >> wrote: > > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but we don't > > have a > > > > >> NACK. > > > > >> >> If > > > > >> >> > we are going to let lack-of-ACK serve as the NACK, i don't > > think > > > > it's > > > > >> >> safe > > > > >> >> > to resume when we receive another ACK. In other words, a > > service > > > > >> >> toggling > > > > >> >> > unhealthy might not be deemed safe to proceed." > > > > >> >> > > > > > >> >> > Lack-of-ACK is the scenario where connectivity between the > > > monitor > > > > and > > > > >> >> the > > > > >> >> > scheduler is unavailable. Shouldn't the NACK scenario > > (everything > > > > is > > > > >> not > > > > >> >> > ok!) be handled by the monitoring service triggering an > > explicit > > > > >> pause? > > > > >> >> > I.e. section 2 should be updated to say "External service > > detects > > > > >> service > > > > >> >> > health problems and pauses the update" and section 4 becomes > > the > > > > >> current > > > > >> >> > section 2 (i.e. "Should a heartbeat not be received the > > scheduler > > > > >> pauses > > > > >> >> > the update."). > > > > >> >> > > > > > >> >> > I agree that it's unsafe to to resume updates after > receiving a > > > > >> heartbeat > > > > >> >> > after previously pausing due to a missed heartbeat. In that > > > > scenario > > > > >> I'd > > > > >> >> > think we'd want an explicit resumeJobUpdate. If the scenario > > > we're > > > > >> trying > > > > >> >> > to handle is *never* received a heartbeat, that's a separate > > > > matter, > > > > >> in > > > > >> >> > that case unpausing upon receiving the first heartbeat would > > make > > > > >> sense, > > > > >> >> > but it feels like that complicates things quite a bit (now we > > > need > > > > to > > > > >> >> > differentiate between heartbeat #1 and hearbeat #N). > > > > >> >> > > > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner < > > wfar...@apache.org > > > > > > > > >> wrote: > > > > >> >> > > > > > >> >> >> What is the guidance for deploying while the heartbeat > service > > > is > > > > >> >> broken? > > > > >> >> >> I think i know the answer, but it's important to spell out. > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> > Create a new coordinated job update in a paused > > > > >> (ROLL_FORWARD_PAUSED) > > > > >> >> >> > state to avoid any progress until the first heartbeat call > > > > arrives. > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> I'm not sold on this being ultimately beneficial. In the > > worst > > > > case, > > > > >> >> >> impact is still limited by the health check threshold. > Seems > > > like > > > > >> >> >> premature optimization at best, and an odd one if we proceed > > > > without > > > > >> a > > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC. > > > > >> >> >> > > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update via > a > > > > >> >> >> > resumeJobUpdate call. > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> Are heartbeats required while rolling back? If so, that > might > > > > impact > > > > >> >> the > > > > >> >> >> design here and in other places. > > > > >> >> >> > > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat update via > a > > > > fresh > > > > >> >> >> > heartbeatJobUpdate call. > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but we don't > > have a > > > > >> NACK. > > > > >> >> If > > > > >> >> >> we are going to let lack-of-ACK serve as the NACK, i don't > > think > > > > it's > > > > >> >> safe > > > > >> >> >> to resume when we receive another ACK. In other words, a > > > service > > > > >> >> toggling > > > > >> >> >> unhealthy might not be deemed safe to proceed. > > > > >> >> >> > > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in case of a > > > > >> user-paused > > > > >> >> job > > > > >> >> >> > update would make more sense as there is nothing > monitoring > > > > service > > > > >> >> could > > > > >> >> >> > do in that case. This should work fine with pause/resume > > > > >> >> -aware/-agnostic > > > > >> >> >> > monitoring service implementation. > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> This seems reasonable to me - heartbeats for a paused update > > > > should > > > > >> not > > > > >> >> >> pose a risk, but can be safely ignored. > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> > > > > >> >> >> -=Bill > > > > >> >> >> > > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko < > > > > >> ma...@apache.org> > > > > >> >> >> wrote: > > > > >> >> >> > > > > >> >> >> > Agreed. That would be a logical generalization of the post > > > > failover > > > > >> >> >> > behavior. > > > > >> >> >> > > > > > >> >> >> > I have updated the above document with the following > > changes: > > > > >> >> >> > - Reply with PAUSED any time a job was paused by user; > > > > >> >> >> > - Start in paused state by default. > > > > >> >> >> > > > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney < > > > > >> kevi...@apache.org> > > > > >> >> >> > wrote: > > > > >> >> >> > > The doc mentioned that the scheduler will start an > update > > > > >> subject to > > > > >> >> >> the > > > > >> >> >> > > heartbeat countdown, and if it doesn't receive a > heartbeat > > > it > > > > >> will > > > > >> >> >> pause > > > > >> >> >> > > the update. Why not start with the update > > > > >> >> paused-due-to-no-heartbeat to > > > > >> >> >> > > fail-fast any connectivity issues between the service > > > > providing > > > > >> the > > > > >> >> >> > > heartbeats and the scheduler? > > > > >> >> >> > > > > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko < > > > > >> >> ma...@apache.org> > > > > >> >> >> > > wrote: > > > > >> >> >> > > > > > > >> >> >> > >> Hi all, > > > > >> >> >> > >> > > > > >> >> >> > >> We are proposing a new feature for the scheduler > updater, > > > > which > > > > >> you > > > > >> >> >> > >> may find helpful. > > > > >> >> >> > >> > > > > >> >> >> > >> I have posed a brief feature summary here: > > > > >> >> >> > >> > > > > >> >> >> > >> > > > > >> >> >> > > > > > >> >> >> > > > > >> >> > > > > >> > > > > > > > > > > https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md > > > > >> >> >> > >> > > > > >> >> >> > >> Please, reply with your feedback/concerns/comments. > > > > >> >> >> > >> > > > > >> >> >> > >> Thanks, > > > > >> >> >> > >> Maxim > > > > >> >> >> > >> > > > > >> >> >> > > > > > >> >> >> > > > > >> >> > > > > >> > > > > > > > > > >