+1

-=Bill
On Thu, Oct 16, 2014 at 11:21 AM, Maxim Khutornenko <ma...@apache.org> wrote:

> Correct. The presence of the SessionKey does indeed mean that
> heartbeats are going to be authenticated. Given that external service
> has to solve authentication story to use pauseJobUpdate anyway, having
> heartbeats authenticated seems like a natural progression. Also, given
> our current admin thrift interface it's easier to have an
> authenticated RPC in it rather than not (scheduler_client.py
> automatically injects SessionKey arg into all methods that are not
> part of ReadOnlyScheduler interface).
>
> On Thu, Oct 16, 2014 at 10:52 AM, Kevin Sweeney
> <kswee...@twitter.com.invalid> wrote:
>
> > I inferred that authentication was required due to the presence of a
> > SessionKey in the RPC. Of course any authentication mechanism here could
> > have serious scaling issues (barring something like HTTP basic auth in
> > memory)
> >
> > On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen <jco...@twopensource.com>
> > wrote:
> >
> >> What are our thoughts about authentication with regards to heartbeats? It
> >> seems like they should be authenticated since there does exist the
> >> potential for a malicious actor to send its own heartbeats even if the real
> >> monitoring service has detected a problem and ceased sending heartbeats.
> >> I'm not sure exactly how large the attack surface is (if the service is
> >> truly down the scheduler would detect that and roll back the update
> >> regardless), but I think it's worth discussing as we work on the initial
> >> design.
> >>
> >> On Wed, Oct 15, 2014 at 7:49 PM, Bill Farner <wfar...@apache.org> wrote:
> >>
> >> > David - the plan is to synthesize the waiting state. Exactly how is not
> >> > yet certain.
> >> >
> >> > On Wednesday, October 15, 2014, Maxim Khutornenko <ma...@apache.org>
> >> > wrote:
> >> >
> >> > > It is certainly possible to add new state or a status message but I
> >> > > don't think it's a blocker for the first iteration. Provided there is
> >> > > enough demand a state/message could be synthesized during the 'get'
> >> > > call based on the volatile state.
> >> > >
> >> > > On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin
> >> > > <da...@dmclaughlin.com> wrote:
> >> > >
> >> > > > +1 for pause being explicit RPC pauses, but does it really add complexity
> >> > > > to just add a new state (WAITING?) when no heartbeat is sent? Not being
> >> > > > able to see that an update was blocked because of a lack of heartbeat
> >> > > > seems like a missing feature.
> >> > > >
> >> > > > On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <ma...@apache.org>
> >> > > > wrote:
> >> > > >
> >> > > >> +1. Updated the doc:
> >> > > >>
> >> > > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > > >>
> >> > > >> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org>
> >> > > >> wrote:
> >> > > >>
> >> > > >> > +1 to the scheduler not proceeding on an update when heartbeats are
> >> > > >> > absent, and requiring the heartbeat service to explicitly call
> >> > > >> > pauseJobUpdate when it detects problems.
> >> > > >> >
> >> > > >> > -=Bill
> >> > > >> >
> >> > > >> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney
> >> > > >> > <kswee...@twitter.com.invalid> wrote:
> >> > > >> >
> >> > > >> >> Chatted with Maxim and Bill, I think we figured it out
> >> > > >> >>
> >> > > >> >> I think the confusion stems from the fact that there are two types of
> >> > > >> >> pauses in this system, explicit, persisted pauses generated by the
> >> > > >> >> pauseJobUpdate RPC and implicit, volatile pauses caused due to the
> >> > > >> >> absence of a sufficiently fresh heartbeat (such as in the case of a
> >> > > >> >> network partition).
> >> > > >> >>
> >> > > >> >> In case a monitoring service detects a problem it should call the
> >> > > >> >> explicit pauseJobUpdate RPC, which will cause a state change that
> >> > > >> >> requires an explicit resumeJobUpdate RPC to resume. That feature
> >> > > >> >> already exists.
> >> > > >> >>
> >> > > >> >> But, we need one more thing to make this reliable - heartbeats to
> >> > > >> >> protect against network partitions between the scheduler and the
> >> > > >> >> monitoring service. These can be volatile and lightweight - the
> >> > > >> >> scheduler just checks for a sufficiently fresh heartbeat before it
> >> > > >> >> performs an update action, and if none is present it simply refuses
> >> > > >> >> to perform the action. If the partition heals a new heartbeat will
> >> > > >> >> arrive (if the update being monitored should still be allowed to
> >> > > >> >> proceed) and the scheduler will allow the update to proceed.
> >> > > >> >>
> >> > > >> >>
> >> > > >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfar...@apache.org>
> >> > > >> >> wrote:
> >> > > >> >>
> >> > > >> >> > I think we should assess that after building the rest of the feature.
> >> > > >> >> > IIUC the rest of the code doesn't care if the update is initially
> >> > > >> >> > paused.
> >> > > >> >> >
> >> > > >> >> > -=Bill
> >> > > >> >> >
> >> > > >> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <ma...@apache.org>
> >> > > >> >> > wrote:
> >> > > >> >> >
> >> > > >> >> > > Can we get a consensus here? Looks like the only sticky point left is
> >> > > >> >> > > around starting an update in paused vs. non-paused state. I can argue
> >> > > >> >> > > either way as it's easy to add later if needed.
> >> > > >> >> > >
> >> > > >> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfar...@apache.org>
> >> > > >> >> > > wrote:
> >> > > >> >> > >
> >> > > >> >> > > > I'm not arguing against the merits of the approach. Just feeling out
> >> > > >> >> > > > whether that should be done _after_ the rest of the heartbeat support.
> >> > > >> >> > > > Seems like it can be cleanly added at the end to get something usable
> >> > > >> >> > > > earlier.
> >> > > >> >> > > >
> >> > > >> >> > > > -=Bill
> >> > > >> >> > > >
> >> > > >> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevi...@apache.org>
> >> > > >> >> > > > wrote:
> >> > > >> >> > > >
> >> > > >> >> > > >> I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
> >> > > >> >> > > >> signal, and punting on a more complex NACK signal (which we'd have to
> >> > > >> >> > > >> reliably persist).
> >> > > >> >> > > >>
> >> > > >> >> > > >> I think the only disagreement in this thread is whether the default state
> >> > > >> >> > > >> for a new update should be running or waiting-for-heartbeat. I think
> >> > > >> >> > > >> waiting for a heartbeat is not only a more correct implementation (no risk
> >> > > >> >> > > >> of acting after a failover but before the heartbeat timeout) but simpler to
> >> > > >> >> > > >> implement (initialize the PulseMonitor data structure as empty rather than
> >> > > >> >> > > >> with a synthetic heartbeat).
> >> > > >> >> > > >>
> >> > > >> >> > > >> From an API consumer perspective the sequence is:
> >> > > >> >> > > >>
> >> > > >> >> > > >> 1. API client sends a startUpdate RPC to the scheduler
> >> > > >> >> > > >> 2. API client receives an OK response, then arranges for something to call
> >> > > >> >> > > >>    heartbeat with that updateId on some interval
> >> > > >> >> > > >> 3. Whatever is supposed to send heartbeats sends one immediately, then
> >> > > >> >> > > >>    starts sending them on some smaller interval
> >> > > >> >> > > >>
> >> > > >> >> > > >> Waiting for the first heartbeat ensures that this sequence has been
> >> > > >> >> > > >> completed successfully, while not waiting for it only ensures that step 1
> >> > > >> >> > > >> has happened.
> >> > > >> >> > > >>
> >> > > >> >> > > >>
> >> > > >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfar...@apache.org>
> >> > > >> >> > > >> wrote:
> >> > > >> >> > > >>
> >> > > >> >> > > >> > Wait - simpler solution than what? We're talking about not doing
> >> > > >> >> > > >> > either.
> >> > > >> >> > > >> >
> >> > > >> >> > > >> > -=Bill
> >> > > >> >> > > >> >
> >> > > >> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevi...@apache.org>
> >> > > >> >> > > >> > wrote:
> >> > > >> >> > > >> >
> >> > > >> >> > > >> > > I think waiting for the first heartbeat before taking any action is the
> >> > > >> >> > > >> > > simpler solution here as it allows the implementation to be entirely
> >> > > >> >> > > >> > > soft-state and still catches the bugs I described.
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId> - heartbeat calls
> >> > > >> >> > > >> > > pulse and mutation operations check isAlive. I think the code might
> >> > > >> >> > > >> > > actually work as-is.
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <ma...@apache.org>
> >> > > >> >> > > >> > > wrote:
> >> > > >> >> > > >> > >
> >> > > >> >> > > >> > > > Pausing update on creation seems like a logical approach when dealing
> >> > > >> >> > > >> > > > with inverted dependency model. I.e. updater is happy to act as long
> >> > > >> >> > > >> > > > as it's greenlighted by the external signal. It's also aligned with a
> >> > > >> >> > > >> > > > failover experience where coordinated updates are rehydrated in paused
> >> > > >> >> > > >> > > > state waiting for HB awakening. That said, I am OK punting it for the
> >> > > >> >> > > >> > > > sake of simplicity for now.
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > > > Kevin?
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfar...@apache.org>
> >> > > >> >> > > >> > > > wrote:
> >> > > >> >> > > >> > > >
> >> > > >> >> > > >> > > > > If the goal is to reduce complexity now and add features later, why not
> >> > > >> >> > > >> > > > > nuke both for now - kick off the update right away, and let lack of
> >> > > >> >> > > >> > > > > heartbeats serve as a uniform "unknown or unhealthy" signal?
> >> > > >> >> > > >> > > > >
> >> > > >> >> > > >> > > > > -=Bill
> >> > > >> >> > > >> > > > >
> >> > > >> >> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <ma...@apache.org>
> >> > > >> >> > > >> > > > > wrote:
> >> > > >> >> > > >> > > > >
> >> > > >> >> > > >> > > > >> I am still +1 on the idea to have default paused state on creation. I
> >> > > >> >> > > >> > > > >> think we could still differentiate between initially paused and timed
> >> > > >> >> > > >> > > > >> out states internally by looking at pause reason. It's quite different
> >> > > >> >> > > >> > > > >> if we want to store explicit NACK reasons from the external service
> >> > > >> >> > > >> > > > >> though. That would require persistence and a bit more complicated
> >> > > >> >> > > >> > > > >> logic.
> >> > > >> >> > > >> > > > >>
> >> > > >> >> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <kevi...@apache.org>
> >> > > >> >> > > >> > > > >> wrote:
> >> > > >> >> > > >> > > > >>
> >> > > >> >> > > >> > > > >> > I like the idea of implementing this scheduler-side purely through
> >> > > >> >> > > >> > > > >> > volatile state, but the lack of feedback (generic vs specific error
> >> > > >> >> > > >> > > > >> > messages when an update is paused) leaves something to be desired.
> >> > > >> >> > > >> > > > >> > Maybe we can address that with a metadata field in the initial call to
> >> > > >> >> > > >> > > > >> > startUpdate (with an optional link to a page where one can get more
> >> > > >> >> > > >> > > > >> > rich information about the state of the monitor sending/not sending
> >> > > >> >> > > >> > > > >> > heartbeats).
> >> > > >> >> > > >> > > > >> >
> >> > > >> >> > > >> > > > >> > The main drawback is that we may have to wait a maximum of one
> >> > > >> >> > > >> > > > >> > heartbeat interval to find out that an update should be paused.
> >> > > >> >> > > >> > > > >> >
> >> > > >> >> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <ma...@apache.org>
> >> > > >> >> > > >> > > > >> > wrote:
> >> > > >> >> > > >> > > > >> >
> >> > > >> >> > > >> > > > >> >> The main reason I preferred the lack-of-ACK approach over an explicit
> >> > > >> >> > > >> > > > >> >> NACK one is simplicity. As Joshua pointed out there is more state to
> >> > > >> >> > > >> > > > >> >> handle in that case. The lack-of-ACK model can be completely
> >> > > >> >> > > >> > > > >> >> implemented in volatile memory sidestepping the persistent storage
> >> > > >> >> > > >> > > > >> >> entirely. With the NACK we would need to reliably persist external
> >> > > >> >> > > >> > > > >> >> service call reasons to survive scheduler failovers. Not a huge
> >> > > >> >> > > >> > > > >> >> challenge but something to keep in mind.
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >> >> I still think the simplicity/reliability tradeoff is acceptable here
> >> > > >> >> > > >> > > > >> >> if we rely on external service to abort heartbeats in case of a health
> >> > > >> >> > > >> > > > >> >> alert fired. This can be explicitly documented as an external
> >> > > >> >> > > >> > > > >> >> integration requirement.
> >> > > >> >> > > >> > > > >> >> However, if the consensus is to go a more
> >> > > >> >> > > >> > > > >> >> reliable (though more complicated) NACK route I am happy to reconsider
> >> > > >> >> > > >> > > > >> >> the current proposal.
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <jco...@twopensource.com>
> >> > > >> >> > > >> > > > >> >> wrote:
> >> > > >> >> > > >> > > > >> >>
> >> > > >> >> > > >> > > > >> >> > "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.
> >> > > >> >> > > >> > > > >> >> > If we are going to let lack-of-ACK serve as the NACK, I don't think it's
> >> > > >> >> > > >> > > > >> >> > safe to resume when we receive another ACK. In other words, a service
> >> > > >> >> > > >> > > > >> >> > toggling unhealthy might not be deemed safe to proceed."
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> > Lack-of-ACK is the scenario where connectivity between the monitor and
> >> > > >> >> > > >> > > > >> >> > the scheduler is unavailable. Shouldn't the NACK scenario (everything is
> >> > > >> >> > > >> > > > >> >> > not ok!)
> >> > > >> >> > > >> > > > >> >> > be handled by the monitoring service triggering an explicit pause?
> >> > > >> >> > > >> > > > >> >> > I.e. section 2 should be updated to say "External service detects service
> >> > > >> >> > > >> > > > >> >> > health problems and pauses the update" and section 4 becomes the current
> >> > > >> >> > > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not be received the scheduler pauses
> >> > > >> >> > > >> > > > >> >> > the update.").
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> > I agree that it's unsafe to resume updates after receiving a heartbeat
> >> > > >> >> > > >> > > > >> >> > after previously pausing due to a missed heartbeat. In that scenario I'd
> >> > > >> >> > > >> > > > >> >> > think we'd want an explicit resumeJobUpdate.
> >> > > >> >> > > >> > > > >> >> > If the scenario we're trying
> >> > > >> >> > > >> > > > >> >> > to handle is *never* received a heartbeat, that's a separate matter, in
> >> > > >> >> > > >> > > > >> >> > that case unpausing upon receiving the first heartbeat would make sense,
> >> > > >> >> > > >> > > > >> >> > but it feels like that complicates things quite a bit (now we need to
> >> > > >> >> > > >> > > > >> >> > differentiate between heartbeat #1 and heartbeat #N).
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org>
> >> > > >> >> > > >> > > > >> >> > wrote:
> >> > > >> >> > > >> > > > >> >> >
> >> > > >> >> > > >> > > > >> >> >> What is the guidance for deploying while the heartbeat service is
> >> > > >> >> > > >> > > > >> >> >> broken? I think I know the answer, but it's important to spell out.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> >> > > >> >> > > >> > > > >> >> >> > state to avoid any progress until the first heartbeat call arrives.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> I'm not sold on this being ultimately beneficial. In the worst case,
> >> > > >> >> > > >> > > > >> >> >> impact is still limited by the health check threshold. Seems like
> >> > > >> >> > > >> > > > >> >> >> premature optimization at best, and an odd one if we proceed without a
> >> > > >> >> > > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Allow resuming of the paused-due-to-no-heartbeat update via a
> >> > > >> >> > > >> > > > >> >> >> > resumeJobUpdate call.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> Are heartbeats required while rolling back? If so, that might impact
> >> > > >> >> > > >> > > > >> >> >> the design here and in other places.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> >> > > >> >> > > >> > > > >> >> >> > heartbeatJobUpdate call.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.
> >> > > >> >> > > >> > > > >> >> >> If we are going to let lack-of-ACK serve as the NACK, I don't think it's
> >> > > >> >> > > >> > > > >> >> >> safe to resume when we receive another ACK. In other words, a service
> >> > > >> >> > > >> > > > >> >> >> toggling unhealthy might not be deemed safe to proceed.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused
> >> > > >> >> > > >> > > > >> >> >> > job update would make more sense as there is nothing monitoring service
> >> > > >> >> > > >> > > > >> >> >> > could do in that case. This should work fine with pause/resume
> >> > > >> >> > > >> > > > >> >> >> > -aware/-agnostic monitoring service implementation.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> This seems reasonable to me - heartbeats for a paused update should not
> >> > > >> >> > > >> > > > >> >> >> pose a risk, but can be safely ignored.
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> -=Bill
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org>
> >> > > >> >> > > >> > > > >> >> >> wrote:
> >> > > >> >> > > >> > > > >> >> >>
> >> > > >> >> > > >> > > > >> >> >> > Agreed. That would be a logical generalization of the post failover
> >> > > >> >> > > >> > > > >> >> >> > behavior.
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >> > I have updated the above document with the following changes:
> >> > > >> >> > > >> > > > >> >> >> > - Reply with PAUSED any time a job was paused by user;
> >> > > >> >> > > >> > > > >> >> >> > - Start in paused state by default.
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org>
> >> > > >> >> > > >> > > > >> >> >> > wrote:
> >> > > >> >> > > >> > > > >> >> >> >
> >> > > >> >> > > >> > > > >> >> >> > > The doc mentioned that the scheduler will start an update subject to
> >> > > >> >> > > >> > > > >> >> >> > > the heartbeat countdown, and if it doesn't receive a heartbeat it will
> >> > > >> >> > > >> > > > >> >> >> > > pause the update.
> >> > > >> >> > > >> > > > >> >> >> > > Why not start with the update paused-due-to-no-heartbeat to
> >> > > >> >> > > >> > > > >> >> >> > > fail-fast any connectivity issues between the service providing the
> >> > > >> >> > > >> > > > >> >> >> > > heartbeats and the scheduler?
> >> > > >> >> > > >> > > > >> >> >> > >
> >> > > >> >> > > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <ma...@apache.org>
> >> > > >> >> > > >> > > > >> >> >> > > wrote:
> >> > > >> >> > > >> > > > >> >> >> > >
> >> > > >> >> > > >> > > > >> >> >> > >> Hi all,
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> We are proposing a new feature for the scheduler updater, which you
> >> > > >> >> > > >> > > > >> >> >> > >> may find helpful.
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> I have posted a brief feature summary here:
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> Please, reply with your feedback/concerns/comments.
> >> > > >> >> > > >> > > > >> >> >> > >>
> >> > > >> >> > > >> > > > >> >> >> > >> Thanks,
> >> > > >> >> > > >> > > > >> >> >> > >> Maxim
> >> > > >> >>
> >> > > >> >> --
> >> > > >> >> Kevin Sweeney
> >> > > >> >> @kts
> >> >
> >> > --
> >> > -=Bill
> >
> > --
> > Kevin Sweeney
> > @kts
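
For readers skimming the thread, the two kinds of pause that Kevin describes (explicit, persisted pauses via pauseJobUpdate and implicit, volatile pauses from a missing heartbeat) can be sketched roughly as below. This is only an illustration under stated assumptions, not the scheduler's code: the `PulseMonitor` class here is a hypothetical Python stand-in for the PulseMonitorImpl mentioned above, and `UpdateGate`, `may_act`, the timeout value, and the injectable clock are names made up for the example. The in-memory set stands in for whatever persisted pause state the scheduler actually keeps.

```python
import time
from typing import Callable, Dict, Hashable, Set


class PulseMonitor:
    """Volatile record of the most recent heartbeat per update ID (hypothetical)."""

    def __init__(self, timeout_secs: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_secs = timeout_secs
        self._clock = clock
        # Starts empty: no synthetic heartbeat, so a new update waits
        # until the first real heartbeat arrives.
        self._last_pulse: Dict[Hashable, float] = {}

    def pulse(self, update_id: Hashable) -> None:
        """Record a heartbeat for the given update."""
        self._last_pulse[update_id] = self._clock()

    def is_alive(self, update_id: Hashable) -> bool:
        """True only if a sufficiently fresh heartbeat has been seen."""
        last = self._last_pulse.get(update_id)
        return last is not None and self._clock() - last <= self._timeout_secs


class UpdateGate:
    """Combines the explicit pause flag with the volatile heartbeat check."""

    def __init__(self, monitor: PulseMonitor) -> None:
        self._monitor = monitor
        # Assumption: stands in for persisted state set by pauseJobUpdate.
        self._paused: Set[Hashable] = set()

    def pause(self, update_id: Hashable) -> None:
        self._paused.add(update_id)

    def resume(self, update_id: Hashable) -> None:
        self._paused.discard(update_id)

    def may_act(self, update_id: Hashable) -> bool:
        """An update action is allowed only if not explicitly paused
        and a fresh heartbeat is present."""
        return update_id not in self._paused and self._monitor.is_alive(update_id)
```

In this sketch a heartbeat that arrives after a network partition heals makes `may_act` return true again without any explicit resume, while an update paused through `pause` stays blocked until `resume` is called, regardless of heartbeats.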