I inferred that authentication was required due to the presence of a SessionKey in the RPC. Of course, any authentication mechanism here could have serious scaling issues (barring something like HTTP basic auth checked in memory).

On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen <jco...@twopensource.com> wrote:
> What are our thoughts about authentication with regards to heartbeats? It
> seems like they should be authenticated since there does exist the
> potential for a malicious actor to send its own heartbeats even if the real
> monitoring service has detected a problem and ceased sending heartbeats.
> I'm not sure exactly how large the attack surface is (if the service is
> truly down the scheduler would detect that and roll back the update
> regardless), but I think it's worth discussing as we work on the initial
> design.
>
> On Wed, Oct 15, 2014 at 7:49 PM, Bill Farner <wfar...@apache.org> wrote:
>> David - the plan is to synthesize the waiting state. Exactly how is not
>> yet certain.
>>
>> On Wednesday, October 15, 2014, Maxim Khutornenko <ma...@apache.org> wrote:
>>> It is certainly possible to add new state or a status message but I
>>> don't think it's a blocker for the first iteration. Provided there is
>>> enough demand a state/message could be synthesized during the 'get'
>>> call based on the volatile state.
>>>
>>> On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin <da...@dmclaughlin.com> wrote:
>>>> +1 for pause being explicit RPC pauses, but does it really add complexity
>>>> to just add a new state (WAITING?) when no heartbeat is sent? Not being
>>>> able to see that an update was blocked because of a lack of heartbeat seems
>>>> like a missing feature.
>>>>
>>>> On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>> +1.
>>>>> Updated the doc:
>>>>>
>>>>> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
>>>>>
>>>>> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>> +1 to the scheduler not proceeding on an update when heartbeats are absent,
>>>>>> and requiring the heartbeat service to explicitly call pauseJobUpdate when
>>>>>> it detects problems.
>>>>>>
>>>>>> -=Bill
>>>>>>
>>>>>> On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney <kswee...@twitter.com.invalid> wrote:
>>>>>>> Chatted with Maxim and Bill, I think we figured it out
>>>>>>>
>>>>>>> I think the confusion stems from the fact that there are two types of
>>>>>>> pauses in this system, explicit, persisted pauses generated by the
>>>>>>> pauseJobUpdate RPC and implicit, volatile pauses caused due to the absence
>>>>>>> of a sufficiently fresh heartbeat (such as in the case of a network
>>>>>>> partition).
>>>>>>>
>>>>>>> In case a monitoring service detects a problem it should call the explicit
>>>>>>> pauseJobUpdate RPC, which will cause a state change that requires an
>>>>>>> explicit resumeJobUpdate RPC to resume. That feature already exists.
>>>>>>>
>>>>>>> But, we need one more thing to make this reliable - heartbeats to protect
>>>>>>> against network partitions between the scheduler and the monitoring
>>>>>>> service. These can be volatile and lightweight - the scheduler just checks
>>>>>>> for a sufficiently fresh heartbeat before it performs an update action, and
>>>>>>> if none is present it simply refuses to perform the action.
>>>>>>> If the partition heals a new heartbeat will arrive (if the update being
>>>>>>> monitored should still be allowed to proceed) and the scheduler will allow
>>>>>>> the update to proceed.
>>>>>>>
>>>>>>> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>> I think we should assess that after building the rest of the feature. IIUC
>>>>>>>> the rest of the code doesn't care if the update is initially paused.
>>>>>>>>
>>>>>>>> -=Bill
>>>>>>>>
>>>>>>>> On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>>>>>> Can we get a consensus here? Looks like the only sticky point left is
>>>>>>>>> around starting an update in paused vs. non-paused state. I can argue
>>>>>>>>> either way as it's easy to add later if needed.
>>>>>>>>>
>>>>>>>>> On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>> I'm not arguing against the merits of the approach. Just feeling out
>>>>>>>>>> whether that should be done _after_ the rest of the heartbeat support.
>>>>>>>>>> Seems like it can be cleanly added at the end to get something usable
>>>>>>>>>> earlier.
>>>>>>>>>>
>>>>>>>>>> -=Bill
>>>>>>>>>>
>>>>>>>>>> On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney <kevi...@apache.org> wrote:
>>>>>>>>>>> I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy
>>>>>>>>>>> signal, and punting on a more complex NACK signal (which we'd have to
>>>>>>>>>>> reliably persist).
>>>>>>>>>>>
>>>>>>>>>>> I think the only disagreement in this thread is whether the default state
>>>>>>>>>>> for a new update should be running or waiting-for-heartbeat. I think
>>>>>>>>>>> waiting for a heartbeat is not only a more correct implementation (no risk
>>>>>>>>>>> of acting after a failover but before the heartbeat timeout) but simpler to
>>>>>>>>>>> implement (initialize the PulseMonitor data structure as empty rather than
>>>>>>>>>>> with a synthetic heartbeat).
>>>>>>>>>>>
>>>>>>>>>>> From an API consumer perspective the sequence is:
>>>>>>>>>>>
>>>>>>>>>>> 1. API client sends a startUpdate RPC to the scheduler
>>>>>>>>>>> 2. API client receives an OK response, then arranges for something to call
>>>>>>>>>>>    heartbeat with that updateId on some interval
>>>>>>>>>>> 3. Whatever is supposed to send heartbeats sends one immediately, then
>>>>>>>>>>>    starts sending them on some smaller interval
>>>>>>>>>>>
>>>>>>>>>>> Waiting for the first heartbeat ensures that this sequence has been
>>>>>>>>>>> completed successfully, while not waiting for it only ensures that step 1
>>>>>>>>>>> has happened.
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>> Wait - simpler solution than what? We're talking about not doing either.
>>>>>>>>>>>>
>>>>>>>>>>>> -=Bill
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney <kevi...@apache.org> wrote:
>>>>>>>>>>>>> I think waiting for the first heartbeat before taking any action is the
>>>>>>>>>>>>> simpler solution here as it allows the implementation to be entirely
>>>>>>>>>>>>> soft-state and still catches the bugs I described.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The implementation is just PulseMonitorImpl<UpdateId> - heartbeat calls
>>>>>>>>>>>>> pulse and mutation operations check isAlive. I think the code might
>>>>>>>>>>>>> actually work as-is.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>>>>>>>>>>> Pausing update on creation seems like a logical approach when dealing
>>>>>>>>>>>>>> with inverted dependency model. I.e. updater is happy to act as long
>>>>>>>>>>>>>> as it's greenlighted by the external signal. It's also aligned with a
>>>>>>>>>>>>>> failover experience where coordinated updates are rehydrated in paused
>>>>>>>>>>>>>> state waiting for HB awakening. That said, I am OK punting it for the
>>>>>>>>>>>>>> sake of simplicity for now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kevin?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>>> If the goal is to reduce complexity now and add features later, why not
>>>>>>>>>>>>>>> nuke both for now - kick off the update right away, and let lack of
>>>>>>>>>>>>>>> heartbeats serve as a uniform "unknown or unhealthy" signal?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -=Bill
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>>>>>>>>>>>>> I am still +1 on the idea to have default paused state on creation. I
>>>>>>>>>>>>>>>> think we could still differentiate between initially paused and timed
>>>>>>>>>>>>>>>> out states internally by looking at pause reason. It's quite different
>>>>>>>>>>>>>>>> if we want to store explicit NACK reasons from the external service
>>>>>>>>>>>>>>>> though. That would require persistence and a bit more complicated
>>>>>>>>>>>>>>>> logic.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney <kevi...@apache.org> wrote:
>>>>>>>>>>>>>>>>> I like the idea of implementing this scheduler-side purely through volatile
>>>>>>>>>>>>>>>>> state, but the lack of feedback (generic vs specific error messages when an
>>>>>>>>>>>>>>>>> update is paused) leaves something to be desired. Maybe we can address that
>>>>>>>>>>>>>>>>> with a metadata field in the initial call to startUpdate (with an optional
>>>>>>>>>>>>>>>>> link to a page where one can get more rich information about the state of
>>>>>>>>>>>>>>>>> the monitor sending/not sending heartbeats).
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> The main drawback is that we may have to wait a maximum of one heartbeat
>>>>>>>>>>>>>>>>> interval to find out that an update should be paused.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>> The main reason I preferred the lack-of-ACK approach over an explicit
>>>>>>>>>>>>>>>>>> NACK one is simplicity.
>>>>>>>>>>>>>>>>>> As Joshua pointed out there is more state to
>>>>>>>>>>>>>>>>>> handle in that case. The lack-of-ACK model can be completely
>>>>>>>>>>>>>>>>>> implemented in volatile memory sidestepping the persistent storage
>>>>>>>>>>>>>>>>>> entirely. With the NACK we would need to reliably persist external
>>>>>>>>>>>>>>>>>> service call reasons to survive scheduler failovers. Not a huge
>>>>>>>>>>>>>>>>>> challenge but something to keep in mind.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> I still think the simplicity/reliability tradeoff is acceptable here
>>>>>>>>>>>>>>>>>> if we rely on external service to abort heartbeats in case of a health
>>>>>>>>>>>>>>>>>> alert fired. This can be explicitly documented as an external
>>>>>>>>>>>>>>>>>> integration requirement. However, if the consensus is to go a more
>>>>>>>>>>>>>>>>>> reliable (though more complicated) NACK route I am happy to reconsider
>>>>>>>>>>>>>>>>>> the current proposal.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen <jco...@twopensource.com> wrote:
>>>>>>>>>>>>>>>>>>> "The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.
>>>>>>>>>>>>>>>>>>> If we are going to let lack-of-ACK serve as the NACK, i don't think it's
>>>>>>>>>>>>>>>>>>> safe to resume when we receive another ACK. In other words, a service
>>>>>>>>>>>>>>>>>>> toggling unhealthy might not be deemed safe to proceed."
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Lack-of-ACK is the scenario where connectivity between the monitor and the
>>>>>>>>>>>>>>>>>>> scheduler is unavailable. Shouldn't the NACK scenario (everything is not
>>>>>>>>>>>>>>>>>>> ok!) be handled by the monitoring service triggering an explicit pause?
>>>>>>>>>>>>>>>>>>> I.e. section 2 should be updated to say "External service detects service
>>>>>>>>>>>>>>>>>>> health problems and pauses the update" and section 4 becomes the current
>>>>>>>>>>>>>>>>>>> section 2 (i.e. "Should a heartbeat not be received the scheduler pauses
>>>>>>>>>>>>>>>>>>> the update.").
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I agree that it's unsafe to resume updates after receiving a heartbeat
>>>>>>>>>>>>>>>>>>> after previously pausing due to a missed heartbeat. In that scenario I'd
>>>>>>>>>>>>>>>>>>> think we'd want an explicit resumeJobUpdate. If the scenario we're trying
>>>>>>>>>>>>>>>>>>> to handle is *never* received a heartbeat, that's a separate matter, in
>>>>>>>>>>>>>>>>>>> that case unpausing upon receiving the first heartbeat would make sense,
>>>>>>>>>>>>>>>>>>> but it feels like that complicates things quite a bit (now we need to
>>>>>>>>>>>>>>>>>>> differentiate between heartbeat #1 and heartbeat #N).
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>> What is the guidance for deploying while the heartbeat service is broken?
>>>>>>>>>>>>>>>>>>>> I think i know the answer, but it's important to spell out.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
>>>>>>>>>>>>>>>>>>>>> state to avoid any progress until the first heartbeat call arrives.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> I'm not sold on this being ultimately beneficial. In the worst case,
>>>>>>>>>>>>>>>>>>>> impact is still limited by the health check threshold. Seems like
>>>>>>>>>>>>>>>>>>>> premature optimization at best, and an odd one if we proceed without a
>>>>>>>>>>>>>>>>>>>> 'NACK' signal via the heartbeatJobUpdate RPC.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Allow resuming of the paused-due-to-no-heartbeat update via a
>>>>>>>>>>>>>>>>>>>>> resumeJobUpdate call.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Are heartbeats required while rolling back? If so, that might impact the
>>>>>>>>>>>>>>>>>>>> design here and in other places.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Allow resuming of the paused-due-to-no-heartbeat update via a fresh
>>>>>>>>>>>>>>>>>>>>> heartbeatJobUpdate call.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
>>>>>>>>>>>>>>>>>>>> we are going to let lack-of-ACK serve as the NACK, i don't think it's safe
>>>>>>>>>>>>>>>>>>>> to resume when we receive another ACK. In other words, a service toggling
>>>>>>>>>>>>>>>>>>>> unhealthy might not be deemed safe to proceed.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused job
>>>>>>>>>>>>>>>>>>>>> update would make more sense as there is nothing monitoring service could
>>>>>>>>>>>>>>>>>>>>> do in that case. This should work fine with pause/resume -aware/-agnostic
>>>>>>>>>>>>>>>>>>>>> monitoring service implementation.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> This seems reasonable to me - heartbeats for a paused update should not
>>>>>>>>>>>>>>>>>>>> pose a risk, but can be safely ignored.
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> -=Bill
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>> Agreed. That would be a logical generalization of the post failover
>>>>>>>>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> I have updated the above document with the following changes:
>>>>>>>>>>>>>>>>>>>>> - Reply with PAUSED any time a job was paused by user;
>>>>>>>>>>>>>>>>>>>>> - Start in paused state by default.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>>> The doc mentioned that the scheduler will start an update subject to the
>>>>>>>>>>>>>>>>>>>>>> heartbeat countdown, and if it doesn't receive a heartbeat it will pause
>>>>>>>>>>>>>>>>>>>>>> the update.
>>>>>>>>>>>>>>>>>>>>>> Why not start with the update paused-due-to-no-heartbeat to
>>>>>>>>>>>>>>>>>>>>>> fail-fast any connectivity issues between the service providing the
>>>>>>>>>>>>>>>>>>>>>> heartbeats and the scheduler?
>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>> On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <ma...@apache.org> wrote:
>>>>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> We are proposing a new feature for the scheduler updater, which you
>>>>>>>>>>>>>>>>>>>>>>> may find helpful.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> I have posted a brief feature summary here:
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Please, reply with your feedback/concerns/comments.
>>>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>>>>>>>> Maxim
>>>>>>>
>>>>>>> --
>>>>>>> Kevin Sweeney
>>>>>>> @kts
>>
>> --
>> -=Bill
>

--
Kevin Sweeney
@kts
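
For anyone skimming the thread later, here is a rough sketch of the two pause types discussed above: an explicit pause that needs an explicit resume, and an implicit, volatile pause whenever there is no sufficiently fresh heartbeat. This is not the actual PulseMonitor code in the scheduler. The class name `HeartbeatGate`, its method names, the string update IDs, and the injected clock are all made up for illustration. It starts with no heartbeats recorded, so a new update waits for its first heartbeat, which is only one of the options debated above.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

// Hypothetical sketch only; names and types are assumptions, not Aurora's API.
final class HeartbeatGate {
  private final long timeoutMs;
  private final LongSupplier clockMs;

  // Volatile state: last heartbeat time per update ID. It is lost on failover,
  // so every update then waits for a fresh heartbeat before acting again.
  private final Map<String, Long> lastPulse = new ConcurrentHashMap<>();

  // Explicit pauses. A real scheduler would persist these; kept in memory here.
  private final Set<String> explicitlyPaused = ConcurrentHashMap.newKeySet();

  HeartbeatGate(long timeoutMs, LongSupplier clockMs) {
    this.timeoutMs = timeoutMs;
    this.clockMs = clockMs;
  }

  // Called for each heartbeat RPC: record the arrival time (the "pulse").
  void heartbeat(String updateId) {
    lastPulse.put(updateId, clockMs.getAsLong());
  }

  // Explicit pause and resume, analogous to pauseJobUpdate / resumeJobUpdate.
  void pause(String updateId) {
    explicitlyPaused.add(updateId);
  }

  void resume(String updateId) {
    explicitlyPaused.remove(updateId);
  }

  // True only if a heartbeat was seen within the timeout window.
  boolean isAlive(String updateId) {
    Long last = lastPulse.get(updateId);
    return last != null && clockMs.getAsLong() - last <= timeoutMs;
  }

  // Checked before every update action: refuse when explicitly paused
  // or when no sufficiently fresh heartbeat is present.
  boolean mayAct(String updateId) {
    return !explicitlyPaused.contains(updateId) && isAlive(updateId);
  }

  public static void main(String[] args) {
    long[] now = {0L};
    HeartbeatGate gate = new HeartbeatGate(1000L, () -> now[0]);

    System.out.println("before first heartbeat: " + gate.mayAct("u1"));
    gate.heartbeat("u1");
    System.out.println("after heartbeat: " + gate.mayAct("u1"));
    now[0] = 1500L; // simulate a partition longer than the timeout
    System.out.println("heartbeat stale: " + gate.mayAct("u1"));
    gate.heartbeat("u1"); // partition heals
    gate.pause("u1");     // monitoring service detects a problem
    System.out.println("explicitly paused: " + gate.mayAct("u1"));
  }
}
```

Note that in this sketch a fresh heartbeat after a stale period lets the update proceed again, while an explicit pause blocks it until `resume` is called regardless of heartbeats.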