Correct. The presence of the SessionKey does indeed mean that heartbeats are going to be authenticated. Given that external service has to solve authentication story to use pauseJobUpdate anyway, having heartbeats authenticated seems like a natural progression. Also, given our current admin thrift interface it's easier to have an authenticated RPC in it rather than not (scheduler_client.py automatically injects SessionKey arg into all methods that are not part of ReadOnlyScheduler interface).
On Thu, Oct 16, 2014 at 10:52 AM, Kevin Sweeney <kswee...@twitter.com.invalid> wrote: > I inferred that authentication was required due to the presence of a > SessionKey in the RPC. Of course any authentication mechanism here could > have serious scaling issues (barring something like HTTP basic auth in > memory) > > On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen <jco...@twopensource.com> > wrote: > >> What are our thoughts about authentication with regards to heartbeats? It >> seems like they should be authenticated since there does exist the >> potential for a malicious actor to send its own heartbeats even if the real >> monitoring service has detected a problem and ceased sending heartbeats. >> I'm not sure exactly how large the attack surface is (if the service is >> truly down the scheduler would detect that and roll back the update >> regardless), but I think it's worth discussing as we work on the initial >> design. >> >> On Wed, Oct 15, 2014 at 7:49 PM, Bill Farner <wfar...@apache.org> wrote: >> >> > David - the plan is to synthesize the waiting state. Exactly how is not >> > yet certain. >> > >> > On Wednesday, October 15, 2014, Maxim Khutornenko <ma...@apache.org> >> > wrote: >> > >> > > It is certainly possible to add new state or a status message but I >> > > don't think it's a blocker for the first iteration. Provided there is >> > > enough demand a state/message could be synthesized during the 'get' >> > > call based on the volatile state. >> > > >> > > On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin < >> da...@dmclaughlin.com >> > > <javascript:;>> wrote: >> > > > +1 for pause being explicit RPC pauses, but does it really add >> > complexity >> > > > to just add a new state (WAITING?) when no heartbeat is sent? Not >> being >> > > > able to see that an update was blocked because of a lack of heartbeat >> > > seems >> > > > like a missing feature. >> > > > >> > > > On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <ma...@apache.org >> > > <javascript:;>> wrote: >> > > > >> > > >> +1. Updated the doc: >> > > >> >> > > >> >> > > >> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md >> > > >> >> > > >> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org >> > > <javascript:;>> wrote: >> > > >> > +1 to the scheduler not proceeding on an update when heartbeats >> are >> > > >> absent, >> > > >> > and requiring the heartbeat service to explicitly call >> > pauseJobUpdate >> > > >> when >> > > >> > it detects problems. >> > > >> > >> > > >> > -=Bill >> > > >> > >> > > >> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney >> > > >> <kswee...@twitter.com.invalid >> > > >> >> wrote: >> > > >> > >> > > >> >> Chatted with Maxim and Bill, I think we figured it out >> > > >> >> >> > > >> >> I think the confusion stems from the fact that there are two >> types >> > of >> > > >> >> pauses in this system, explicit, persisted pauses generated by >> the >> > > >> >> pauseJobUpdate RPC and implicit, volatile pauses caused due to >> the >> > > >> absence >> > > >> >> of a sufficiently fresh heartbeat (such as in the case of a >> network >> > > >> >> partition). >> > > >> >> >> > > >> >> In case a monitoring service detects a problem it should call the >> > > >> explicit >> > > >> >> pauseJobUpdate RPC, which will cause a state change that requires >> > an >> > > >> >> explicit resumeJobUpdate RPC to resume. That feature already >> > exists. >> > > >> >> >> > > >> >> But, we need one more thing to make this reliable - heartbeats to >> > > >> protect >> > > >> >> against network partitions between the scheduler and the >> monitoring >> > > >> >> service. These can be volatile and lightweight - the scheduler >> just >> > > >> checks >> > > >> >> for a sufficiently fresh heartbeat before it performs an update >> > > action, >> > > >> and >> > > >> >> if none is present it simply refuses to perform the action. If >> the >> > > >> >> partition heals a new heartbeat will arrive (if the update being >> > > >> monitored >> > > >> >> should still be allowed to proceed) and the scheduler will allow >> > the >> > > >> update >> > > >> >> to proceed. >> > > >> >> >> > > >> >> >> > > >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner < >> wfar...@apache.org >> > > <javascript:;>> >> > > >> wrote: >> > > >> >> >> > > >> >> > I think we should assess that after building the rest of the >> > > feature. >> > > >> >> IIUC >> > > >> >> > the rest of the code doesn't care if the update is initially >> > > paused. >> > > >> >> > >> > > >> >> > -=Bill >> > > >> >> > >> > > >> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko < >> > > ma...@apache.org <javascript:;> >> > > >> > >> > > >> >> > wrote: >> > > >> >> > >> > > >> >> > > Can we get a consensus here? Looks like the only sticky point >> > > left >> > > >> is >> > > >> >> > > around starting an update in paused vs. non-paused state. I >> can >> > > >> argue >> > > >> >> > > either way as it's easy to add later if needed. >> > > >> >> > > >> > > >> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner < >> > wfar...@apache.org >> > > <javascript:;>> >> > > >> >> wrote: >> > > >> >> > > > I'm not arguing against the merits of the approach. Just >> > > feeling >> > > >> out >> > > >> >> > > > whether that should be done _after_ the rest of the >> heartbeat >> > > >> >> support. >> > > >> >> > > > Seems like it can be cleanly added at the end to get >> > something >> > > >> usable >> > > >> >> > > > earlier. >> > > >> >> > > > >> > > >> >> > > > -=Bill >> > > >> >> > > > >> > > >> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney < >> > > >> kevi...@apache.org <javascript:;>> >> > > >> >> > > wrote: >> > > >> >> > > > >> > > >> >> > > >> I'm +1 for using lack of heartbeats as a uniform >> > > >> >> unknown-or-unhealthy >> > > >> >> > > >> signal, and punting on a more complex NACK signal (which >> > we'd >> > > >> have >> > > >> >> to >> > > >> >> > > >> reliably persist). >> > > >> >> > > >> >> > > >> >> > > >> I think the only disagreement in this thread is whether >> the >> > > >> default >> > > >> >> > > state >> > > >> >> > > >> for a new update should be running or >> > waiting-for-heartbeat. I >> > > >> think >> > > >> >> > > >> waiting for a heartbeat is not only a more correct >> > > implementation >> > > >> >> (no >> > > >> >> > > risk >> > > >> >> > > >> of acting after a failover but before the heartbeat >> timeout) >> > > but >> > > >> >> > > simpler to >> > > >> >> > > >> implement (initialize the PulseMonitor data structure as >> > empty >> > > >> >> rather >> > > >> >> > > than >> > > >> >> > > >> with a synthetic heartbeat). >> > > >> >> > > >> >> > > >> >> > > >> From an API consumer perspective the sequence is: >> > > >> >> > > >> >> > > >> >> > > >> 1. API client sends a startUpdate RPC to the scheduler >> > > >> >> > > >> 2. API client receives an OK response, then arranges for >> > > >> something >> > > >> >> to >> > > >> >> > > call >> > > >> >> > > >> heartbeat with that updateId on some interval >> > > >> >> > > >> 3. Whatever is supposed to send heartbeats sends one >> > > immediately, >> > > >> >> then >> > > >> >> > > >> starts sending them on some smaller interval >> > > >> >> > > >> >> > > >> >> > > >> Waiting for the first heartbeat ensures that this sequence >> > has >> > > >> been >> > > >> >> > > >> completed successfully, while not waiting for it only >> ensure >> > > that >> > > >> >> > step 1 >> > > >> >> > > >> has happened. >> > > >> >> > > >> >> > > >> >> > > >> >> > > >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner < >> > > >> wfar...@apache.org <javascript:;>> >> > > >> >> > > wrote: >> > > >> >> > > >> >> > > >> >> > > >> > Wait - simpler solution than what? We're talking about >> > not >> > > >> doing >> > > >> >> > > either. >> > > >> >> > > >> > >> > > >> >> > > >> > -=Bill >> > > >> >> > > >> > >> > > >> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney < >> > > >> >> kevi...@apache.org <javascript:;> >> > > >> >> > > >> > > >> >> > > >> > wrote: >> > > >> >> > > >> > >> > > >> >> > > >> > > I think waiting for the first heartbeat before taking >> > any >> > > >> action >> > > >> >> > is >> > > >> >> > > the >> > > >> >> > > >> > > simpler solution here as it allows the implementation >> to >> > > be >> > > >> >> > entirely >> > > >> >> > > >> > > soft-state and still catches the bugs I described. >> > > >> >> > > >> > > >> > > >> >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId> >> - >> > > >> >> heartbeat >> > > >> >> > > calls >> > > >> >> > > >> > > pulse and mutation operations check isAlive. I think >> the >> > > code >> > > >> >> > might >> > > >> >> > > >> > > actually work as-is. >> > > >> >> > > >> > > >> > > >> >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko < >> > > >> >> > > ma...@apache.org <javascript:;>> >> > > >> >> > > >> > > wrote: >> > > >> >> > > >> > > >> > > >> >> > > >> > > > Pausing update on creation seems like a logical >> > approach >> > > >> when >> > > >> >> > > dealing >> > > >> >> > > >> > > > with inverted dependency model. I.e. updater is >> happy >> > to >> > > >> act >> > > >> >> as >> > > >> >> > > long >> > > >> >> > > >> > > > as it's greenlighted by the external signal. It's >> also >> > > >> aligned >> > > >> >> > > with a >> > > >> >> > > >> > > > failover experience where coordinated updates are >> > > >> rehydrated >> > > >> >> in >> > > >> >> > > >> paused >> > > >> >> > > >> > > > state waiting for HB awakening. That said, I am OK >> > > punting >> > > >> it >> > > >> >> > for >> > > >> >> > > the >> > > >> >> > > >> > > > sake of simplicity for now. >> > > >> >> > > >> > > > >> > > >> >> > > >> > > > Kevin? >> > > >> >> > > >> > > > >> > > >> >> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner < >> > > >> >> > wfar...@apache.org <javascript:;> >> > > >> >> > > > >> > > >> >> > > >> > > wrote: >> > > >> >> > > >> > > > > If the goal is to reduce complexity now and add >> > > features >> > > >> >> > later, >> > > >> >> > > why >> > > >> >> > > >> > not >> > > >> >> > > >> > > > > nuke both for now - kick off the update right >> away, >> > > and >> > > >> let >> > > >> >> > > lack of >> > > >> >> > > >> > > > > heartbeats serve as a uniform "unknown or >> unhealthy" >> > > >> signal? >> > > >> >> > > >> > > > > >> > > >> >> > > >> > > > > -=Bill >> > > >> >> > > >> > > > > >> > > >> >> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim >> Khutornenko < >> > > >> >> > > >> ma...@apache.org <javascript:;> >> > > >> >> > > >> > > >> > > >> >> > > >> > > > wrote: >> > > >> >> > > >> > > > > >> > > >> >> > > >> > > > >> I am still +1 on the idea to have default paused >> > > state >> > > >> on >> > > >> >> > > >> creation. >> > > >> >> > > >> > I >> > > >> >> > > >> > > > >> think we could still differentiate between >> > initially >> > > >> paused >> > > >> >> > and >> > > >> >> > > >> > timed >> > > >> >> > > >> > > > >> out states internally by looking at pause reason. >> > > It's >> > > >> >> quite >> > > >> >> > > >> > different >> > > >> >> > > >> > > > >> if we want to store explicit NACK reasons from >> the >> > > >> external >> > > >> >> > > >> service >> > > >> >> > > >> > > > >> though. That would require persistence and a bit >> > more >> > > >> >> > > complicated >> > > >> >> > > >> > > > >> logic. >> > > >> >> > > >> > > > >> >> > > >> >> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney < >> > > >> >> > > >> kevi...@apache.org <javascript:;>> >> > > >> >> > > >> > > > wrote: >> > > >> >> > > >> > > > >> > I like the idea of implementing this >> > scheduler-side >> > > >> >> purely >> > > >> >> > > >> through >> > > >> >> > > >> > > > >> volatile >> > > >> >> > > >> > > > >> > state, but the lack of feedback (generic vs >> > > specific >> > > >> >> error >> > > >> >> > > >> > messages >> > > >> >> > > >> > > > when >> > > >> >> > > >> > > > >> an >> > > >> >> > > >> > > > >> > update is paused) leaves something to be >> desired. >> > > >> Maybe >> > > >> >> we >> > > >> >> > > can >> > > >> >> > > >> > > address >> > > >> >> > > >> > > > >> that >> > > >> >> > > >> > > > >> > with a metadata field in the initial call to >> > > >> startUpdate >> > > >> >> > > (with >> > > >> >> > > >> an >> > > >> >> > > >> > > > >> optional >> > > >> >> > > >> > > > >> > link to a page where one can get more rich >> > > information >> > > >> >> > about >> > > >> >> > > the >> > > >> >> > > >> > > > state of >> > > >> >> > > >> > > > >> > the monitor sending/not sending heartbeats). >> > > >> >> > > >> > > > >> > >> > > >> >> > > >> > > > >> > The main drawback is that we may have to wait a >> > > >> maximum >> > > >> >> of >> > > >> >> > > one >> > > >> >> > > >> > > > heartbeat >> > > >> >> > > >> > > > >> > interval to find out that an update should be >> > > paused. >> > > >> >> > > >> > > > >> > >> > > >> >> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim >> > Khutornenko >> > > < >> > > >> >> > > >> > > ma...@apache.org <javascript:;>> >> > > >> >> > > >> > > > >> wrote: >> > > >> >> > > >> > > > >> > >> > > >> >> > > >> > > > >> >> The main reason I preferred the lack-of-ACK >> > > approach >> > > >> >> over >> > > >> >> > an >> > > >> >> > > >> > > explicit >> > > >> >> > > >> > > > >> >> NACK one is simplicity. As Joshua pointed out >> > > there >> > > >> is >> > > >> >> > more >> > > >> >> > > >> state >> > > >> >> > > >> > > to >> > > >> >> > > >> > > > >> >> handle in that case. The lack-of-ACK model can >> > be >> > > >> >> > completely >> > > >> >> > > >> > > > >> >> implemented in volatile memory sidestepping >> the >> > > >> >> persistent >> > > >> >> > > >> > storage >> > > >> >> > > >> > > > >> >> entirely. With the NACK we would need to >> > reliably >> > > >> >> persist >> > > >> >> > > >> > external >> > > >> >> > > >> > > > >> >> service call reasons to survive scheduler >> > > failovers. >> > > >> >> Not a >> > > >> >> > > huge >> > > >> >> > > >> > > > >> >> challenge but something to keep in mind. >> > > >> >> > > >> > > > >> >> >> > > >> >> > > >> > > > >> >> I still think the simplicity/reliability >> > tradeoff >> > > is >> > > >> >> > > acceptable >> > > >> >> > > >> > > here >> > > >> >> > > >> > > > >> >> if we rely on external service to abort >> > > heartbeats in >> > > >> >> case >> > > >> >> > > of a >> > > >> >> > > >> > > > health >> > > >> >> > > >> > > > >> >> alert fired. This can be explicitly documented >> > as >> > > an >> > > >> >> > > external >> > > >> >> > > >> > > > >> >> integration requirement. However, If the >> > consensus >> > > >> is to >> > > >> >> > go >> > > >> >> > > a >> > > >> >> > > >> > more >> > > >> >> > > >> > > > >> >> reliable (though more complicated) NACK route >> I >> > am >> > > >> happy >> > > >> >> > to >> > > >> >> > > >> > > > reconsider >> > > >> >> > > >> > > > >> >> the current proposal. >> > > >> >> > > >> > > > >> >> >> > > >> >> > > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen >> < >> > > >> >> > > >> > > > jco...@twopensource.com <javascript:;>> >> > > >> >> > > >> > > > >> >> wrote: >> > > >> >> > > >> > > > >> >> > "The heratbeatJobUpdate RPC serves as an >> ACK, >> > > but >> > > >> we >> > > >> >> > don't >> > > >> >> > > >> > have a >> > > >> >> > > >> > > > >> NACK. >> > > >> >> > > >> > > > >> >> If >> > > >> >> > > >> > > > >> >> > we are going to let lack-of-ACK serve as the >> > > NACK, >> > > >> i >> > > >> >> > don't >> > > >> >> > > >> > think >> > > >> >> > > >> > > > it's >> > > >> >> > > >> > > > >> >> safe >> > > >> >> > > >> > > > >> >> > to resume when we receive another ACK. In >> > other >> > > >> >> words, >> > > >> >> > a >> > > >> >> > > >> > service >> > > >> >> > > >> > > > >> >> toggling >> > > >> >> > > >> > > > >> >> > unhealthy might not be deemed safe to >> > proceed." >> > > >> >> > > >> > > > >> >> > >> > > >> >> > > >> > > > >> >> > Lack-of-ACK is the scenario where >> connectivity >> > > >> between >> > > >> >> > the >> > > >> >> > > >> > > monitor >> > > >> >> > > >> > > > and >> > > >> >> > > >> > > > >> >> the >> > > >> >> > > >> > > > >> >> > scheduler is unavailable. Shouldn't the NACK >> > > >> scenario >> > > >> >> > > >> > (everything >> > > >> >> > > >> > > > is >> > > >> >> > > >> > > > >> not >> > > >> >> > > >> > > > >> >> > ok!) be handled by the monitoring service >> > > >> triggering >> > > >> >> an >> > > >> >> > > >> > explicit >> > > >> >> > > >> > > > >> pause? >> > > >> >> > > >> > > > >> >> > I.e. section 2 should be updated to say >> > > "External >> > > >> >> > service >> > > >> >> > > >> > detects >> > > >> >> > > >> > > > >> service >> > > >> >> > > >> > > > >> >> > health problems and pauses the update" and >> > > section >> > > >> 4 >> > > >> >> > > becomes >> > > >> >> > > >> > the >> > > >> >> > > >> > > > >> current >> > > >> >> > > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not be >> > > received >> > > >> >> the >> > > >> >> > > >> > scheduler >> > > >> >> > > >> > > > >> pauses >> > > >> >> > > >> > > > >> >> > the update."). >> > > >> >> > > >> > > > >> >> > >> > > >> >> > > >> > > > >> >> > I agree that it's unsafe to to resume >> updates >> > > after >> > > >> >> > > >> receiving a >> > > >> >> > > >> > > > >> heartbeat >> > > >> >> > > >> > > > >> >> > after previously pausing due to a missed >> > > >> heartbeat. In >> > > >> >> > > that >> > > >> >> > > >> > > > scenario >> > > >> >> > > >> > > > >> I'd >> > > >> >> > > >> > > > >> >> > think we'd want an explicit resumeJobUpdate. >> > If >> > > the >> > > >> >> > > scenario >> > > >> >> > > >> > > we're >> > > >> >> > > >> > > > >> trying >> > > >> >> > > >> > > > >> >> > to handle is *never* received a heartbeat, >> > > that's a >> > > >> >> > > separate >> > > >> >> > > >> > > > matter, >> > > >> >> > > >> > > > >> in >> > > >> >> > > >> > > > >> >> > that case unpausing upon receiving the first >> > > >> heartbeat >> > > >> >> > > would >> > > >> >> > > >> > make >> > > >> >> > > >> > > > >> sense, >> > > >> >> > > >> > > > >> >> > but it feels like that complicates things >> > quite >> > > a >> > > >> bit >> > > >> >> > > (now we >> > > >> >> > > >> > > need >> > > >> >> > > >> > > > to >> > > >> >> > > >> > > > >> >> > differentiate between heartbeat #1 and >> > hearbeat >> > > >> #N). >> > > >> >> > > >> > > > >> >> > >> > > >> >> > > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill >> Farner < >> > > >> >> > > >> > wfar...@apache.org <javascript:;> >> > > >> >> > > >> > > > >> > > >> >> > > >> > > > >> wrote: >> > > >> >> > > >> > > > >> >> > >> > > >> >> > > >> > > > >> >> >> What is the guidance for deploying while >> the >> > > >> >> heartbeat >> > > >> >> > > >> service >> > > >> >> > > >> > > is >> > > >> >> > > >> > > > >> >> broken? >> > > >> >> > > >> > > > >> >> >> I think i know the answer, but it's >> important >> > > to >> > > >> >> spell >> > > >> >> > > out. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> > Create a new coordinated job update in a >> > > paused >> > > >> >> > > >> > > > >> (ROLL_FORWARD_PAUSED) >> > > >> >> > > >> > > > >> >> >> > state to avoid any progress until the >> first >> > > >> >> heartbeat >> > > >> >> > > call >> > > >> >> > > >> > > > arrives. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> I'm not sold on this being ultimately >> > > >> beneficial. In >> > > >> >> > the >> > > >> >> > > >> > worst >> > > >> >> > > >> > > > case, >> > > >> >> > > >> > > > >> >> >> impact is still limited by the health check >> > > >> >> threshold. >> > > >> >> > > >> Seems >> > > >> >> > > >> > > like >> > > >> >> > > >> > > > >> >> >> premature optimization at best, and an odd >> > one >> > > if >> > > >> we >> > > >> >> > > proceed >> > > >> >> > > >> > > > without >> > > >> >> > > >> > > > >> a >> > > >> >> > > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate >> RPC. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> Allow resuming of the >> > > paused-due-to-no-heartbeat >> > > >> >> update >> > > >> >> > > via >> > > >> >> > > >> a >> > > >> >> > > >> > > > >> >> >> > resumeJobUpdate call. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> Are heartbeats required while rolling back? >> > If >> > > >> so, >> > > >> >> > that >> > > >> >> > > >> might >> > > >> >> > > >> > > > impact >> > > >> >> > > >> > > > >> >> the >> > > >> >> > > >> > > > >> >> >> design here and in other places. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> Allow resuming of the >> > > paused-due-to-no-heartbeat >> > > >> >> update >> > > >> >> > > via >> > > >> >> > > >> a >> > > >> >> > > >> > > > fresh >> > > >> >> > > >> > > > >> >> >> > heartbeatJobUpdate call. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> The heratbeatJobUpdate RPC serves as an >> ACK, >> > > but >> > > >> we >> > > >> >> > don't >> > > >> >> > > >> > have a >> > > >> >> > > >> > > > >> NACK. >> > > >> >> > > >> > > > >> >> If >> > > >> >> > > >> > > > >> >> >> we are going to let lack-of-ACK serve as >> the >> > > >> NACK, i >> > > >> >> > > don't >> > > >> >> > > >> > think >> > > >> >> > > >> > > > it's >> > > >> >> > > >> > > > >> >> safe >> > > >> >> > > >> > > > >> >> >> to resume when we receive another ACK. In >> > > other >> > > >> >> > words, a >> > > >> >> > > >> > > service >> > > >> >> > > >> > > > >> >> toggling >> > > >> >> > > >> > > > >> >> >> unhealthy might not be deemed safe to >> > proceed. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> Perhaps just sending OK (or a NOOP >> > equivalent) >> > > in >> > > >> >> case >> > > >> >> > > of a >> > > >> >> > > >> > > > >> user-paused >> > > >> >> > > >> > > > >> >> job >> > > >> >> > > >> > > > >> >> >> > update would make more sense as there is >> > > nothing >> > > >> >> > > >> monitoring >> > > >> >> > > >> > > > service >> > > >> >> > > >> > > > >> >> could >> > > >> >> > > >> > > > >> >> >> > do in that case. This should work fine >> with >> > > >> >> > > pause/resume >> > > >> >> > > >> > > > >> >> -aware/-agnostic >> > > >> >> > > >> > > > >> >> >> > monitoring service implementation. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> This seems reasonable to me - heartbeats >> for >> > a >> > > >> paused >> > > >> >> > > update >> > > >> >> > > >> > > > should >> > > >> >> > > >> > > > >> not >> > > >> >> > > >> > > > >> >> >> pose a risk, but can be safely ignored. >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> -=Bill >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim >> > > >> Khutornenko < >> > > >> >> > > >> > > > >> ma...@apache.org <javascript:;>> >> > > >> >> > > >> > > > >> >> >> wrote: >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> > Agreed. That would be a logical >> > > generalization >> > > >> of >> > > >> >> the >> > > >> >> > > post >> > > >> >> > > >> > > > failover >> > > >> >> > > >> > > > >> >> >> > behavior. >> > > >> >> > > >> > > > >> >> >> > >> > > >> >> > > >> > > > >> >> >> > I have updated the above document with >> the >> > > >> >> following >> > > >> >> > > >> > changes: >> > > >> >> > > >> > > > >> >> >> > - Reply with PAUSED any time a job was >> > > paused by >> > > >> >> > user; >> > > >> >> > > >> > > > >> >> >> > - Start in paused state by default. >> > > >> >> > > >> > > > >> >> >> > >> > > >> >> > > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin >> > > Sweeney >> > > >> < >> > > >> >> > > >> > > > >> kevi...@apache.org <javascript:;>> >> > > >> >> > > >> > > > >> >> >> > wrote: >> > > >> >> > > >> > > > >> >> >> > > The doc mentioned that the scheduler >> will >> > > >> start >> > > >> >> an >> > > >> >> > > >> update >> > > >> >> > > >> > > > >> subject to >> > > >> >> > > >> > > > >> >> >> the >> > > >> >> > > >> > > > >> >> >> > > heartbeat countdown, and if it doesn't >> > > >> receive a >> > > >> >> > > >> heartbeat >> > > >> >> > > >> > > it >> > > >> >> > > >> > > > >> will >> > > >> >> > > >> > > > >> >> >> pause >> > > >> >> > > >> > > > >> >> >> > > the update. Why not start with the >> update >> > > >> >> > > >> > > > >> >> paused-due-to-no-heartbeat to >> > > >> >> > > >> > > > >> >> >> > > fail-fast any connectivity issues >> between >> > > the >> > > >> >> > service >> > > >> >> > > >> > > > providing >> > > >> >> > > >> > > > >> the >> > > >> >> > > >> > > > >> >> >> > > heartbeats and the scheduler? >> > > >> >> > > >> > > > >> >> >> > > >> > > >> >> > > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim >> > > >> >> > Khutornenko < >> > > >> >> > > >> > > > >> >> ma...@apache.org <javascript:;>> >> > > >> >> > > >> > > > >> >> >> > > wrote: >> > > >> >> > > >> > > > >> >> >> > > >> > > >> >> > > >> > > > >> >> >> > >> Hi all, >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> We are proposing a new feature for the >> > > >> scheduler >> > > >> >> > > >> updater, >> > > >> >> > > >> > > > which >> > > >> >> > > >> > > > >> you >> > > >> >> > > >> > > > >> >> >> > >> may find helpful. >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> I have posed a brief feature summary >> > here: >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> > > >> >> > > >> > > > >> >> > > >> >> > > >> > > > >> > > >> >> > > >> > > >> > > >> >> > > >> > >> > > >> >> > > >> >> > > >> >> > > >> > > >> >> > >> > > >> >> >> > > >> >> > > >> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> Please, reply with your >> > > >> >> > feedback/concerns/comments. >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> Thanks, >> > > >> >> > > >> > > > >> >> >> > >> Maxim >> > > >> >> > > >> > > > >> >> >> > >> >> > > >> >> > > >> > > > >> >> >> > >> > > >> >> > > >> > > > >> >> >> >> > > >> >> > > >> > > > >> >> >> > > >> >> > > >> > > > >> >> > > >> >> > > >> > > > >> > > >> >> > > >> > > >> > > >> >> > > >> > >> > > >> >> > > >> >> > > >> >> > > >> > > >> >> > >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> >> -- >> > > >> >> Kevin Sweeney >> > > >> >> @kts >> > > >> >> >> > > >> >> > > >> > >> > >> > -- >> > -=Bill >> > >> > > > > -- > Kevin Sweeney > @kts