It is certainly possible to add new state or a status message but I don't think it's a blocker for the first iteration. Provided there is enough demand a state/message could be synthesized during the 'get' call based on the volatile state.
On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin <da...@dmclaughlin.com> wrote: > +1 for pause being explicit RPC pauses, but does it really add complexity > to just add a new state (WAITING?) when no heartbeat is sent? Not being > able to see that an update was blocked because of a lack of heartbeat seems > like a missing feature. > > On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko <ma...@apache.org> wrote: > >> +1. Updated the doc: >> >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md >> >> On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner <wfar...@apache.org> wrote: >> > +1 to the scheduler not proceeding on an update when heartbeats are >> absent, >> > and requiring the heartbeat service to explicitly call pauseJobUpdate >> when >> > it detects problems. >> > >> > -=Bill >> > >> > On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney >> <kswee...@twitter.com.invalid >> >> wrote: >> > >> >> Chatted with Maxim and Bill, I think we figured it out >> >> >> >> I think the confusion stems from the fact that there are two types of >> >> pauses in this system, explicit, persisted pauses generated by the >> >> pauseJobUpdate RPC and implicit, volatile pauses caused due to the >> absence >> >> of a sufficiently fresh heartbeat (such as in the case of a network >> >> partition). >> >> >> >> In case a monitoring service detects a problem it should call the >> explicit >> >> pauseJobUpdate RPC, which will cause a state change that requires an >> >> explicit resumeJobUpdate RPC to resume. That feature already exists. >> >> >> >> But, we need one more thing to make this reliable - heartbeats to >> protect >> >> against network partitions between the scheduler and the monitoring >> >> service. These can be volatile and lightweight - the scheduler just >> checks >> >> for a sufficiently fresh heartbeat before it performs an update action, >> and >> >> if none is present it simply refuses to perform the action. If the >> >> partition heals a new heartbeat will arrive (if the update being >> monitored >> >> should still be allowed to proceed) and the scheduler will allow the >> update >> >> to proceed. >> >> >> >> >> >> On Wed, Oct 15, 2014 at 11:56 AM, Bill Farner <wfar...@apache.org> >> wrote: >> >> >> >> > I think we should assess that after building the rest of the feature. >> >> IIUC >> >> > the rest of the code doesn't care if the update is initially paused. >> >> > >> >> > -=Bill >> >> > >> >> > On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko <ma...@apache.org >> > >> >> > wrote: >> >> > >> >> > > Can we get a consensus here? Looks like the only sticky point left >> is >> >> > > around starting an update in paused vs. non-paused state. I can >> argue >> >> > > either way as it's easy to add later if needed. >> >> > > >> >> > > On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner <wfar...@apache.org> >> >> wrote: >> >> > > > I'm not arguing against the merits of the approach. Just feeling >> out >> >> > > > whether that should be done _after_ the rest of the heartbeat >> >> support. >> >> > > > Seems like it can be cleanly added at the end to get something >> usable >> >> > > > earlier. >> >> > > > >> >> > > > -=Bill >> >> > > > >> >> > > > On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney < >> kevi...@apache.org> >> >> > > wrote: >> >> > > > >> >> > > >> I'm +1 for using lack of heartbeats as a uniform >> >> unknown-or-unhealthy >> >> > > >> signal, and punting on a more complex NACK signal (which we'd >> have >> >> to >> >> > > >> reliably persist). >> >> > > >> >> >> > > >> I think the only disagreement in this thread is whether the >> default >> >> > > state >> >> > > >> for a new update should be running or waiting-for-heartbeat. I >> think >> >> > > >> waiting for a heartbeat is not only a more correct implementation >> >> (no >> >> > > risk >> >> > > >> of acting after a failover but before the heartbeat timeout) but >> >> > > simpler to >> >> > > >> implement (initialize the PulseMonitor data structure as empty >> >> rather >> >> > > than >> >> > > >> with a synthetic heartbeat). >> >> > > >> >> >> > > >> From an API consumer perspective the sequence is: >> >> > > >> >> >> > > >> 1. API client sends a startUpdate RPC to the scheduler >> >> > > >> 2. API client receives an OK response, then arranges for >> something >> >> to >> >> > > call >> >> > > >> heartbeat with that updateId on some interval >> >> > > >> 3. Whatever is supposed to send heartbeats sends one immediately, >> >> then >> >> > > >> starts sending them on some smaller interval >> >> > > >> >> >> > > >> Waiting for the first heartbeat ensures that this sequence has >> been >> >> > > >> completed successfully, while not waiting for it only ensure that >> >> > step 1 >> >> > > >> has happened. >> >> > > >> >> >> > > >> >> >> > > >> On Tue, Oct 14, 2014 at 12:18 PM, Bill Farner < >> wfar...@apache.org> >> >> > > wrote: >> >> > > >> >> >> > > >> > Wait - simpler solution than what? We're talking about not >> doing >> >> > > either. >> >> > > >> > >> >> > > >> > -=Bill >> >> > > >> > >> >> > > >> > On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney < >> >> kevi...@apache.org >> >> > > >> >> > > >> > wrote: >> >> > > >> > >> >> > > >> > > I think waiting for the first heartbeat before taking any >> action >> >> > is >> >> > > the >> >> > > >> > > simpler solution here as it allows the implementation to be >> >> > entirely >> >> > > >> > > soft-state and still catches the bugs I described. >> >> > > >> > > >> >> > > >> > > The implementation is just PulseMonitorImpl<UpdateId> - >> >> heartbeat >> >> > > calls >> >> > > >> > > pulse and mutation operations check isAlive. I think the code >> >> > might >> >> > > >> > > actually work as-is. >> >> > > >> > > >> >> > > >> > > On Tue, Oct 14, 2014 at 12:11 PM, Maxim Khutornenko < >> >> > > ma...@apache.org> >> >> > > >> > > wrote: >> >> > > >> > > >> >> > > >> > > > Pausing update on creation seems like a logical approach >> when >> >> > > dealing >> >> > > >> > > > with inverted dependency model. I.e. updater is happy to >> act >> >> as >> >> > > long >> >> > > >> > > > as it's greenlighted by the external signal. It's also >> aligned >> >> > > with a >> >> > > >> > > > failover experience where coordinated updates are >> rehydrated >> >> in >> >> > > >> paused >> >> > > >> > > > state waiting for HB awakening. That said, I am OK punting >> it >> >> > for >> >> > > the >> >> > > >> > > > sake of simplicity for now. >> >> > > >> > > > >> >> > > >> > > > Kevin? >> >> > > >> > > > >> >> > > >> > > > On Tue, Oct 14, 2014 at 12:05 PM, Bill Farner < >> >> > wfar...@apache.org >> >> > > > >> >> > > >> > > wrote: >> >> > > >> > > > > If the goal is to reduce complexity now and add features >> >> > later, >> >> > > why >> >> > > >> > not >> >> > > >> > > > > nuke both for now - kick off the update right away, and >> let >> >> > > lack of >> >> > > >> > > > > heartbeats serve as a uniform "unknown or unhealthy" >> signal? >> >> > > >> > > > > >> >> > > >> > > > > -=Bill >> >> > > >> > > > > >> >> > > >> > > > > On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko < >> >> > > >> ma...@apache.org >> >> > > >> > > >> >> > > >> > > > wrote: >> >> > > >> > > > > >> >> > > >> > > > >> I am still +1 on the idea to have default paused state >> on >> >> > > >> creation. >> >> > > >> > I >> >> > > >> > > > >> think we could still differentiate between initially >> paused >> >> > and >> >> > > >> > timed >> >> > > >> > > > >> out states internally by looking at pause reason. It's >> >> quite >> >> > > >> > different >> >> > > >> > > > >> if we want to store explicit NACK reasons from the >> external >> >> > > >> service >> >> > > >> > > > >> though. That would require persistence and a bit more >> >> > > complicated >> >> > > >> > > > >> logic. >> >> > > >> > > > >> >> >> > > >> > > > >> On Mon, Oct 13, 2014 at 5:15 PM, Kevin Sweeney < >> >> > > >> kevi...@apache.org> >> >> > > >> > > > wrote: >> >> > > >> > > > >> > I like the idea of implementing this scheduler-side >> >> purely >> >> > > >> through >> >> > > >> > > > >> volatile >> >> > > >> > > > >> > state, but the lack of feedback (generic vs specific >> >> error >> >> > > >> > messages >> >> > > >> > > > when >> >> > > >> > > > >> an >> >> > > >> > > > >> > update is paused) leaves something to be desired. >> Maybe >> >> we >> >> > > can >> >> > > >> > > address >> >> > > >> > > > >> that >> >> > > >> > > > >> > with a metadata field in the initial call to >> startUpdate >> >> > > (with >> >> > > >> an >> >> > > >> > > > >> optional >> >> > > >> > > > >> > link to a page where one can get more rich information >> >> > about >> >> > > the >> >> > > >> > > > state of >> >> > > >> > > > >> > the monitor sending/not sending heartbeats). >> >> > > >> > > > >> > >> >> > > >> > > > >> > The main drawback is that we may have to wait a >> maximum >> >> of >> >> > > one >> >> > > >> > > > heartbeat >> >> > > >> > > > >> > interval to find out that an update should be paused. >> >> > > >> > > > >> > >> >> > > >> > > > >> > On Mon, Oct 13, 2014 at 4:55 PM, Maxim Khutornenko < >> >> > > >> > > ma...@apache.org> >> >> > > >> > > > >> wrote: >> >> > > >> > > > >> > >> >> > > >> > > > >> >> The main reason I preferred the lack-of-ACK approach >> >> over >> >> > an >> >> > > >> > > explicit >> >> > > >> > > > >> >> NACK one is simplicity. As Joshua pointed out there >> is >> >> > more >> >> > > >> state >> >> > > >> > > to >> >> > > >> > > > >> >> handle in that case. The lack-of-ACK model can be >> >> > completely >> >> > > >> > > > >> >> implemented in volatile memory sidestepping the >> >> persistent >> >> > > >> > storage >> >> > > >> > > > >> >> entirely. With the NACK we would need to reliably >> >> persist >> >> > > >> > external >> >> > > >> > > > >> >> service call reasons to survive scheduler failovers. >> >> Not a >> >> > > huge >> >> > > >> > > > >> >> challenge but something to keep in mind. >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> I still think the simplicity/reliability tradeoff is >> >> > > acceptable >> >> > > >> > > here >> >> > > >> > > > >> >> if we rely on external service to abort heartbeats in >> >> case >> >> > > of a >> >> > > >> > > > health >> >> > > >> > > > >> >> alert fired. This can be explicitly documented as an >> >> > > external >> >> > > >> > > > >> >> integration requirement. However, If the consensus >> is to >> >> > go >> >> > > a >> >> > > >> > more >> >> > > >> > > > >> >> reliable (though more complicated) NACK route I am >> happy >> >> > to >> >> > > >> > > > reconsider >> >> > > >> > > > >> >> the current proposal. >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> On Mon, Oct 13, 2014 at 3:50 PM, Joshua Cohen < >> >> > > >> > > > jco...@twopensource.com> >> >> > > >> > > > >> >> wrote: >> >> > > >> > > > >> >> > "The heratbeatJobUpdate RPC serves as an ACK, but >> we >> >> > don't >> >> > > >> > have a >> >> > > >> > > > >> NACK. >> >> > > >> > > > >> >> If >> >> > > >> > > > >> >> > we are going to let lack-of-ACK serve as the NACK, >> i >> >> > don't >> >> > > >> > think >> >> > > >> > > > it's >> >> > > >> > > > >> >> safe >> >> > > >> > > > >> >> > to resume when we receive another ACK. In other >> >> words, >> >> > a >> >> > > >> > service >> >> > > >> > > > >> >> toggling >> >> > > >> > > > >> >> > unhealthy might not be deemed safe to proceed." >> >> > > >> > > > >> >> > >> >> > > >> > > > >> >> > Lack-of-ACK is the scenario where connectivity >> between >> >> > the >> >> > > >> > > monitor >> >> > > >> > > > and >> >> > > >> > > > >> >> the >> >> > > >> > > > >> >> > scheduler is unavailable. Shouldn't the NACK >> scenario >> >> > > >> > (everything >> >> > > >> > > > is >> >> > > >> > > > >> not >> >> > > >> > > > >> >> > ok!) be handled by the monitoring service >> triggering >> >> an >> >> > > >> > explicit >> >> > > >> > > > >> pause? >> >> > > >> > > > >> >> > I.e. section 2 should be updated to say "External >> >> > service >> >> > > >> > detects >> >> > > >> > > > >> service >> >> > > >> > > > >> >> > health problems and pauses the update" and section >> 4 >> >> > > becomes >> >> > > >> > the >> >> > > >> > > > >> current >> >> > > >> > > > >> >> > section 2 (i.e. "Should a heartbeat not be received >> >> the >> >> > > >> > scheduler >> >> > > >> > > > >> pauses >> >> > > >> > > > >> >> > the update."). >> >> > > >> > > > >> >> > >> >> > > >> > > > >> >> > I agree that it's unsafe to to resume updates after >> >> > > >> receiving a >> >> > > >> > > > >> heartbeat >> >> > > >> > > > >> >> > after previously pausing due to a missed >> heartbeat. In >> >> > > that >> >> > > >> > > > scenario >> >> > > >> > > > >> I'd >> >> > > >> > > > >> >> > think we'd want an explicit resumeJobUpdate. If the >> >> > > scenario >> >> > > >> > > we're >> >> > > >> > > > >> trying >> >> > > >> > > > >> >> > to handle is *never* received a heartbeat, that's a >> >> > > separate >> >> > > >> > > > matter, >> >> > > >> > > > >> in >> >> > > >> > > > >> >> > that case unpausing upon receiving the first >> heartbeat >> >> > > would >> >> > > >> > make >> >> > > >> > > > >> sense, >> >> > > >> > > > >> >> > but it feels like that complicates things quite a >> bit >> >> > > (now we >> >> > > >> > > need >> >> > > >> > > > to >> >> > > >> > > > >> >> > differentiate between heartbeat #1 and hearbeat >> #N). >> >> > > >> > > > >> >> > >> >> > > >> > > > >> >> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner < >> >> > > >> > wfar...@apache.org >> >> > > >> > > > >> >> > > >> > > > >> wrote: >> >> > > >> > > > >> >> > >> >> > > >> > > > >> >> >> What is the guidance for deploying while the >> >> heartbeat >> >> > > >> service >> >> > > >> > > is >> >> > > >> > > > >> >> broken? >> >> > > >> > > > >> >> >> I think i know the answer, but it's important to >> >> spell >> >> > > out. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> > Create a new coordinated job update in a paused >> >> > > >> > > > >> (ROLL_FORWARD_PAUSED) >> >> > > >> > > > >> >> >> > state to avoid any progress until the first >> >> heartbeat >> >> > > call >> >> > > >> > > > arrives. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> I'm not sold on this being ultimately >> beneficial. In >> >> > the >> >> > > >> > worst >> >> > > >> > > > case, >> >> > > >> > > > >> >> >> impact is still limited by the health check >> >> threshold. >> >> > > >> Seems >> >> > > >> > > like >> >> > > >> > > > >> >> >> premature optimization at best, and an odd one if >> we >> >> > > proceed >> >> > > >> > > > without >> >> > > >> > > > >> a >> >> > > >> > > > >> >> >> 'NACK' signal via the heartbeatJobUpdate RPC. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat >> >> update >> >> > > via >> >> > > >> a >> >> > > >> > > > >> >> >> > resumeJobUpdate call. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> Are heartbeats required while rolling back? If >> so, >> >> > that >> >> > > >> might >> >> > > >> > > > impact >> >> > > >> > > > >> >> the >> >> > > >> > > > >> >> >> design here and in other places. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> Allow resuming of the paused-due-to-no-heartbeat >> >> update >> >> > > via >> >> > > >> a >> >> > > >> > > > fresh >> >> > > >> > > > >> >> >> > heartbeatJobUpdate call. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> The heratbeatJobUpdate RPC serves as an ACK, but >> we >> >> > don't >> >> > > >> > have a >> >> > > >> > > > >> NACK. >> >> > > >> > > > >> >> If >> >> > > >> > > > >> >> >> we are going to let lack-of-ACK serve as the >> NACK, i >> >> > > don't >> >> > > >> > think >> >> > > >> > > > it's >> >> > > >> > > > >> >> safe >> >> > > >> > > > >> >> >> to resume when we receive another ACK. In other >> >> > words, a >> >> > > >> > > service >> >> > > >> > > > >> >> toggling >> >> > > >> > > > >> >> >> unhealthy might not be deemed safe to proceed. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> Perhaps just sending OK (or a NOOP equivalent) in >> >> case >> >> > > of a >> >> > > >> > > > >> user-paused >> >> > > >> > > > >> >> job >> >> > > >> > > > >> >> >> > update would make more sense as there is nothing >> >> > > >> monitoring >> >> > > >> > > > service >> >> > > >> > > > >> >> could >> >> > > >> > > > >> >> >> > do in that case. This should work fine with >> >> > > pause/resume >> >> > > >> > > > >> >> -aware/-agnostic >> >> > > >> > > > >> >> >> > monitoring service implementation. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> This seems reasonable to me - heartbeats for a >> paused >> >> > > update >> >> > > >> > > > should >> >> > > >> > > > >> not >> >> > > >> > > > >> >> >> pose a risk, but can be safely ignored. >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> -=Bill >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> On Mon, Oct 13, 2014 at 12:48 PM, Maxim >> Khutornenko < >> >> > > >> > > > >> ma...@apache.org> >> >> > > >> > > > >> >> >> wrote: >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> > Agreed. That would be a logical generalization >> of >> >> the >> >> > > post >> >> > > >> > > > failover >> >> > > >> > > > >> >> >> > behavior. >> >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > I have updated the above document with the >> >> following >> >> > > >> > changes: >> >> > > >> > > > >> >> >> > - Reply with PAUSED any time a job was paused by >> >> > user; >> >> > > >> > > > >> >> >> > - Start in paused state by default. >> >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney >> < >> >> > > >> > > > >> kevi...@apache.org> >> >> > > >> > > > >> >> >> > wrote: >> >> > > >> > > > >> >> >> > > The doc mentioned that the scheduler will >> start >> >> an >> >> > > >> update >> >> > > >> > > > >> subject to >> >> > > >> > > > >> >> >> the >> >> > > >> > > > >> >> >> > > heartbeat countdown, and if it doesn't >> receive a >> >> > > >> heartbeat >> >> > > >> > > it >> >> > > >> > > > >> will >> >> > > >> > > > >> >> >> pause >> >> > > >> > > > >> >> >> > > the update. Why not start with the update >> >> > > >> > > > >> >> paused-due-to-no-heartbeat to >> >> > > >> > > > >> >> >> > > fail-fast any connectivity issues between the >> >> > service >> >> > > >> > > > providing >> >> > > >> > > > >> the >> >> > > >> > > > >> >> >> > > heartbeats and the scheduler? >> >> > > >> > > > >> >> >> > > >> >> > > >> > > > >> >> >> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim >> >> > Khutornenko < >> >> > > >> > > > >> >> ma...@apache.org> >> >> > > >> > > > >> >> >> > > wrote: >> >> > > >> > > > >> >> >> > > >> >> > > >> > > > >> >> >> > >> Hi all, >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> We are proposing a new feature for the >> scheduler >> >> > > >> updater, >> >> > > >> > > > which >> >> > > >> > > > >> you >> >> > > >> > > > >> >> >> > >> may find helpful. >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> I have posed a brief feature summary here: >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> > > >> > > > >> >> > > >> > > >> >> > > >> > >> >> > > >> >> >> > > >> >> > >> >> >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> Please, reply with your >> >> > feedback/concerns/comments. >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> Thanks, >> >> > > >> > > > >> >> >> > >> Maxim >> >> > > >> > > > >> >> >> > >> >> >> > > >> > > > >> >> >> > >> >> > > >> > > > >> >> >> >> >> > > >> > > > >> >> >> >> > > >> > > > >> >> >> > > >> > > > >> >> > > >> > > >> >> > > >> > >> >> > > >> >> >> > > >> >> > >> >> >> >> >> >> >> >> -- >> >> Kevin Sweeney >> >> @kts >> >> >>