On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org> wrote:
> What is the guidance for deploying while the heartbeat service is broken?
> I think I know the answer, but it's important to spell out.
>
> > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > state to avoid any progress until the first heartbeat call arrives.
>
> I'm not sold on this being ultimately beneficial. In the worst case,
> impact is still limited by the health check threshold. Seems like
> premature optimization at best, and an odd one if we proceed without a
> 'NACK' signal via the heartbeatJobUpdate RPC.

The benefit is huge IMO for quickly detecting connectivity issues between
the scheduler and the heartbeat service. There's a lot more information
contained in the first successful heartbeat than the second, plus we can
show the user a message like "PAUSED - Waiting for heartbeat". This is a
better user experience than waiting for a timeout before revealing that
progress will never be made.

> > Allow resuming of the paused-due-to-no-heartbeat update via a
> > resumeJobUpdate call.
>
> Are heartbeats required while rolling back? If so, that might impact the
> design here and in other places.

> > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > heartbeatJobUpdate call.
>
> The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
> we are going to let lack-of-ACK serve as the NACK, I don't think it's safe
> to resume when we receive another ACK. In other words, a service toggling
> unhealthy might not be deemed safe to proceed.

> > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused job
> > update would make more sense as there is nothing the monitoring service
> > could do in that case. This should work fine with pause/resume
> > -aware/-agnostic monitoring service implementation.
>
> This seems reasonable to me - heartbeats for a paused update should not
> pose a risk, but can be safely ignored.
>
> -=Bill
>
> On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
>
> > Agreed. That would be a logical generalization of the post failover
> > behavior.
> >
> > I have updated the above document with the following changes:
> > - Reply with PAUSED any time a job was paused by user;
> > - Start in paused state by default.
> >
> > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org>
> > wrote:
> > > The doc mentioned that the scheduler will start an update subject to the
> > > heartbeat countdown, and if it doesn't receive a heartbeat it will pause
> > > the update. Why not start with the update paused-due-to-no-heartbeat to
> > > fail-fast any connectivity issues between the service providing the
> > > heartbeats and the scheduler?
> > >
> > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <ma...@apache.org>
> > > wrote:
> > >
> > >> Hi all,
> > >>
> > >> We are proposing a new feature for the scheduler updater, which you
> > >> may find helpful.
> > >>
> > >> I have posted a brief feature summary here:
> > >>
> > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > >>
> > >> Please, reply with your feedback/concerns/comments.
> > >>
> > >> Thanks,
> > >> Maxim
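
To make the behavior debated in this thread concrete, here is a minimal
Python sketch of one possible reading, not the scheduler's actual code. Only
ROLL_FORWARD_PAUSED, heartbeatJobUpdate, resumeJobUpdate and the OK/PAUSED
replies come from the thread; the ROLLING_FORWARD state name, the class, the
tick() method, the pause_job_update() method and the resume_on_heartbeat flag
are assumptions made for illustration. The flag models the open question of
whether a fresh heartbeat alone may resume an update.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Pause(Enum):
    NONE = "not paused"
    NO_HEARTBEAT = "PAUSED - Waiting for heartbeat"
    USER = "paused by user"


@dataclass
class CoordinatedUpdate:
    """Hypothetical model of a heartbeat-coordinated job update."""
    timeout_secs: float
    resume_on_heartbeat: bool = True      # assumption: the point under debate
    pause: Pause = Pause.NO_HEARTBEAT     # start paused until the first heartbeat
    last_heartbeat: Optional[float] = None

    @property
    def state(self) -> str:
        # ROLL_FORWARD_PAUSED is named in the thread; ROLLING_FORWARD is assumed.
        return "ROLLING_FORWARD" if self.pause is Pause.NONE else "ROLL_FORWARD_PAUSED"

    def heartbeat_job_update(self, now: float) -> str:
        if self.pause is Pause.USER:
            return "PAUSED"               # user-paused: heartbeat is ignored
        self.last_heartbeat = now
        if self.pause is Pause.NO_HEARTBEAT and self.resume_on_heartbeat:
            self.pause = Pause.NONE
        return "OK"

    def tick(self, now: float) -> None:
        # Called periodically; pauses the update once the countdown expires.
        if self.pause is not Pause.NONE:
            return
        expired = (self.last_heartbeat is None
                   or now - self.last_heartbeat > self.timeout_secs)
        if expired:
            self.pause = Pause.NO_HEARTBEAT

    def pause_job_update(self) -> None:
        self.pause = Pause.USER

    def resume_job_update(self, now: float) -> None:
        self.pause = Pause.NONE
        self.last_heartbeat = now         # assumption: resume restarts the countdown


update = CoordinatedUpdate(timeout_secs=10)
print(update.state)                       # paused until the first heartbeat
print(update.heartbeat_job_update(now=0), update.state)
update.tick(now=11)                       # no heartbeat for more than 10s
print(update.state, "-", update.pause.value)
```

Constructing the update with resume_on_heartbeat=False gives the stricter
variant, where a lapse in heartbeats is treated as a NACK and only an explicit
resume_job_update() call lets the update proceed.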