If the service sending the heartbeat RPC is working, the generic paused-waiting-for-heartbeat message will be quickly replaced by "high 502 rate". If it's not working (or has connectivity issues) we at least won't give a false sense of progress.
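To make the behavior under discussion concrete, here is a minimal sketch of the proposal as I read it from this thread; it is not the scheduler's actual implementation. The update starts in ROLL_FORWARD_PAUSED, the first heartbeat unblocks it, a missed heartbeat pauses it again, and a user-paused update answers heartbeats with PAUSED. The class, the method names, the ROLLING_FORWARD state name, the "Paused by user" message, and the 60-second default are assumptions made for illustration. Resuming on a fresh heartbeat is the proposed behavior, and whether that is safe is still the open question in this thread.

```python
from enum import Enum


class State(Enum):
    ROLLING_FORWARD = "ROLLING_FORWARD"          # hypothetical name
    ROLL_FORWARD_PAUSED = "ROLL_FORWARD_PAUSED"  # name taken from the thread


class CoordinatedUpdate:
    """Hypothetical model of a heartbeat-coordinated job update."""

    def __init__(self, timeout_secs=60.0):
        self.timeout_secs = timeout_secs
        # Start paused so no progress is made before the first heartbeat.
        self.state = State.ROLL_FORWARD_PAUSED
        self.paused_by_user = False
        self.last_heartbeat = None
        self.message = "PAUSED - Waiting for heartbeat"

    def heartbeat(self, now):
        """Stand-in for the heartbeatJobUpdate RPC; returns the reply."""
        if self.paused_by_user:
            # Nothing the monitoring service can do; the call is ignored.
            return "PAUSED"
        self.last_heartbeat = now
        self.state = State.ROLLING_FORWARD
        self.message = "ROLLING FORWARD"
        return "OK"

    def tick(self, now):
        """Periodic check: pause if the heartbeat countdown has expired."""
        if (self.state is State.ROLLING_FORWARD
                and now - self.last_heartbeat > self.timeout_secs):
            self.state = State.ROLL_FORWARD_PAUSED
            self.message = (
                f"PAUSED - Heartbeat not received in {self.timeout_secs:g}s")

    def pause(self):
        """User-initiated pause."""
        self.paused_by_user = True
        self.state = State.ROLL_FORWARD_PAUSED
        self.message = "PAUSED - Paused by user"

    def resume(self, now):
        """Stand-in for resumeJobUpdate; restarts the countdown."""
        self.paused_by_user = False
        self.last_heartbeat = now
        self.state = State.ROLLING_FORWARD
        self.message = "ROLLING FORWARD"
```

With this model, a heartbeat service that cannot reach the scheduler leaves the update at "PAUSED - Waiting for heartbeat" from the start, rather than letting it run until the first timeout.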

On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner <wfar...@apache.org> wrote:
> Re: user experience, NACK-via-timeout fails here as well.
>
> "PAUSED - Heartbeat not received in 60s" is objectively worse than
> "PAUSED - Heartbeat failed: high 502 rate".
>
> This is part of the impedance mismatch i'm calling out.
>
> -=Bill
>
> On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney <kevi...@apache.org> wrote:
>
> > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > What is the guidance for deploying while the heartbeat service is
> > > broken? I think i know the answer, but it's important to spell out.
> > >
> > > > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > > > state to avoid any progress until the first heartbeat call arrives.
> > >
> > > I'm not sold on this being ultimately beneficial. In the worst case,
> > > impact is still limited by the health check threshold. Seems like
> > > premature optimization at best, and an odd one if we proceed without a
> > > 'NACK' signal via the heartbeatJobUpdate RPC.
> >
> > The benefit is huge IMO for quickly detecting connectivity issues between
> > the scheduler and the heartbeat service. There's a lot more information
> > contained in the first successful heartbeat than the second, plus we can
> > show the user a message like "PAUSED - Waiting for heartbeat". This is a
> > better user experience than waiting for a timeout before revealing that
> > progress will never be made.
> >
> > > > Allow resuming of the paused-due-to-no-heartbeat update via a
> > > > resumeJobUpdate call.
> > >
> > > Are heartbeats required while rolling back? If so, that might impact
> > > the design here and in other places.
> > >
> > > > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > > > heartbeatJobUpdate call.
> > >
> > > The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK.
> > > If we are going to let lack-of-ACK serve as the NACK, i don't think
> > > it's safe to resume when we receive another ACK. In other words, a
> > > service toggling unhealthy might not be deemed safe to proceed.
> > >
> > > > Perhaps just sending OK (or a NOOP equivalent) in case of a
> > > > user-paused job update would make more sense as there is nothing
> > > > monitoring service could do in that case. This should work fine with
> > > > pause/resume -aware/-agnostic monitoring service implementation.
> > >
> > > This seems reasonable to me - heartbeats for a paused update should not
> > > pose a risk, but can be safely ignored.
> > >
> > > -=Bill
> > >
> > > On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org>
> > > wrote:
> > >
> > > > Agreed. That would be a logical generalization of the post failover
> > > > behavior.
> > > >
> > > > I have updated the above document with the following changes:
> > > > - Reply with PAUSED any time a job was paused by user;
> > > > - Start in paused state by default.
> > > >
> > > > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org>
> > > > wrote:
> > > > > The doc mentioned that the scheduler will start an update subject
> > > > > to the heartbeat countdown, and if it doesn't receive a heartbeat
> > > > > it will pause the update. Why not start with the update
> > > > > paused-due-to-no-heartbeat to fail-fast any connectivity issues
> > > > > between the service providing the heartbeats and the scheduler?
> > > > >
> > > > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko
> > > > > <ma...@apache.org> wrote:
> > > > >
> > > > >> Hi all,
> > > > >>
> > > > >> We are proposing a new feature for the scheduler updater, which
> > > > >> you may find helpful.
> > > > >>
> > > > >> I have posted a brief feature summary here:
> > > > >>
> > > > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > > >>
> > > > >> Please, reply with your feedback/concerns/comments.
> > > > >>
> > > > >> Thanks,
> > > > >> Maxim