> the generic paused-waiting-for-heartbeat message will be quickly replaced
> by "high 502 rate"

From the doc Maxim linked, i don't believe that's the plan:

> External service detects service health problems and stops heartbeats
> Heartbeat timeout occurs. Scheduler pauses the update.

-=Bill

On Mon, Oct 13, 2014 at 3:13 PM, Kevin Sweeney <kevi...@apache.org> wrote:

> If the service sending the heartbeat RPC is working, the generic
> paused-waiting-for-heartbeat message will be quickly replaced by "high 502
> rate". If it's not working (or has connectivity issues) we at least won't
> give a false sense of progress.
>
> On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner <wfar...@apache.org> wrote:
>
> > Re: user experience, NACK-via-timeout fails here as well.
> >
> > "PAUSED - Heartbeat not received in 60s" is objectively worse than
> > "PAUSED - Heartbeat failed: high 502 rate".
> >
> > This is part of the impedance mismatch i'm calling out.
> >
> > -=Bill
> >
> > On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney <kevi...@apache.org> wrote:
> >
> > > On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner <wfar...@apache.org> wrote:
> > >
> > > > What is the guidance for deploying while the heartbeat service is
> > > > broken? I think i know the answer, but it's important to spell out.
> > > >
> > > > > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED)
> > > > > state to avoid any progress until the first heartbeat call arrives.
> > > >
> > > > I'm not sold on this being ultimately beneficial. In the worst case,
> > > > impact is still limited by the health check threshold. Seems like
> > > > premature optimization at best, and an odd one if we proceed without a
> > > > 'NACK' signal via the heartbeatJobUpdate RPC.
> > >
> > > The benefit is huge IMO for quickly detecting connectivity issues between
> > > the scheduler and the heartbeat service.
> > > There's a lot more information
> > > contained in the first successful heartbeat than the second, plus we can
> > > show the user a message like "PAUSED - Waiting for heartbeat". This is a
> > > better user experience than waiting for a timeout before revealing that
> > > progress will never be made.
> > >
> > > > > Allow resuming of the paused-due-to-no-heartbeat update via a
> > > > > resumeJobUpdate call.
> > > >
> > > > Are heartbeats required while rolling back? If so, that might impact the
> > > > design here and in other places.
> > > >
> > > > > Allow resuming of the paused-due-to-no-heartbeat update via a fresh
> > > > > heartbeatJobUpdate call.
> > > >
> > > > The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If
> > > > we are going to let lack-of-ACK serve as the NACK, i don't think it's safe
> > > > to resume when we receive another ACK. In other words, a service toggling
> > > > unhealthy might not be deemed safe to proceed.
> > > >
> > > > > Perhaps just sending OK (or a NOOP equivalent) in case of a user-paused
> > > > > job update would make more sense as there is nothing monitoring service
> > > > > could do in that case. This should work fine with pause/resume
> > > > > -aware/-agnostic monitoring service implementation.
> > > >
> > > > This seems reasonable to me - heartbeats for a paused update should not
> > > > pose a risk, but can be safely ignored.
> > > >
> > > > -=Bill
> > > >
> > > > On Mon, Oct 13, 2014 at 12:48 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> > > >
> > > > > Agreed. That would be a logical generalization of the post failover
> > > > > behavior.
> > > > >
> > > > > I have updated the above document with the following changes:
> > > > > - Reply with PAUSED any time a job was paused by user;
> > > > > - Start in paused state by default.
> > > > >
> > > > > On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney <kevi...@apache.org> wrote:
> > > > > > The doc mentioned that the scheduler will start an update subject to
> > > > > > the heartbeat countdown, and if it doesn't receive a heartbeat it will
> > > > > > pause the update. Why not start with the update
> > > > > > paused-due-to-no-heartbeat to fail-fast any connectivity issues
> > > > > > between the service providing the heartbeats and the scheduler?
> > > > > >
> > > > > > On Fri, Oct 10, 2014 at 12:47 PM, Maxim Khutornenko <ma...@apache.org> wrote:
> > > > > >
> > > > > >> Hi all,
> > > > > >>
> > > > > >> We are proposing a new feature for the scheduler updater, which you
> > > > > >> may find helpful.
> > > > > >>
> > > > > >> I have posted a brief feature summary here:
> > > > > >>
> > > > > >> https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md
> > > > > >>
> > > > > >> Please reply with your feedback/concerns/comments.
> > > > > >>
> > > > > >> Thanks,
> > > > > >> Maxim
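
For anyone following the thread, here is a rough sketch of the behaviour being debated, as I read it: the update starts paused until the first heartbeat arrives, a missed heartbeat pauses it again, and a heartbeat for a user-paused update is answered with PAUSED and otherwise ignored. This is not Aurora code. The class, method, and state names below are hypothetical stand-ins (the thread only names the heartbeatJobUpdate and resumeJobUpdate RPCs and the ROLL_FORWARD_PAUSED state), and resuming on a fresh heartbeat is exactly the point Bill questions above, so treat that part as one option rather than a settled design.

```python
from enum import Enum


class State(Enum):
    # Hypothetical labels; the thread only mentions ROLL_FORWARD_PAUSED.
    ROLLING_FORWARD = "ROLLING_FORWARD"
    AWAITING_HEARTBEAT = "PAUSED - Waiting for heartbeat"
    USER_PAUSED = "PAUSED - Paused by user"


class CoordinatedUpdate:
    """Toy model of a heartbeat-gated update (illustrative only)."""

    def __init__(self, timeout_secs):
        self.timeout_secs = timeout_secs
        # Start paused so no progress is made before the first heartbeat.
        self.state = State.AWAITING_HEARTBEAT
        self.last_heartbeat = None

    def heartbeat(self, now):
        """Stand-in for the heartbeatJobUpdate RPC; returns the reply."""
        if self.state is State.USER_PAUSED:
            # Nothing for the monitoring service to do; ignore the ACK.
            return "PAUSED"
        # Assumption: a fresh heartbeat unblocks a heartbeat-paused update.
        self.last_heartbeat = now
        self.state = State.ROLLING_FORWARD
        return "OK"

    def tick(self, now):
        """Scheduler-side check: pause if the heartbeat countdown expired."""
        expired = (self.state is State.ROLLING_FORWARD
                   and now - self.last_heartbeat > self.timeout_secs)
        if expired:
            self.state = State.AWAITING_HEARTBEAT
        return self.state

    def pause(self):
        """User-initiated pause."""
        self.state = State.USER_PAUSED

    def resume(self, now):
        """Stand-in for resumeJobUpdate.

        Assumption: resuming restarts the heartbeat countdown.
        """
        self.last_heartbeat = now
        self.state = State.ROLLING_FORWARD
```

Time is passed in explicitly (the `now` arguments) so the sketch stays deterministic. Note that it has no NACK path: the only way the monitoring service can stop the update is to go silent until `tick` sees the timeout, which is the impedance mismatch raised earlier in the thread.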