Re: Proposal: External Update Coordination

2014-10-16 Thread Bill Farner
+1 -=Bill On Thu, Oct 16, 2014 at 11:21 AM, Maxim Khutornenko wrote: > Correct. The presence of the SessionKey does indeed mean that > heartbeats are going to be authenticated. Given that external service > has to solve authentication story to use pauseJobUpdate anyway, having > heartbeats auth

Re: Proposal: External Update Coordination

2014-10-16 Thread Maxim Khutornenko
Correct. The presence of the SessionKey does indeed mean that heartbeats are going to be authenticated. Given that external service has to solve authentication story to use pauseJobUpdate anyway, having heartbeats authenticated seems like a natural progression. Also, given our current admin thrift

Re: Proposal: External Update Coordination

2014-10-16 Thread Kevin Sweeney
I inferred that authentication was required due to the presence of a SessionKey in the RPC. Of course any authentication mechanism here could have serious scaling issues (barring something like HTTP basic auth in memory) On Thu, Oct 16, 2014 at 10:48 AM, Joshua Cohen wrote: > What are our though

Re: Proposal: External Update Coordination

2014-10-16 Thread Joshua Cohen
What are our thoughts about authentication with regards to heartbeats? It seems like they should be authenticated since there does exist the potential for a malicious actor to send its own heartbeats even if the real monitoring service has detected a problem and ceased sending heartbeats. I'm not s

Re: Proposal: External Update Coordination

2014-10-15 Thread Bill Farner
David - the plan is to synthesize the waiting state. Exactly how is not yet certain. On Wednesday, October 15, 2014, Maxim Khutornenko wrote: > It is certainly possible to add new state or a status message but I > don't think it's a blocker for the first iteration. Provided there is > enough de

Re: Proposal: External Update Coordination

2014-10-15 Thread Maxim Khutornenko
It is certainly possible to add new state or a status message but I don't think it's a blocker for the first iteration. Provided there is enough demand a state/message could be synthesized during the 'get' call based on the volatile state. On Wed, Oct 15, 2014 at 6:36 PM, David McLaughlin wrote:

Re: Proposal: External Update Coordination

2014-10-15 Thread David McLaughlin
+1 for pause being explicit RPC pauses, but does it really add complexity to just add a new state (WAITING?) when no heartbeat is sent? Not being able to see that an update was blocked because of a lack of heartbeat seems like a missing feature. On Wed, Oct 15, 2014 at 5:12 PM, Maxim Khutornenko

Re: Proposal: External Update Coordination

2014-10-15 Thread Maxim Khutornenko
+1. Updated the doc: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md On Wed, Oct 15, 2014 at 5:09 PM, Bill Farner wrote: > +1 to the scheduler not proceeding on an update when heartbeats are absent, > and requiring the heartbeat service to explicitly call paus

Re: Proposal: External Update Coordination

2014-10-15 Thread Bill Farner
+1 to the scheduler not proceeding on an update when heartbeats are absent, and requiring the heartbeat service to explicitly call pauseJobUpdate when it detects problems. -=Bill On Wed, Oct 15, 2014 at 4:59 PM, Kevin Sweeney wrote: > Chatted with Maxim and Bill, I think we figured it out > > I

Re: Proposal: External Update Coordination

2014-10-15 Thread Kevin Sweeney
Chatted with Maxim and Bill; I think we figured it out. I think the confusion stems from the fact that there are two types of pauses in this system: explicit, persisted pauses generated by the pauseJobUpdate RPC and implicit, volatile pauses caused by the absence of a sufficiently fresh heartbe
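One possible reading of the explicit-versus-implicit distinction above, as a minimal sketch. It assumes the explicit pause is a persisted flag and the implicit pause is derived from the time of the last heartbeat, which is held only in memory. The `EffectiveState` names (including `WAITING`, borrowed from the suggestion elsewhere in this thread), the `effective_state` helper, and the rule that an explicit pause takes precedence are all hypothetical, not part of the scheduler's API.

```python
from enum import Enum
from typing import Optional


class EffectiveState(Enum):
    """Hypothetical states a 'get' call could report for a coordinated update."""
    ROLLING_FORWARD = "rolling forward"
    PAUSED = "paused explicitly via pauseJobUpdate"
    WAITING = "blocked: no sufficiently fresh heartbeat"


def effective_state(
    explicitly_paused: bool,
    last_heartbeat_s: Optional[float],
    now_s: float,
    timeout_s: float,
) -> EffectiveState:
    """Combine the persisted pause flag with volatile heartbeat freshness.

    Assumption: an explicit pause always wins, so a fresh heartbeat
    never resumes an update that was paused by RPC.
    """
    if explicitly_paused:
        return EffectiveState.PAUSED
    # No heartbeat yet (for example right after creation or a failover),
    # or the last one is too old: the update is implicitly blocked.
    if last_heartbeat_s is None or now_s - last_heartbeat_s >= timeout_s:
        return EffectiveState.WAITING
    return EffectiveState.ROLLING_FORWARD
```

Under these assumptions the waiting state is never stored; it is recomputed on each query, so a scheduler restart simply leaves the update blocked until the next heartbeat arrives.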

Re: Proposal: External Update Coordination

2014-10-15 Thread Bill Farner
I think we should assess that after building the rest of the feature. IIUC the rest of the code doesn't care if the update is initially paused. -=Bill On Wed, Oct 15, 2014 at 11:50 AM, Maxim Khutornenko wrote: > Can we get a consensus here? Looks like the only sticky point left is > around sta

Re: Proposal: External Update Coordination

2014-10-15 Thread Maxim Khutornenko
Can we get a consensus here? Looks like the only sticking point left is around starting an update in paused vs. non-paused state. I can argue either way as it's easy to add later if needed. On Tue, Oct 14, 2014 at 1:03 PM, Bill Farner wrote: > I'm not arguing against the merits of the approach. Ju

Re: Proposal: External Update Coordination

2014-10-14 Thread Bill Farner
I'm not arguing against the merits of the approach. Just feeling out whether that should be done _after_ the rest of the heartbeat support. Seems like it can be cleanly added at the end to get something usable earlier. -=Bill On Tue, Oct 14, 2014 at 12:38 PM, Kevin Sweeney wrote: > I'm +1 for

Re: Proposal: External Update Coordination

2014-10-14 Thread Kevin Sweeney
I'm +1 for using lack of heartbeats as a uniform unknown-or-unhealthy signal, and punting on a more complex NACK signal (which we'd have to reliably persist). I think the only disagreement in this thread is whether the default state for a new update should be running or waiting-for-heartbeat. I th

Re: Proposal: External Update Coordination

2014-10-14 Thread Bill Farner
Wait - simpler solution than what? We're talking about not doing either. -=Bill On Tue, Oct 14, 2014 at 12:16 PM, Kevin Sweeney wrote: > I think waiting for the first heartbeat before taking any action is the > simpler solution here as it allows the implementation to be entirely > soft-state a

Re: Proposal: External Update Coordination

2014-10-14 Thread Kevin Sweeney
I think waiting for the first heartbeat before taking any action is the simpler solution here as it allows the implementation to be entirely soft-state and still catches the bugs I described. The implementation is just PulseMonitorImpl - heartbeat calls pulse and mutation operations check isAlive.
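A minimal sketch of the soft-state approach described above, assuming a monitor that keeps only the time of the last heartbeat per update in memory. The class name `PulseMonitor`, its `pulse` and `is_alive` methods, and the injected clock are illustrative assumptions and are not taken from Aurora's actual PulseMonitorImpl.

```python
import time
from typing import Callable, Dict


class PulseMonitor:
    """Hypothetical in-memory pulse tracker; nothing here is persisted."""

    def __init__(
        self,
        timeout_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_pulse: Dict[str, float] = {}

    def pulse(self, update_id: str) -> None:
        """Called by the heartbeat RPC handler: record the current time."""
        self._last_pulse[update_id] = self._clock()

    def is_alive(self, update_id: str) -> bool:
        """Checked before each mutation; False until the first pulse arrives."""
        last = self._last_pulse.get(update_id)
        if last is None:
            return False
        return self._clock() - last < self._timeout_s
```

Because `is_alive` returns False for an update that has never pulsed, a new update and an update rehydrated after a failover both wait for the first heartbeat without any extra persisted state.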

Re: Proposal: External Update Coordination

2014-10-14 Thread Maxim Khutornenko
Pausing update on creation seems like a logical approach when dealing with inverted dependency model. I.e. updater is happy to act as long as it's greenlighted by the external signal. It's also aligned with a failover experience where coordinated updates are rehydrated in paused state waiting for H

Re: Proposal: External Update Coordination

2014-10-14 Thread Bill Farner
If the goal is to reduce complexity now and add features later, why not nuke both for now - kick off the update right away, and let lack of heartbeats serve as a uniform "unknown or unhealthy" signal? -=Bill On Mon, Oct 13, 2014 at 5:25 PM, Maxim Khutornenko wrote: > I am still +1 on the idea t

Re: Proposal: External Update Coordination

2014-10-13 Thread Maxim Khutornenko
I am still +1 on the idea to have default paused state on creation. I think we could still differentiate between initially paused and timed out states internally by looking at pause reason. It's quite different if we want to store explicit NACK reasons from the external service though. That would r

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
I like the idea of implementing this scheduler-side purely through volatile state, but the lack of feedback (generic vs specific error messages when an update is paused) leaves something to be desired. Maybe we can address that with a metadata field in the initial call to startUpdate (with an optio

Re: Proposal: External Update Coordination

2014-10-13 Thread Maxim Khutornenko
The main reason I preferred the lack-of-ACK approach over an explicit NACK one is simplicity. As Joshua pointed out there is more state to handle in that case. The lack-of-ACK model can be completely implemented in volatile memory sidestepping the persistent storage entirely. With the NACK we would

Re: Proposal: External Update Coordination

2014-10-13 Thread Joshua Cohen
"The heartbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If we are going to let lack-of-ACK serve as the NACK, i don't think it's safe to resume when we receive another ACK. In other words, a service toggling unhealthy might not be deemed safe to proceed." Lack-of-ACK is the scena

Re: Proposal: External Update Coordination

2014-10-13 Thread Bill Farner
> > the generic paused-waiting-for-heartbeat message will be quickly replaced > by "high 502 rate" From the doc Maxim linked, i don't believe that's the plan: External service detects service health problems and stops heartbeats > Heartbeat timeout occurs. Scheduler pauses the update. -=Bill

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
If the service sending the heartbeat RPC is working, the generic paused-waiting-for-heartbeat message will be quickly replaced by "high 502 rate". If it's not working (or has connectivity issues) we at least won't give a false sense of progress. On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner wrote:

Re: Proposal: External Update Coordination

2014-10-13 Thread Bill Farner
Re: user experience, NACK-via-timeout fails here as well. "PAUSED - Heartbeat not received in 60s" is objectively worse than "PAUSED - Heartbeat failed: high 502 rate". This is part of the impedance mismatch i'm calling out. -=Bill On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney wrote: > On Mo

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner wrote: > What is the guidance for deploying while the heartbeat service is broken? > I think i know the answer, but it's important to spell out. > > > > > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED) > > state to avoid any prog

Re: Proposal: External Update Coordination

2014-10-13 Thread Bill Farner
What is the guidance for deploying while the heartbeat service is broken? I think i know the answer, but it's important to spell out. > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED) > state to avoid any progress until the first heartbeat call arrives. I'm not sold on th

Re: Proposal: External Update Coordination

2014-10-13 Thread Maxim Khutornenko
Agreed. That would be a logical generalization of the post failover behavior. I have updated the above document with the following changes: - Reply with PAUSED any time a job was paused by user; - Start in paused state by default. On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney wrote: > The doc

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
The doc mentioned that the scheduler will start an update subject to the heartbeat countdown, and if it doesn't receive a heartbeat it will pause the update. Why not start with the update paused-due-to-no-heartbeat to fail-fast any connectivity issues between the service providing the heartbeats an

Re: Proposal: External Update Coordination

2014-10-10 Thread Maxim Khutornenko
I presume you are referring to case #5? Joshua correctly pointed out the assumption I made (I should probably have documented it): that the pause/resume actions would issue relevant stop/start calls for the monitoring service to either suspend or resume heartbeats. You may argue it applies unnecess

Re: Proposal: External Update Coordination

2014-10-10 Thread Joshua Cohen
In theory when the user resumes they'd also resume monitoring (and thus resume heartbeats)? Maybe the resumeJobUpdate RPC needs to support pauseIfNoHeartbeatsAfterMs as well? I'm not sure what a monitoring/heartbeating service would do with its record of a paused job while it's paused. How would i

Re: Proposal: External Update Coordination

2014-10-10 Thread David McLaughlin
- A heartbeatJobUpdate RPC is called with the matching update ID. Scheduler resets countdown and responds with STOP Paused is a tricky state because the user can resume at any time. I'd propose we have a different response here. You really don't want to "stop" monitoring the update while it

Proposal: External Update Coordination

2014-10-10 Thread Maxim Khutornenko
Hi all, We are proposing a new feature for the scheduler updater, which you may find helpful. I have posted a brief feature summary here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md Please reply with your feedback/concerns/comments. Thanks, Maxim