Re: Proposal: External Update Coordination

2014-10-13 Thread Maxim Khutornenko
I am still +1 on the idea to have default paused state on creation. I think we could still differentiate between initially paused and timed out states internally by looking at pause reason. It's quite different if we want to store explicit NACK reasons from the external service though. That would r

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
I like the idea of implementing this scheduler-side purely through volatile state, but the lack of feedback (generic vs specific error messages when an update is paused) leaves something to be desired. Maybe we can address that with a metadata field in the initial call to startUpdate (with an optio

Re: Proposal: External Update Coordination

2014-10-13 Thread Maxim Khutornenko
The main reason I preferred the lack-of-ACK approach over an explicit NACK one is simplicity. As Joshua pointed out there is more state to handle in that case. The lack-of-ACK model can be completely implemented in volatile memory sidestepping the persistent storage entirely. With the NACK we would
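
A minimal sketch of the lack-of-ACK model described above, kept entirely in volatile memory. All names here (`HeartbeatTracker`, `register`, `heartbeat`, `is_paused`) are hypothetical and not taken from the Aurora code base or the linked doc. It assumes an update starts paused until its first heartbeat and that any later heartbeat resumes it; whether resuming on a later ACK is safe is still being debated in this thread.

```python
import time
from typing import Callable, Dict, Optional


class HeartbeatTracker:
    """In-memory lack-of-ACK tracker (hypothetical; nothing is persisted)."""

    def __init__(self, timeout_secs: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout = timeout_secs
        self._clock = clock  # injectable so tests can control time
        # update id -> time of last heartbeat, or None if none seen yet
        self._last_seen: Dict[str, Optional[float]] = {}

    def register(self, update_id: str) -> None:
        # Assumption: a new coordinated update starts paused
        # until the first heartbeat arrives.
        self._last_seen[update_id] = None

    def heartbeat(self, update_id: str) -> None:
        # A heartbeat acts as the ACK; record when it arrived.
        self._last_seen[update_id] = self._clock()

    def is_paused(self, update_id: str) -> bool:
        # Paused if no heartbeat was ever seen, or the last one is
        # older than the timeout (lack-of-ACK serving as the NACK).
        last = self._last_seen[update_id]
        return last is None or self._clock() - last > self._timeout
```

Because the state lives only in a dictionary, a scheduler restart would lose it, and every update would simply look paused until the next heartbeat arrives.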

Re: Proposal: External Update Coordination

2014-10-13 Thread Joshua Cohen
"The heratbeatJobUpdate RPC serves as an ACK, but we don't have a NACK. If we are going to let lack-of-ACK serve as the NACK, i don't think it's safe to resume when we receive another ACK. In other words, a service toggling unhealthy might not be deemed safe to proceed." Lack-of-ACK is the scena

Re: Proposal: External Update Coordination

2014-10-13 Thread Bill Farner
> > the generic paused-waiting-for-heartbeat message will be quickly replaced > by "high 502 rate" From the doc Maxim linked, i don't believe that's the plan: External service detects service health problems and stops heartbeats > Heartbeat timeout occurs. Scheduler pauses the update. -=Bill

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
If the service sending the heartbeat RPC is working, the generic paused-waiting-for-heartbeat message will be quickly replaced by "high 502 rate". If it's not working (or has connectivity issues) we at least won't give a false sense of progress. On Mon, Oct 13, 2014 at 3:09 PM, Bill Farner wrote:

Re: Proposal: External Update Coordination

2014-10-13 Thread Bill Farner
Re: user experience, NACK-via-timeout fails here as well. "PAUSED - Heartbeat not received in 60s" is objectively worse than "PAUSED - Heartbeat failed: high 502 rate". This is part of the impedance mismatch i'm calling out. -=Bill On Mon, Oct 13, 2014 at 3:03 PM, Kevin Sweeney wrote: > On Mo

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
On Mon, Oct 13, 2014 at 2:50 PM, Bill Farner wrote: > What is the guidance for deploying while the heartbeat service is broken? > I think i know the answer, but it's important to spell out. > > > > > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED) > > state to avoid any prog

Re: Proposal: External Update Coordination

2014-10-13 Thread Bill Farner
What is the guidance for deploying while the heartbeat service is broken? I think i know the answer, but it's important to spell out. > Create a new coordinated job update in a paused (ROLL_FORWARD_PAUSED) > state to avoid any progress until the first heartbeat call arrives. I'm not sold on th

Re: Proposal: documentation before code

2014-10-13 Thread Joshua Cohen
I'm +1 on the sentiment. My main concern is with the potential for effort to be gated on documentation review. This essentially requires design review for all new features before development begins. Great in theory, but I worry in practice things may get held up while we try to work out details tha

Re: Proposal: documentation before code

2014-10-13 Thread Bill Farner
Slight change of position - i think the documentation with the pitch should reflect the end state. In other words, it should be suitable for committing on master when the feature is ready; and not be a pitch itself. This is slightly different from the doc linked above. -=Bill On Mon, Oct 13, 201

Proposal: documentation before code

2014-10-13 Thread Bill Farner
Maxim's 'External Update Coordination' thread [1] sets a precedent for pitching a feature via documentation before diving into code. I propose we standardize on this practice for big and small features. This will take some trial-and-error to get right, but i think for small changes requiring docu

Re: Proposal: External Update Coordination

2014-10-13 Thread Maxim Khutornenko
Agreed. That would be a logical generalization of the post failover behavior. I have updated the above document with the following changes: - Reply with PAUSED any time a job was paused by user; - Start in paused state by default. On Mon, Oct 13, 2014 at 11:32 AM, Kevin Sweeney wrote: > The doc

Re: Health Check Disabler Discussion

2014-10-13 Thread David Pan
Sounds good to me. On Mon, Oct 13, 2014 at 11:51 AM, Bill Farner wrote: > We had a discussion about this in our weekly community meeting in IRC > today, and after some debate there was unanimous agreement to avoid all > time control but to use the presence of the snooze file only. Below is the

Re: Health Check Disabler Discussion

2014-10-13 Thread Bill Farner
We had a discussion about this in our weekly community meeting in IRC today, and after some debate there was unanimous agreement to avoid all time control but to use the presence of the snooze file only. Below is the excerpt from the discussion. If you disagree, feel free to continue the discussi

Summary of IRC Meeting in #aurora

2014-10-13 Thread ASF IRC Bot
Summary of IRC Meeting in #aurora at Mon Oct 13 18:02:25 2014: Attendees: wickman, jcohen, wfarner, Yasumoto, kts, mkhutornenko, davelester, zmanji - Preface - Aurora doc day - 0.6.0 release - Test coverage flakiness - External update coordination - Ticket resolution field - Health check snooze

Re: Proposal: External Update Coordination

2014-10-13 Thread Kevin Sweeney
The doc mentioned that the scheduler will start an update subject to the heartbeat countdown, and if it doesn't receive a heartbeat it will pause the update. Why not start with the update paused-due-to-no-heartbeat to fail-fast any connectivity issues between the service providing the heartbeats an