Re: Aurora Documentation Sprint, next Thursday 10/16

2014-10-10 Thread Bill Farner
Also, consider this open call for documentation tickets. Whether it's for bad, outdated, or lacking docs, file a ticket to let us know! On Thursday, October 9, 2014, Dave Lester wrote: > Hi All, > > We've organized an informal documentation sprint for next Thursday > (10/16), where several of u

Re: Aurora Documentation Sprint, next Thursday 10/16

2014-10-10 Thread Joe Stein
I had some issues building and running but worked through them yesterday. I can work through it again and pull together a patch. I think I need to get this to run https://svn.apache.org/repos/asf/incubator/aurora/site/ for that? Could we also change the "roles" in the documentation? Maybe we cal

Re: Aurora Documentation Sprint, next Thursday 10/16

2014-10-10 Thread Bill Farner
Our usual flow for documentation is to edit markdown files under docs/ [1] and use github as the testbed when posting a review. [1] https://github.com/apache/incubator-aurora/tree/master/docs -=Bill On Fri, Oct 10, 2014 at 5:36 AM, Joe Stein wrote: > I had some issues building and running but

Proposal: External Update Coordination

2014-10-10 Thread Maxim Khutornenko
Hi all, We are proposing a new feature for the scheduler updater, which you may find helpful. I have posted a brief feature summary here: https://github.com/maxim111333/incubator-aurora/blob/hb_doc/docs/update-heartbeat.md Please reply with your feedback/concerns/comments. Thanks, Maxim

Re: Proposal: External Update Coordination

2014-10-10 Thread David McLaughlin
- A heartbeatJobUpdate RPC is called with the matching update ID. Scheduler resets countdown and responds with STOP Paused is a tricky state because the user can resume at any time. I'd propose we have a different response here. You really don't want to "stop" monitoring the update while it

Re: Proposal: External Update Coordination

2014-10-10 Thread Joshua Cohen
In theory when the user resumes they'd also resume monitoring (and thus resume heartbeats)? Maybe the resumeJobUpdate RPC needs to support pauseIfNoHeartbeatsAfterMs as well? I'm not sure what a monitoring/heartbeating service would do with its record of a paused job while it's paused. How would i

Re: Proposal: External Update Coordination

2014-10-10 Thread Maxim Khutornenko
I presume you are referring to case #5? Joshua correctly pointed out the assumption I made (I probably should have documented it): the pause/resume actions would issue the relevant stop/start calls for the monitoring service to either suspend or resume heartbeats. You may argue it applies unnecess
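
The countdown behaviour discussed in this thread can be sketched roughly as follows. This is an illustration only, not the scheduler's implementation: the class name, the method names, the "OK" response and the explicit clock argument are all assumptions made for the example. Only the heartbeatJobUpdate call, the STOP response for a paused update, and the pauseIfNoHeartbeatsAfterMs setting come from the messages above.

```python
class HeartbeatCountdown:
    """Tracks the heartbeat countdown for one update (hypothetical sketch)."""

    def __init__(self, update_id, pause_if_no_heartbeats_after_ms, now_ms):
        self.update_id = update_id
        self.timeout_ms = pause_if_no_heartbeats_after_ms
        self.deadline_ms = now_ms + self.timeout_ms
        self.paused = False

    def tick(self, now_ms):
        """Pause the update once the countdown has run out; return paused state."""
        if not self.paused and now_ms >= self.deadline_ms:
            self.paused = True
        return self.paused

    def heartbeat(self, update_id, now_ms):
        """Stand-in for heartbeatJobUpdate: reset the countdown and respond."""
        if update_id != self.update_id:
            raise KeyError(update_id)
        self.tick(now_ms)  # a late heartbeat does not undo an expired countdown
        self.deadline_ms = now_ms + self.timeout_ms
        # "STOP" for a paused update is from the thread; "OK" is an assumed name.
        return "STOP" if self.paused else "OK"

    def resume(self, now_ms):
        """User resumes the update: clear the pause and restart the countdown."""
        self.paused = False
        self.deadline_ms = now_ms + self.timeout_ms
```

In this sketch a paused update keeps answering STOP until resume() is called, which matches the assumption that the pause/resume actions are what tell the monitoring service to suspend or restart heartbeats.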

Health Check Disabler Discussion

2014-10-10 Thread David Pan
Hi Aurora, I am currently working on a feature that allows for health checks to be disabled temporarily for a running instance of a job. The code review can be found at https://reviews.apache.org/r/26383/. The idea is that the presence of a special "snooze file" in the task's sandbox will trigge

Re: Health Check Disabler Discussion

2014-10-10 Thread Maxim Khutornenko
+1 to the #1. Disabling health checks is like signing a waiver where all health check guarantees are off. On Fri, Oct 10, 2014 at 2:23 PM, David Pan wrote: > Hi Aurora, > > I am currently working on a feature that allows for health checks to be > disabled temporarily for a running instance of a j

Re: Health Check Disabler Discussion

2014-10-10 Thread Zameer Manji
+1 #2. We don't surface disabling health checks anywhere to the user. I think the system should err on the side of caution and get to the state that it is advertising on the UI. On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutornenko wrote: > +1 to the #1. Disabling health checks is like signing a wa

Re: Health Check Disabler Discussion

2014-10-10 Thread Joshua Cohen
I'm in camp #2. I don't feel that it adds a significant amount of complexity to the health check logic, and it provides a substantial safeguard against users shooting themselves in the foot by accidentally leaving a health check snoozed. On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutorne

Re: Health Check Disabler Discussion

2014-10-10 Thread Bill Farner
I'm generally in #1, but could land somewhere in between. I think the idea of using mtime came up, which I like more than parsing the snooze file and giving full control. I'd be fine with expiring this file at mtime + SNOOZE_TIMEOUT (constant). This fails closed, is relatively simple to implemen
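
A minimal sketch of the mtime-based expiry described above, assuming the health checker can see the task's sandbox directory. The file name, the timeout value and the function name are hypothetical; the thread only specifies that the snooze expires at mtime + SNOOZE_TIMEOUT and that the file contents are not parsed.

```python
import os
import time

# Hypothetical values: the thread gives neither the timeout nor the file name.
SNOOZE_TIMEOUT_SECS = 600.0
SNOOZE_FILE_NAME = ".healthchecksnooze"


def snooze_active(sandbox_dir, now=None):
    """Return True while an unexpired snooze file exists in sandbox_dir."""
    path = os.path.join(sandbox_dir, SNOOZE_FILE_NAME)
    try:
        mtime = os.stat(path).st_mtime
    except OSError:
        # Missing or unreadable file: fail closed, health checks stay enabled.
        return False
    if now is None:
        now = time.time()
    # Only the modification time matters; the file is never opened or parsed.
    return now < mtime + SNOOZE_TIMEOUT_SECS
```

Under this scheme, touching the file again would extend the snooze by one more timeout period, and a forgotten file stops having any effect once the timeout passes.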

Re: Health Check Disabler Discussion

2014-10-10 Thread Bill Farner
I'm cool with #2, specifically if we do not attempt to parse the file and use that to determine the auto-expire time. -=Bill On Fri, Oct 10, 2014 at 2:48 PM, Joshua Cohen wrote: > I'm in camp #2, I don't feel that it adds a significant amount of > complexity to the health check logic, and it p

Re: Health Check Disabler Discussion

2014-10-10 Thread Bill Farner
Ignore my first response, I think Gmail drafts are out to get me. -=Bill On Fri, Oct 10, 2014 at 3:30 PM, Bill Farner wrote: > I'm cool with #2, specifically if we do not attempt to parse the file and > use that to determine the auto-expire time. > > > -=Bill > > On Fri, Oct 10, 2014 at 2:48 PM