Sounds good to me. On Mon, Oct 13, 2014 at 11:51 AM, Bill Farner <wfar...@apache.org> wrote:
> We had a discussion about this in our weekly community meeting in IRC > today, and after some debate there was unanimous agreement to avoid all > time control but to use the presence of the snooze file only. Below is the > excerpt from the discussion. If you disagree, feel free to continue the > discussion here. Otherwise, i suggest the patch be updated to remove all > time control. > > ## Health check snooze ## > > [Mon Oct 13 18:21:52 2014] <wfarner>: We had a review for a new feature > > move to a dev list discussion last week. Does anybody believe we did not > > achieve consensus on the approach? > > [Mon Oct 13 18:22:00 2014] <wfarner>: > https://reviews.apache.org/r/26383/ > > [Mon Oct 13 18:22:21 2014] <wfarner>: AURORA-795 > > [Mon Oct 13 18:22:27 2014] <zmanji>: There is a mailing list thread here: > > > http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201410.mbox/%3CCACGrrVnLWDU=vevaft_qn0il5c8oq7pqae-3ge5nnh6vjg4...@mail.gmail.com%3E > > [Mon Oct 13 18:22:57 2014] <zmanji>: I don’t think we have a consenus yet > > so please voice your opinion > > [Mon Oct 13 18:23:00 2014] <wickman>: I think the consensus was "touch a > > snooze file, then unlink after mtime + CONSTANT_TIMEOUT" > > [Mon Oct 13 18:23:07 2014] <wickman>: is that not correct? > > [Mon Oct 13 18:23:14 2014] <wfarner>: wickman: that was my understanding > > as well > > [Mon Oct 13 18:23:42 2014] <wickman>: the other option is "touch a file, > > and the health checker is disabled as long as that file is there." > > [Mon Oct 13 18:23:43 2014] <kts>: I still feel that we should avoid being > > too clever in our implementation here > > [Mon Oct 13 18:23:44 2014] <jcohen>: yeah, it sounded to me like that’s > > what we were coalescing on. > > [Mon Oct 13 18:23:58 2014] <wickman>: the reason that I'm less in favor > of > > that approach is that it's not really a snooze -- it's a sleep, and could > > be prone to somebody forgetting to turn it off > > [Mon Oct 13 18:24:10 2014] <wickman>: which might be okay -- i think 99 > > times out of 100, people will be snoozing so they can get the state of a > > wedged task > > [Mon Oct 13 18:24:13 2014] <wickman>: at which point they will kill when > > they're done > > [Mon Oct 13 18:24:18 2014] <wickman>: so i think there's a reasonable > > argument either way > > [Mon Oct 13 18:24:24 2014] <wfarner>: yes, i'm torn > > [Mon Oct 13 18:24:36 2014] <kts>: but we don't really know how long a > tool > > will take to get information about the wedged state > > [Mon Oct 13 18:24:47 2014] <wickman>: kts: yeah, that's why #1 might be > > more appealing > > [Mon Oct 13 18:24:58 2014] <jcohen>: kts: in that cause they could extend > > the snooze by using touch -m? > > [Mon Oct 13 18:25:05 2014] <wickman>: though you could just do (while > > true; do touch .snooze; sleep 60; done;) & > > [Mon Oct 13 18:25:46 2014] <jcohen>: I suppose it’s a question of what’s > > more likely (or more concerning): will someone forget to remove a snooze, > > or forget to extend it > > [Mon Oct 13 18:26:06 2014] <mkhutornenko>: +1 for not deleting the file. > > Avoiding FS mutation == Less complexity == less things to go wrong > > [Mon Oct 13 18:26:09 2014] <wickman>: i think it's important to look at > > why you'd want to snooze in the first place > > [Mon Oct 13 18:26:15 2014] <kts>: forget to extend means diagnostic > > information is lost forever > > [Mon Oct 13 18:26:18 2014] <wickman>: the only case i can think of is > > something in a super weird state > > [Mon Oct 13 18:26:27 2014] <wickman>: and they're almost always going to > > kill those things in the weird state when they're done > > [Mon Oct 13 18:26:32 2014] <wickman>: which would point to a permanent > > snooze > > [Mon Oct 13 18:26:38 2014] <wfarner>: that was my feeling as well > > [Mon Oct 13 18:28:17 2014] <wfarner>: should we reverse the position on > > this back to no time awareness at all? > > [Mon Oct 13 18:28:23 2014] <kts>: +1 > > [Mon Oct 13 18:28:31 2014] <mkhutornenko>: +1 > > [Mon Oct 13 18:28:40 2014] <zmanji>: +1 > > [Mon Oct 13 18:28:46 2014] <wfarner>: +1 > > > > -=Bill > > On Fri, Oct 10, 2014 at 3:32 PM, Bill Farner <wfar...@apache.org> wrote: > > > Ignore my first response, i think gmail drafts are out to get me. > > > > -=Bill > > > > On Fri, Oct 10, 2014 at 3:30 PM, Bill Farner <wfar...@apache.org> wrote: > > > >> I'm cool with #2, specifically if we do not attempt to parse the file > and > >> use that to determine the auto-expire time. > >> > >> > >> -=Bill > >> > >> On Fri, Oct 10, 2014 at 2:48 PM, Joshua Cohen <jco...@twopensource.com> > >> wrote: > >> > >>> I'm in camp #2, I don't feel that it adds a significant amount of > >>> complexity to the health check logic, and it provides a substantial > >>> safeguard against users accidentally shooting themselves in the foot by > >>> accidentally leaving a health check snoozed. > >>> > >>> On Fri, Oct 10, 2014 at 2:32 PM, Maxim Khutornenko <ma...@apache.org> > >>> wrote: > >>> > >>> > +1 to the #1. Disabling health checks is like signing a waiver where > >>> > all health check guarantees are off. > >>> > > >>> > On Fri, Oct 10, 2014 at 2:23 PM, David Pan <david.p...@gmail.com> > >>> wrote: > >>> > > Hi Aurora, > >>> > > > >>> > > I am currently working on a feature that allows for health checks > to > >>> be > >>> > > disabled temporarily for a running instance of a job. The code > >>> review > >>> > can > >>> > > be found at https://reviews.apache.org/r/26383/. The idea is that > >>> the > >>> > > presence of a special "snooze file" in the task's sandbox will > >>> trigger > >>> > the > >>> > > disabling of the health checks. > >>> > > > >>> > > Currently, the code reviewers have split off into two camps: > >>> > > 1. One set of reviewers believe that simplicity is key. Disable > the > >>> > health > >>> > > checks if the snooze file is present, enable it otherwise. > >>> > > > >>> > > 2. The other set of reviewers believe that there should be a snooze > >>> > > duration. The timer starts when the snooze file is touched. After > >>> the > >>> > > snooze duration is exhausted, the snooze file should be deleted by > >>> the > >>> > > health checker, and health checks resume. This is useful if the > >>> process > >>> > > that initially disabled the health checks dies unexpectedly, and is > >>> no > >>> > > longer there to re-enable the health checks. > >>> > > > >>> > > I would like to invite anyone interested to voice your opinions and > >>> chime > >>> > > in. > >>> > > > >>> > > Thanks, > >>> > > > >>> > > David Pan > >>> > > >>> > >> > >> > > >