Unfortunately, I don't think my change can go in as-is. As Brian Wickman pointed out, it could cause serious problems: the scheduler and executor use different timeouts, so if you set the wait time too high, the scheduler may start to consider tasks lost because they stayed in the transient KILLING state for too long.
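To make the concern concrete, here is a rough sketch of the kind of guard a configurable wait would probably need. Everything in it is hypothetical: the function name, the safety margin, and the assumption of two escalation steps (one wait after /quitquitquit, one after /abortabortabort) are mine, not taken from the Aurora code, and the scheduler-side timeout would have to be looked up rather than assumed.

```python
# Hypothetical sketch: none of these names or values come from Aurora itself.

# Assumed: one wait after /quitquitquit and one after /abortabortabort.
ESCALATION_STEPS = 2


def clamp_escalation_wait(requested_secs, scheduler_killing_timeout_secs,
                          safety_margin_secs=5.0):
    """Return a per-step wait that keeps the total shutdown time below the
    (assumed) time the scheduler tolerates a task sitting in KILLING."""
    budget = scheduler_killing_timeout_secs - safety_margin_secs
    if budget <= 0:
        raise ValueError("scheduler timeout leaves no room for a graceful shutdown")
    # Split the remaining budget evenly across the escalation steps.
    max_per_step = budget / ESCALATION_STEPS
    return min(requested_secs, max_per_step)
```

With a guard like this, a user asking for a 120 second wait against a 60 second scheduler-side limit would get a shorter wait instead of having the task marked lost.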
I do think the lifecycle modules idea would solve Stephan's issue.

On Tue, Mar 24, 2015 at 5:06 PM, Brian Brazil <brian.bra...@boxever.com> wrote:

> On 24 March 2015 at 20:57, Erb, Stephan <stephan....@blue-yonder.com> wrote:
>
> > Hi everyone,
> >
> > we are implementing the /health endpoint in our services but omit the
> > implementation of the unauthenticated lifecycle methods /quitquitquit and
> > /abortabortabort.
> >
> > As a consequence, stopping a service is taxed by 10 seconds waiting time
> > [1]. I would like to get rid of this unnecessary delay and can think of two
> > solutions:
> >
> > a) Only perform the escalation wait when the http_signaler reports that
> > the message could be delivered to the service. This is a rather simple and
> > localized fix.
> >
> > b) Use another port for lifecycle events. This would require a new
> > addition to the task configuration and proper plumbing throughout the rest
> > of the system. Backward compatibility could be achieved by using 'health'
> > as the default lifecycle management port.
> >
> > Any thoughts? I would be happy with the simple solution, but in the end
> > it's your call :-)
>
> __george mentioned on IRC working on a change that'll let the wait time be
> configurable (which is something I also need), would that cover your use
> case?
>
> There were also discussions on IRC about custom lifecycle modules.
>
> Brian
>
> > Best Regards,
> > Stephan
> >
> > [1] https://github.com/apache/incubator-aurora/blob/master/src/main/python/apache/aurora/executor/thermos_task_runner.py#L123