I think this approach is cleaner than relying on framework messages, and less fragile since we don't have to count on unreliable delivery.
One other nice externality is that I always thought it was a little confusing that "RUNNING" happened before we were sure it was healthy. I think the new definition would be more intuitive.

On Thu Dec 18 2014 at 9:59:27 AM Brian Wickman <wick...@apache.org> wrote:

> One nit: s/max_consecutive_successes/min_consecutive_successes/ This is
> "once the minimum number of consecutive passed health checks has been
> reached, transition to RUNNING."
>
> My take: I think this approach is strictly better than relying upon
> framework messages for correct update behavior. Leveraging the status
> update semantics provided by Mesos means the executor/scheduler remain
> decoupled and thus can still support any arbitrary Mesos executor
> launched by Aurora.
>
> That being said, it will probably necessitate a change in design of how
> the StatusCheckers work inside of the existing Aurora executor, which
> could be a fairly significant amount of work though probably less
> significant than required for the scheduler state machine. I think it's
> worth it in the long run, but I'm not sure what the short term priority
> of this work should be.
>
> ~brian
>
> On Wed, Dec 17, 2014 at 6:20 PM, Maxim Khutornenko <ma...@apache.org>
> wrote:
> >
> > Resending as my original post got dropped somehow.
> >
> > Here is in-person discussion follow up. Participants: Moses, wickman,
> > kevints, maxim.
> >
> > The proposal we came up with does not require implementing scheduler
> > health checks (AURORA-279). The idea is to require the executor to
> > move a task from STARTING to RUNNING only when its health checks are
> > satisfied. This will make the updater go faster by relying directly on
> > RUNNING status update, which is now going to be a true reflection of a
> > healthy user task. The watch_secs will still be useful for updating
> > tasks without the health checks enabled.
> >
> > Below is a high level summary of required changes (incomplete).
> >
> > Scheduler:
> > - Modify task state machine to treat STARTING as a new active
> >   (non-transient) state
> > - Modify Preemptor to account for STARTING
> > - Modify stats and SLA metrics to properly account for STARTING
> > - Modify scheduler updater to short-circuit watch_secs when health
> >   checks are enabled
> >
> > Schema:
> > - Add max_consecutive_successes setting into HealthCheckConfig [1] to
> >   instruct the executor when to move task into RUNNING.
> >
> > Executor:
> > - Modify state transition logic to rely on health checks (if enabled)
> >   to move the task into RUNNING. Transition from STARTING to RUNNING
> >   immediately if task health checks are disabled.
> >
> > Open question: with STARTING becoming a non-transient state from the
> > scheduler standpoint, there is nothing to enforce its exit. This may
> > be OK as STARTING will effectively be a stable user defined state.
> > However, this is something we may want to cap to avoid adverse user
> > impact.
> >
> > Thoughts?
> >
> > Thanks,
> > Maxim
> >
> > [1] - https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
> >
> > On Sat, Dec 13, 2014 at 11:06 AM, Nakamura <nny...@gmail.com> wrote:
> > > Hey,
> > > Just wanted to make sure my email didn't get lost in the cracks.
> > >
> > > As a reminder, the previous emails in this thread were:
> > > Bill Farner
> > > <http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/ajax/%3CCAGRA8uMpWyhcV-hxLU%3Dw7twDD7jbffu39TmbX5MPiXQE8jextA%40mail.gmail.com%3E>
> > > Brian Wickman
> > > <http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/ajax/%3CCAFTdr0DerXKtK%2BhGrJDN0VU-RgQ8sisCKaAZ3Jzg11BTzea5gw%40mail.gmail.com%3E>
> > >
> > > Best,
> > > Moses
> > >
> > > On Thu Dec 04 2014 at 11:14:02 AM Nakamura <nny...@gmail.com> wrote:
> > >
> > >> Hey,
> > >>
> > >> Sorry that this is replying to my own email, I didn't realize that I
> > >> had to subscribe to the dev@aurora listserv to get updates. This
> > >> email should really be in response to Brian Wickman's response.
> > >>
> > >> Hmm, I don't think only sending the transitions is sufficient though.
> > >> My concern is that since sending framework messages isn't reliable,
> > >> we could end up in a situation where the scheduler perceives the task
> > >> is healthy even though it's not.
> > >>
> > >> 1. scheduler spins up executor
> > >> 2. executor unhealthy
> > >> 3. executor transitions to healthy, sends message to scheduler
> > >> 4. scheduler receives healthy message
> > >> 5. executor transitions to unhealthy before N healthy messages, sends
> > >>    message to scheduler
> > >> 6. scheduler does not receive unhealthy message
> > >> 7. after waiting for N messages * time between messages without a
> > >>    response, it assumes that it has remained healthy and marks it as
> > >>    healthy enough to continue.
> > >>
> > >> We can fix this by changing 7 to include the check that's currently
> > >> included in the watch_secs delayed action.
> > >>
> > >> Here is my new proposal for how B should work:
> > >>
> > >> Executor sends health transitions as framework messages to the
> > >> scheduler.
> > >> When the scheduler receives a transition to healthiness, it waits
> > >> for N messages * time between messages, and then sends a request to
> > >> ask if the executor is still healthy. If the scheduler never sees a
> > >> healthy message, it defaults to the old behavior, sending a request
> > >> at watch_secs. Once the scheduler no longer needs the transitions, it
> > >> tells the executor to stop sending the messages.
> > >>
> > >> Thoughts? Are there any easy ways I can simplify the design?
> > >>
> > >> Best,
> > >> Moses
> > >>
> > >> On Tue Dec 02 2014 at 1:53:24 PM Nakamura <nny...@gmail.com> wrote:
> > >>
> > >>> Howdy,
> > >>>
> > >>> I'm interested in tackling AURORA-894, but I'm not terribly familiar
> > >>> with aurora, so I'd like some feedback on my design before I go
> > >>> forth.
> > >>>
> > >>> Bill pointed out that the hard bit would be designing the algorithm
> > >>> so it doesn't DDoS the scheduler, and I think I have an idea of the
> > >>> possible design space. I wanted to know what you thought.
> > >>>
> > >>> A. sample the number of health checks, and send them back to the
> > >>> scheduler. this is pretty simple, but 99% of the time will be total
> > >>> noise, since the data isn't generally useful.
> > >>>
> > >>> B. the executor sends health checks until it receives an out of band
> > >>> request from the scheduler not to. this seems fragile (I'm imagining
> > >>> mismatched executors/schedulers behaving poorly) but would also
> > >>> probably be reasonably simple.
> > >>>
> > >>> C. a slightly more sophisticated approach might be to tell the
> > >>> executor how many health checks to look for, so that it could send a
> > >>> status update back, since status updates have reliable delivery.
> > >>>
> > >>> D.
> > >>> when the scheduler has finished standing up the executor, it
> > >>> long-polls, which also takes care of reliable delivery because it's
> > >>> presumably over TCP and we have total control (not having to go
> > >>> through mesos).
> > >>>
> > >>> I'm hesitant to do A, because it's so wasteful. B sounds fragile, so
> > >>> I don't want to do that one. D requires long-polling, which your
> > >>> client may or may not do well. I'm leaning toward C. Do you think
> > >>> that sounds like a reasonable approach?
> > >>>
> > >>> Thanks,
> > >>> Moses
> > >>>
> > >>
> > >
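
For anyone skimming the thread, here is a rough sketch of the executor-side rule discussed above: stay in STARTING until the minimum number of consecutive health checks has passed, then move to RUNNING, and move to RUNNING immediately when health checks are disabled. This is only an illustration, not the Aurora executor's actual code. The names `HealthGate`, `TaskState`, and `record` are made up for this sketch, and using `None` to mean "health checks disabled" is an assumption. What should happen when a task that is already RUNNING later fails a check is not covered here, since the thread does not settle it.

```python
from enum import Enum
from typing import Optional


class TaskState(Enum):
    STARTING = "STARTING"
    RUNNING = "RUNNING"


class HealthGate:
    """Hypothetical helper deciding when a task may leave STARTING."""

    def __init__(self, min_consecutive_successes: Optional[int]) -> None:
        # Assumption of this sketch: None means health checks are disabled.
        self.min_consecutive_successes = min_consecutive_successes
        self.consecutive_successes = 0
        # With health checks disabled, transition to RUNNING immediately.
        self.state = (
            TaskState.RUNNING
            if min_consecutive_successes is None
            else TaskState.STARTING
        )

    def record(self, healthy: bool) -> TaskState:
        """Record one health check result and return the resulting state."""
        if self.state is TaskState.RUNNING:
            # Behavior after reaching RUNNING is out of scope for this sketch.
            return self.state
        # A failed check resets the streak; a passed check extends it.
        self.consecutive_successes = (
            self.consecutive_successes + 1 if healthy else 0
        )
        if self.consecutive_successes >= self.min_consecutive_successes:
            self.state = TaskState.RUNNING
        return self.state
```

In the real system the transition would be reported through a Mesos status update, which is what gives the scheduler reliable delivery; the sketch only models the counting rule.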