Hey,

Just wanted to make sure my email didn't slip through the cracks. As a
reminder, the previous emails in this thread were:

Bill Farner
<http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/ajax/%3CCAGRA8uMpWyhcV-hxLU%3Dw7twDD7jbffu39TmbX5MPiXQE8jextA%40mail.gmail.com%3E>

Brian Wickman
<http://mail-archives.apache.org/mod_mbox/incubator-aurora-dev/201412.mbox/ajax/%3CCAFTdr0DerXKtK%2BhGrJDN0VU-RgQ8sisCKaAZ3Jzg11BTzea5gw%40mail.gmail.com%3E>

Best,
Moses

On Thu Dec 04 2014 at 11:14:02 AM Nakamura <nny...@gmail.com> wrote:

> Hey,
>
> Sorry that this is replying to my own email, I didn't realize that I had
> to subscribe to the dev@aurora listserv to get updates. This email
> should really be in response to Brian Wickman's response.
>
> Hmm, I don't think only sending the transitions is sufficient though. My
> concern is that since sending framework messages isn't reliable, we could
> end up in a situation where the scheduler perceives the task is healthy
> even though it's not.
>
> 1. scheduler spins up executor
> 2. executor unhealthy
> 3. executor transitions to healthy, sends message to scheduler
> 4. scheduler receives healthy message
> 5. executor transitions to unhealthy before N healthy messages, sends
>    message to scheduler
> 6. scheduler does not receive unhealthy message
> 7. after waiting for N messages * time between messages without a
>    response, it assumes that it has remained healthy and marks it as healthy
>    enough to continue.
>
> We can fix this by changing 7 to include the check that's currently
> included in the watch_secs delayed action.
>
> Here is my new proposal for how B should work:
>
> Executor sends health transitions as framework messages to the
> scheduler. When the scheduler receives a transition to healthiness, it
> waits for N messages * time between messages, and then sends a request to
> ask if the executor is still healthy. If the scheduler never sees a
> healthy message, it defaults to the old behavior, sending a request at
> watch_secs. Once the scheduler no longer needs the transitions, it tells
> the executor to stop sending the messages.
>
> Thoughts? Are there any easy ways I can simplify the design?
>
> Best,
> Moses
>
> On Tue Dec 02 2014 at 1:53:24 PM Nakamura <nny...@gmail.com> wrote:
>
>> Howdy,
>>
>> I'm interested in tackling AURORA-894, but I'm not terribly familiar with
>> aurora, so I'd like some feedback on my design before I go forth.
>>
>> Bill pointed out that the hard bit would be designing the algorithm so it
>> doesn't DDoS the scheduler, and I think I have an idea of the possible
>> design space. I wanted to know what you thought.
>>
>> A. sample the number of health checks, and send them back to the
>> scheduler. this is pretty simple, but 99% of the time will be total noise,
>> since the data isn't generally useful.
>>
>> B. the executor sends health checks until it receives an out of band
>> request from the scheduler not to. this seems fragile (I'm imagining
>> mismatched executors/schedulers behaving poorly) but would also probably be
>> reasonably simple.
>>
>> C. a slightly more sophisticated approach might be to tell the executor
>> how many health checks to look for, so that it could send a status update
>> back, since status updates have reliable delivery.
>>
>> D. when the scheduler has finished standing up the executor, it
>> long-polls, which also takes care of reliable delivery because it's
>> presumably over TCP and we have total control (not having to go through
>> mesos).
>>
>> I'm hesitant to do A, because it's so wasteful. B sounds fragile, so I
>> don't want to do that one. D requires long-polling, which your client may
>> or may not do well. I'm leaning toward C. Do you think that sounds like a
>> reasonable approach?
>>
>> Thanks,
>> Moses
>>
>
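
To make the revised proposal B from the Dec 04 message concrete, here is a
rough Python sketch of the scheduler-side logic, replaying steps 1-7 from that
message. It is only an illustration under stated assumptions: HealthWatcher,
on_transition, probe_time, confirm and run_scenario are made-up names, not
Aurora or Mesos APIs, and the numbers (N = 3, a 10 second interval,
watch_secs = 120, transitions at t = 20 and t = 35) are arbitrary. The "stop
sending transitions" message is left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class HealthWatcher:
    """Scheduler-side view of one executor. All names are hypothetical."""
    n_messages: int        # N consecutive healthy checks required
    interval_secs: float   # time between executor health checks
    watch_secs: float      # deadline used by the existing (old) behavior
    healthy_since: Optional[float] = None  # last healthy transition seen

    def on_transition(self, now: float, healthy: bool) -> None:
        # Framework messages are unreliable, so this may never be called
        # for a transition the executor actually made.
        self.healthy_since = now if healthy else None

    def probe_time(self) -> float:
        # Never saw a healthy transition: fall back to the watch_secs check.
        if self.healthy_since is None:
            return self.watch_secs
        # Saw one: ask again once N checks' worth of time has passed.
        return self.healthy_since + self.n_messages * self.interval_secs

    def confirm(self, ask_executor: Callable[[], bool]) -> bool:
        # Explicit request/response instead of trusting the last message.
        return ask_executor()


def run_scenario(drop_unhealthy_message: bool) -> Tuple[float, bool]:
    """Replay steps 1-7 and return (probe time, result of the probe)."""
    watcher = HealthWatcher(n_messages=3, interval_secs=10.0, watch_secs=120.0)
    executor_healthy = False            # steps 1-2: executor starts unhealthy

    executor_healthy = True             # step 3: becomes healthy at t=20
    watcher.on_transition(20.0, True)   # step 4: scheduler gets the message

    executor_healthy = False            # step 5: becomes unhealthy at t=35
    if not drop_unhealthy_message:      # step 6: the message may be lost
        watcher.on_transition(35.0, False)

    when = watcher.probe_time()         # step 7, with the extra check
    return when, watcher.confirm(lambda: executor_healthy)


if __name__ == "__main__":
    print(run_scenario(drop_unhealthy_message=True))
    print(run_scenario(drop_unhealthy_message=False))
```

In the dropped-message case the scheduler still probes at t = 50 (20 + 3 * 10)
and learns the executor is unhealthy, instead of assuming it stayed healthy.
When the unhealthy message does arrive, the sketch falls back to the
watch_secs check at t = 120.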