Howdy, I'm interested in tackling AURORA-894, but I'm not terribly familiar with Aurora, so I'd like some feedback on my design before I go any further.
Bill pointed out that the hard bit would be designing the algorithm so it doesn't DDoS the scheduler. I think I have an idea of the possible design space, and I wanted to know what you thought.

A. Sample the health checks and send the samples back to the scheduler. This is pretty simple, but 99% of the time it will be pure noise, since the data isn't generally useful.

B. The executor sends health checks until it receives an out-of-band request from the scheduler to stop. This seems fragile (I'm imagining mismatched executors/schedulers behaving poorly), but it would probably also be reasonably simple.

C. A slightly more sophisticated approach: tell the executor how many health checks to look for, so that it can send a single status update back once it has seen them. Status updates have reliable delivery.

D. When the scheduler has finished standing up the executor, it long-polls. This also takes care of reliable delivery, because it's presumably over TCP and we have total control (not having to go through Mesos).

I'm hesitant to do A because it's so wasteful. B sounds fragile, so I don't want to do that one. D requires long-polling, which your client may or may not do well. I'm leaning toward C.

Do you think that sounds like a reasonable approach?

Thanks,
Moses
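
P.S. To make option C a bit more concrete, here is a rough Python sketch of what I have in mind on the executor side. Everything in it is hypothetical: `HealthCheckThreshold`, `send_status_update`, and the "healthy" payload are placeholder names, not Aurora or Mesos APIs. It also assumes that a failed check resets the count, which is a design decision I haven't settled on.

```python
from typing import Callable


class HealthCheckThreshold:
    """Hypothetical helper: report once after `required` consecutive passing checks."""

    def __init__(self, required: int, send_status_update: Callable[[str], None]) -> None:
        if required < 1:
            raise ValueError("required must be at least 1")
        self._required = required          # count the scheduler asked us to look for
        self._send = send_status_update    # placeholder for the reliable status update path
        self._consecutive = 0
        self._reported = False

    def record(self, healthy: bool) -> None:
        """Feed in one health check result; send at most one update in total."""
        if self._reported:
            return  # already reported, so later checks generate no traffic
        if not healthy:
            self._consecutive = 0  # assumption: a failure restarts the count
            return
        self._consecutive += 1
        if self._consecutive >= self._required:
            self._reported = True
            self._send("healthy")  # placeholder payload


# Example: the scheduler asked for 3 passing checks in a row.
sent = []
tracker = HealthCheckThreshold(required=3, send_status_update=sent.append)
for result in [True, True, False, True, True, True, True]:
    tracker.record(result)
print(sent)  # ['healthy']
```

The point is that the scheduler sees one message per task no matter how many health checks run, which is what should keep it from being flooded.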