To answer OP: (1) seems perfectly reasonable; I don't foresee any pitfalls.
(2) seems reasonable as well. Thrift unions help a bit here. Just spitballing, but this general arrangement comes to mind:

  struct TcpCheck {
    ...
  }

  struct HttpStatusCheck {
    ...
  }

  struct HttpPayloadCheck {
    ...
  }

  union HttpCheckCriteria {
    1: HttpStatusCheck status
    2: HttpPayloadCheck payload
  }

  struct HttpCheck {
    ...
    n: set<HttpCheckCriteria> criteria
  }

  union HealthCheck {
    1: TcpCheck tcp
    2: HttpCheck http
  }

We could obviously get pretty complicated with this if we choose to, but starting with some opinionated defaults and an extensible structure may be key.

I also agree that graceful teardown should be decoupled from health checks.

-=Bill

On Sat, Feb 21, 2015 at 10:35 AM, Bill Farner <wfar...@apache.org> wrote:

> If i'm reading the code correctly, the only way to use mesos' health
> checks is with the command executor? Can somebody check my work on that?
>
> Some other context around health checks to keep in mind:
> - there is a review [1] in-flight for the executor to delay the transition
>   to RUNNING until the first positive health check [2]
> - we want to make the scheduler the authority for reacting to health check
>   failures [3]. this is a very real concern for large services to avoid
>   simultaneous failures
>
> [1] https://reviews.apache.org/r/31104/
> [2] https://issues.apache.org/jira/browse/AURORA-894
> [3] https://issues.apache.org/jira/browse/AURORA-279
>
> -=Bill
>
> On Sat, Feb 21, 2015 at 3:48 AM, Erb, Stephan <stephan....@blue-yonder.com> wrote:
>
>> Hi Florian,
>>
>> have you looked at what Mesos is already offering out of the box [1]?
>> Maybe there is a way to implement your features by relying on Mesos
>> directly, instead of making the Aurora implementation more flexible.
>>
>> As you've mentioned, the lifecycle endpoints abort and quit seem to be
>> quite orthogonal to the health checking idea. I would be in favor of
>> separating the different concepts. I even thought about this yesterday,
>> because in our environment we only want health checking but now also have
>> to pay a price of 10secs additional latency when stopping jobs due the
>> graceful kill escalation.
>>
>> [1] https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L141
>>
>> Regards,
>> Stephan
>>
>> ________________________________________
>> From: Florian Pfeiffer <florian.pfeif...@gutefrage.net>
>> Sent: Saturday, February 21, 2015 4:27 AM
>> To: dev@aurora.incubator.apache.org
>> Subject: RFC HealthCheck
>>
>> Hi,
>>
>> I would like to start working on the Healthchecker
>>
>> 1) Enable configuration of the portname to which run health checks on
>> (this should also tackle AURORA-321 )
>> This seems like a very small change consisting of adding a new variable
>> named „port“ to the HealthCheckConfig in base.py with a default value of
>> „health“ to be backwards compatible. Any pitfalls? Any objections?
>>
>> 2) There’s at least one ticket in jira that’s about making the endpoints
>> for the health check configurable. I would like to have a health check that
>> works on HTTP Status Codes, and there might be other people that are fine
>> with a health check that works on checking if it’s possible to make a TCP
>> connection
>>
>> For my use case I would probably be fine, if I add a variable „method“ to
>> the HealthCheckConfig, with a default value of „classic“ for the current
>> behavior and s.th. like „statuscode“ for a check that’s
>> very very similar to the current one in http_signaler.py but instead of
>> parsing the response checks the status code (with the downside that the
>> endpoints /health /abort /quitquitquit are still hardcoded)
>>
>> Any ideas how this can be a little bit more generic, so that if we have
>> 3-5 different types of health checks we can have different arguments to
>> each health check? (e.g. expected_response for the current one,
>> expected_code for the status code checker, and maybe s.th.
>> like max_response_time for defining how fast traffic has to appear on a tcp
>> connection check)
>>
>> A side question: for me it seems like /health and (/abort &
>> /quitquitquit) are not very closely related. Does it make sense to have
>> those 3 things grouped in the HealthCheck?
>>
>> Best,
>> Florian
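
For illustration only, here is a rough Python sketch of how the union arrangement at the top of this message might be consumed by whatever evaluates an HTTP check. It mirrors the Thrift shapes with plain dataclasses rather than generated Thrift code. The field names `expected_code` and `expected_response` are borrowed from Florian's examples; `port_name`, `endpoint`, the default values, and the helper functions are assumptions made up for this sketch and are not existing Aurora schema or API.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


@dataclass(frozen=True)
class TcpCheck:
    port_name: str = "health"  # hypothetical field; default mirrors point (1)


@dataclass(frozen=True)
class HttpStatusCheck:
    expected_code: int = 200  # assumed default


@dataclass(frozen=True)
class HttpPayloadCheck:
    expected_response: str = "ok"  # assumed default


# Stand-ins for the Thrift unions: exactly one member type is present.
HttpCheckCriteria = Union[HttpStatusCheck, HttpPayloadCheck]


@dataclass(frozen=True)
class HttpCheck:
    endpoint: str = "/health"  # hypothetical field
    criteria: FrozenSet[HttpCheckCriteria] = frozenset()


HealthCheck = Union[TcpCheck, HttpCheck]


def criterion_passes(criterion: HttpCheckCriteria, status_code: int, body: str) -> bool:
    """Evaluate a single criterion against an already-fetched HTTP response."""
    if isinstance(criterion, HttpStatusCheck):
        return status_code == criterion.expected_code
    if isinstance(criterion, HttpPayloadCheck):
        # Case-insensitive comparison on the trimmed body (a choice made for this sketch).
        return body.strip().lower() == criterion.expected_response.lower()
    raise TypeError("unknown criterion type: %r" % (criterion,))


def http_check_passes(check: HttpCheck, status_code: int, body: str) -> bool:
    """All criteria must hold; an empty criteria set passes trivially."""
    return all(criterion_passes(c, status_code, body) for c in check.criteria)
```

A check built as `HttpCheck(criteria=frozenset({HttpStatusCheck(), HttpPayloadCheck()}))` would then require both a 200 status and an "ok" body. Adding a new criterion type only means adding one dataclass and one branch in `criterion_passes`, which is the kind of extensibility the union layout is meant to buy.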