Aha, docs in the proto file confirm my read of the implementation:

> // A health check for the task (currently in *alpha* and initial
> // support will only be for TaskInfo's that have a CommandInfo).
> optional HealthCheck health_check = 8;
-=Bill

On Sat, Feb 21, 2015 at 10:47 AM, Bill Farner <wfar...@apache.org> wrote:

> To answer OP:
>
> (1) seems perfectly reasonable, i don't foresee any pitfalls
>
> (2) seems reasonable as well. Thrift unions help a bit here. Just
> spitballing, but this general arrangement comes to mind:
>
> struct TcpCheck {
>   ...
> }
>
> struct HttpStatusCheck {
>   ...
> }
>
> struct HttpPayloadCheck {
>   ...
> }
>
> union HttpCheckCriteria {
>   1: HttpStatusCheck status
>   2: HttpPayloadCheck payload
> }
>
> struct HttpCheck {
>   ...
>   n: set<HttpCheckCriteria> criteria
> }
>
> union HealthCheck {
>   1: TcpCheck tcp
>   2: HttpCheck http
> }
>
> We could obviously get pretty complicated with this if we choose to, but
> starting with some opinionated defaults and an extensible structure may be
> key.
>
> I also agree that graceful teardown should be decoupled from health checks.
>
> -=Bill
>
> On Sat, Feb 21, 2015 at 10:35 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> If i'm reading the code correctly, the only way to use mesos' health
>> checks is with the command executor? Can somebody check my work on that?
>>
>> Some other context around health checks to keep in mind:
>> - there is a review [1] in-flight for the executor to delay the
>>   transition to RUNNING until the first positive health check [2]
>> - we want to make the scheduler the authority for reacting to health
>>   check failures [3]. this is a very real concern for large services to
>>   avoid simultaneous failures
>>
>> [1] https://reviews.apache.org/r/31104/
>> [2] https://issues.apache.org/jira/browse/AURORA-894
>> [3] https://issues.apache.org/jira/browse/AURORA-279
>>
>> -=Bill
>>
>> On Sat, Feb 21, 2015 at 3:48 AM, Erb, Stephan <
>> stephan....@blue-yonder.com> wrote:
>>
>>> Hi Florian,
>>>
>>> have you looked at what Mesos is already offering out of the box [1]?
>>> Maybe there is a way to implement your features by relying on Mesos
>>> directly, instead of making the Aurora implementation more flexible.
>>>
>>> As you've mentioned, the lifecycle endpoints abort and quit seem to be
>>> quite orthogonal to the health checking idea. I would be in favor of
>>> separating the different concepts. I even thought about this yesterday,
>>> because in our environment we only want health checking but now also have
>>> to pay a price of 10 seconds of additional latency when stopping jobs due
>>> to the graceful kill escalation.
>>>
>>> [1]
>>> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L141
>>>
>>> Regards,
>>> Stephan
>>>
>>> ________________________________________
>>> From: Florian Pfeiffer <florian.pfeif...@gutefrage.net>
>>> Sent: Saturday, February 21, 2015 4:27 AM
>>> To: dev@aurora.incubator.apache.org
>>> Subject: RFC HealthCheck
>>>
>>> Hi,
>>>
>>> I would like to start working on the Healthchecker
>>>
>>> 1) Enable configuration of the port name on which to run health checks
>>> (this should also tackle AURORA-321).
>>> This seems like a very small change consisting of adding a new variable
>>> named „port“ to the HealthCheckConfig in base.py with a default value of
>>> „health“ to be backwards compatible. Any pitfalls? Any objections?
>>>
>>> 2) There’s at least one ticket in jira that’s about making the endpoints
>>> for the health check configurable. I would like to have a health check that
>>> works on HTTP status codes, and there might be other people that are fine
>>> with a health check that works by checking if it’s possible to make a TCP
>>> connection.
>>>
>>> For my use case I would probably be fine if I add a variable „method“
>>> to the HealthCheckConfig, with a default value of „classic“ for the
>>> current behavior and something
>>> like „statuscode“ for a check
>>> that’s very similar to the current one in http_signaler.py but, instead
>>> of parsing the response, checks the status code (with the downside that the
>>> endpoints /health /abort /quitquitquit are still hardcoded)
>>>
>>> Any ideas how this can be a little bit more generic, so that if we have
>>> 3-5 different types of health checks we can have different arguments to
>>> each health check? (e.g. expected_response for the current one,
>>> expected_code for the status code checker, and maybe something
>>> like max_response_time for defining how fast traffic has to appear on a TCP
>>> connection check)
>>>
>>> A side question: for me it seems like /health and (/abort &
>>> /quitquitquit) are not very closely related. Does it make sense to have
>>> those 3 things grouped in the HealthCheck?
>>>
>>> Best,
>>> Florian
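
For illustration only, here is a minimal Python sketch of how the per-check arguments discussed in this thread could be modelled along the lines of the Thrift union arrangement. It is not Aurora's actual HealthCheckConfig. The class layout, the `passes` methods, and the default values other than the „health“ port name and the /health endpoint are assumptions made for the example, and reading max_response_time as an upper bound on connect time is one possible interpretation. The classes only evaluate an already-obtained result and perform no network I/O.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class HttpStatusCheck:
    """Hypothetical criterion: passes when the status code matches."""
    expected_code: int = 200

    def passes(self, status: int, body: str) -> bool:
        return status == self.expected_code


@dataclass(frozen=True)
class HttpPayloadCheck:
    """Hypothetical criterion: passes when the response body matches."""
    expected_response: str = "ok"  # assumed default, for illustration

    def passes(self, status: int, body: str) -> bool:
        return body.strip() == self.expected_response


# Plays the role of the HttpCheckCriteria union from the Thrift sketch.
HttpCheckCriteria = Union[HttpStatusCheck, HttpPayloadCheck]


@dataclass(frozen=True)
class HttpCheck:
    port: str = "health"       # port name, as proposed in (1)
    endpoint: str = "/health"  # configurable instead of hardcoded
    criteria: Tuple[HttpCheckCriteria, ...] = (HttpPayloadCheck(),)

    def passes(self, status: int, body: str) -> bool:
        # Every configured criterion has to hold for the check to pass.
        return all(c.passes(status, body) for c in self.criteria)


@dataclass(frozen=True)
class TcpCheck:
    port: str = "health"
    max_response_time: float = 1.0  # seconds; assumed interpretation

    def passes(self, connect_seconds: float) -> bool:
        return connect_seconds <= self.max_response_time


# Plays the role of the top-level HealthCheck union.
HealthCheck = Union[TcpCheck, HttpCheck]
```

With this shape, a payload-based configuration would be `HttpCheck()` under the assumed defaults, and a status-code-only check would be `HttpCheck(criteria=(HttpStatusCheck(expected_code=200),))`. Each check type carries only its own arguments.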