Hi Florian,

have you looked at what Mesos already offers out of the box [1]? Maybe there is a way to implement your features by relying on Mesos directly, instead of making the Aurora implementation more flexible.

As you've mentioned, the lifecycle endpoints abort and quit seem to be quite orthogonal to the health checking idea. I would be in favor of separating the different concepts. I even thought about this yesterday, because in our environment we only want health checking, but now we also pay a price of 10 seconds of additional latency when stopping jobs due to the graceful kill escalation.

[1] https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L141

Regards,
Stephan

________________________________________
From: Florian Pfeiffer <florian.pfeif...@gutefrage.net>
Sent: Saturday, February 21, 2015 4:27 AM
To: dev@aurora.incubator.apache.org
Subject: RFC HealthCheck

Hi,

I would like to start working on the Healthchecker.

1) Enable configuration of the port name on which health checks run (this should also tackle AURORA-321).
This seems like a very small change: add a new variable named "port" to the HealthCheckConfig in base.py, with a default value of "health" to stay backwards compatible. Any pitfalls? Any objections?

2) There's at least one ticket in Jira about making the endpoints for the health check configurable. I would like to have a health check that works on HTTP status codes, and there might be other people who are fine with a health check that only checks whether a TCP connection can be made.
For my use case I would probably be fine with adding a variable "method" to the HealthCheckConfig, with a default value of "classic" for the current behavior and something like "statuscode" for a check that's very similar to the current one in http_signaler.py but checks the status code instead of parsing the response (with the downside that the endpoints /health, /abort and /quitquitquit are still hardcoded).

Any ideas how this can be made a little more generic, so that if we have 3-5 different types of health checks, each one can take different arguments? (e.g. expected_response for the current one, expected_code for the status code checker, and maybe something like max_response_time for defining how fast traffic has to appear on a TCP connection check)

A side question: to me it seems like /health and (/abort & /quitquitquit) are not very closely related. Does it make sense to have those 3 things grouped in the HealthCheck?

Best,
Florian
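
To make the "more generic" question above concrete, here is a minimal sketch of one possible shape: a "method" name selects a checker, and a free-form argument mapping carries the per-checker options. It uses plain Python dataclasses rather than Aurora's actual configuration schema, so everything except the names taken from the mail ("port", "method", "classic", "statuscode", expected_response, expected_code) is hypothetical: the Response type, the CHECKERS registry, the "args" field and is_healthy() are illustrative only. The default expected response of "ok" is also an assumption about the current behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for the result of querying the health endpoint."""
    status_code: int
    body: str


def check_classic(resp: Response, expected_response: str = "ok") -> bool:
    # Compares the response body, ignoring case and surrounding whitespace.
    # The default "ok" is an assumption, not confirmed by the thread.
    return resp.body.strip().lower() == expected_response.lower()


def check_statuscode(resp: Response, expected_code: int = 200) -> bool:
    # Looks only at the HTTP status code and ignores the body.
    return resp.status_code == expected_code


# Hypothetical registry mapping a "method" name to its checker function.
CHECKERS: Dict[str, Callable[..., bool]] = {
    "classic": check_classic,
    "statuscode": check_statuscode,
}


@dataclass(frozen=True)
class HealthCheckConfig:
    """Sketch only; not the real HealthCheckConfig from base.py."""
    port: str = "health"      # default keeps existing configs working
    method: str = "classic"   # default keeps the current behavior
    args: Dict[str, Any] = field(default_factory=dict)  # per-method options


def is_healthy(config: HealthCheckConfig, resp: Response) -> bool:
    """Dispatch to the checker named by config.method, passing its arguments."""
    try:
        checker = CHECKERS[config.method]
    except KeyError:
        raise ValueError("unknown health check method: %r" % config.method)
    return checker(resp, **config.args)
```

With this shape, HealthCheckConfig(method="statuscode", args={"expected_code": 204}) would accept a 204 response, while a config that sets nothing keeps the "health" port and the "classic" comparison. A TCP connection check with max_response_time could be registered the same way, although it would need a connection attempt as input rather than an HTTP response.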