To answer OP:

(1) seems perfectly reasonable; I don't foresee any pitfalls.

(2) seems reasonable as well.  Thrift unions help a bit here.  Just
spitballing, but this general arrangement comes to mind:

struct TcpCheck {
  ...
}

struct HttpStatusCheck {
  ...
}

struct HttpPayloadCheck {
  ...
}


union HttpCheckCriteria {
  1: HttpStatusCheck status
  2: HttpPayloadCheck payload
}

struct HttpCheck {
  ...
  n: set<HttpCheckCriteria> criteria
}

union HealthCheck {
  1: TcpCheck tcp
  2: HttpCheck http
}


We could obviously get pretty complicated with this if we choose to, but
starting with some opinionated defaults and an extensible structure may be
key.
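
To make the shape above a bit more concrete, here is a rough Python sketch of how the executor side might evaluate such a structure. It is only an illustration, not existing Aurora code: the field names (expected_code, expected_response, max_response_time_secs, endpoint) are hypothetical, borrowed from the argument names floated in the original RFC, and the function http_response_passes as well as the trimmed, case-insensitive payload comparison are assumptions of mine. The Thrift unions are modelled as plain typing.Union aliases.

```python
from dataclasses import dataclass
from typing import FrozenSet, Union


# All field names are hypothetical; the Thrift sketch leaves them elided.
@dataclass(frozen=True)
class HttpStatusCheck:
    expected_code: int = 200


@dataclass(frozen=True)
class HttpPayloadCheck:
    expected_response: str = "ok"


# Stand-in for the Thrift union: exactly one of the two variants.
HttpCheckCriteria = Union[HttpStatusCheck, HttpPayloadCheck]


@dataclass(frozen=True)
class HttpCheck:
    endpoint: str = "/health"
    # Opinionated default: only require a 200 status code.
    criteria: FrozenSet[HttpCheckCriteria] = frozenset({HttpStatusCheck()})


@dataclass(frozen=True)
class TcpCheck:
    max_response_time_secs: float = 1.0


HealthCheck = Union[TcpCheck, HttpCheck]


def http_response_passes(check: HttpCheck, status: int, body: str) -> bool:
    """Return True only if the response satisfies every criterion in the set."""
    for criterion in check.criteria:
        if isinstance(criterion, HttpStatusCheck):
            if status != criterion.expected_code:
                return False
        elif isinstance(criterion, HttpPayloadCheck):
            # Assumed comparison: trimmed and case-insensitive.
            if body.strip().lower() != criterion.expected_response:
                return False
        else:
            raise TypeError(f"unknown criterion: {criterion!r}")
    return True
```

With this arrangement a new kind of criterion is one more dataclass plus one more branch, and a job that specifies nothing gets the default status-code check.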

I also agree that graceful teardown should be decoupled from health checks.

-=Bill

On Sat, Feb 21, 2015 at 10:35 AM, Bill Farner <wfar...@apache.org> wrote:

> If I'm reading the code correctly, the only way to use Mesos' health
> checks is with the command executor?  Can somebody check my work on that?
>
> Some other context around health checks to keep in mind:
> - there is a review [1] in-flight for the executor to delay the transition
> to RUNNING until the first positive health check [2]
> - we want to make the scheduler the authority for reacting to health check
> failures [3].  This is a very real concern for large services, to avoid
> simultaneous failures
>
> [1] https://reviews.apache.org/r/31104/
> [2] https://issues.apache.org/jira/browse/AURORA-894
> [3] https://issues.apache.org/jira/browse/AURORA-279
>
>
> -=Bill
>
> On Sat, Feb 21, 2015 at 3:48 AM, Erb, Stephan <stephan....@blue-yonder.com
> > wrote:
>
>> Hi Florian,
>>
>> have you looked at what Mesos is already offering out of the box [1]?
>> Maybe there is a way to implement your features by relying on Mesos
>> directly, instead of making the Aurora implementation more flexible.
>>
>> As you've mentioned, the lifecycle endpoints abort and quit seem to be
>> quite orthogonal to the health checking idea. I would be in favor of
>> separating the different concepts. I even thought about this yesterday,
>> because in our environment we only want health checking but now also have
>> to pay a price of 10 secs additional latency when stopping jobs due to the
>> graceful kill escalation.
>>
>> [1]
>> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L141
>>
>>
>> Regards,
>> Stephan
>>
>> ________________________________________
>> From: Florian Pfeiffer <florian.pfeif...@gutefrage.net>
>> Sent: Saturday, February 21, 2015 4:27 AM
>> To: dev@aurora.incubator.apache.org
>> Subject: RFC HealthCheck
>>
>> Hi,
>>
>> I would like to start working on the Healthchecker.
>>
>> 1) Enable configuration of the port name on which to run health checks
>> (this should also tackle AURORA-321)
>> This seems like a very small change consisting of adding a new variable
>> named „port“ to the HealthCheckConfig in base.py with a default value of
>> „health“ to be backwards compatible. Any pitfalls? Any objections?
>>
>> 2) There’s at least one ticket in jira that’s about making the endpoints
>> for the health check configurable. I would like to have a health check that
>> works on HTTP Status Codes, and there might be other people that are fine
>> with a health check that works on checking if it’s possible to make a TCP
>> connection
>>
>> For my use case I would probably be fine if I add a variable „method“ to
>> the HealthCheckConfig, with a default value of „classic“ for the current
>> behavior and something like „statuscode“ for a check that’s very
>> similar to the current one in http_signaler.py but, instead of parsing
>> the response, checks the status code (with the downside that the
>> endpoints /health /abort /quitquitquit are still hardcoded)
>>
>> Any ideas how this can be a little bit more generic, so that if we have
>> 3-5 different types of health checks we can have different arguments to
>> each health check? (e.g. expected_response for the current one,
>> expected_code for the status code checker, and maybe something like
>> max_response_time for defining how fast traffic has to appear on a tcp
>> connection check)
>>
>>
>> A side question: for me it seems like /health and (/abort &
>> /quitquitquit) are not very closely related. Does it make sense to have
>> those 3 things grouped in the HealthCheck?
>>
>>
>> Best,
>> Florian
>>
>>
>>
>
