SLA stats

Maxim Khutornenko Thu, 22 May 2014 13:29:32 -0700

I wanted to summarize a few details about the new SLA stats feature in the
scheduler, as I realized that the code comments and
AURORA-290 <https://issues.apache.org/jira/browse/AURORA-290> may not
give the whole picture clearly.


The primary goal of the feature is collection and monitoring of Aurora job
SLA (Service Level Agreements) metrics that define the contractual
relationship between the Aurora/Mesos platform and hosted services.


The feature is implemented as a background worker thread that periodically
computes job instance counters from the existing scheduler TaskEvents.
The per-instance core metrics are refreshed every minute (configurable).
They are then aggregated by grouping type (e.g. job, cluster, instance
size) before being exported to the scheduler /vars endpoint.
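
To illustrate the aggregation step, here is a rough Python sketch (not the
scheduler code, which is Java). The sample tuples, the job keys and the
aggregate() helper are made up for illustration; only the metric name
patterns come from the lists further down in this email.

```python
from collections import defaultdict
from statistics import median


def aggregate(samples):
    """Group per-instance samples and reduce each group to its median.

    samples: iterable of (job_key, cpu_size, time_to_running_ms) tuples.
    This tuple layout is a hypothetical stand-in for the per-instance
    counters; it is not the scheduler's data model.
    """
    groups = defaultdict(list)
    for job_key, cpu_size, value_ms in samples:
        # Every sample contributes to each grouping it belongs to.
        groups[f"sla_{job_key}_mttr_ms"].append(value_ms)
        groups["sla_cluster_mttr_ms"].append(value_ms)
        groups[f"sla_cpu_{cpu_size}_mttr_ms"].append(value_ms)
    return {name: median(values) for name, values in groups.items()}


samples = [
    ("www/prod/hello", "small", 100),
    ("www/prod/hello", "small", 300),
    ("ops/prod/batch", "large", 500),
]
stats = aggregate(samples)
```

The same per-instance value feeds several exported stats, which is why the
core counters are computed once and grouped afterwards.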


Exposed metrics and their meaning:


*Aggregate Regrettable Downtime (ARD)* - aggregate amount of time a job
spends in a non-runnable state due to platform unavailability or scheduling
delays.

Collection scope:

   - Per job - sla_<job_key>_platform_uptime_percent
   - Per cluster - sla_cluster_platform_uptime_percent



To accurately calculate ARD, we must separate platform incurred downtime
from user actions that put a service instance in a non-operational state.
It is simpler to isolate user-incurred downtime and treat all other
downtime as platform incurred.

Currently, a user can cause downtime for an existing service (task) in only
two ways: via the killTasks and restartShards RPCs. In both cases the
affected tasks leave an audit trail of state transitions relevant to ARD
calculations. By applying a special "SLA meaning" to the task state
transition records exposed by TaskEvents, we can build a deterministic
downtime trace for every service instance.

A task going through a state transition carries one of three possible ARD
meanings (see SlaAlgorithm.java for sla-to-task-state mapping):


   - Task is UP: starts a period where the task is considered to be up and
     running from the Aurora platform ARD standpoint.

   - Task is DOWN: starts a period where the task cannot reach the UP state
     for some non-user-related reason (see below). Counts towards instance
     downtime. Directly affects ARD.

   - Task is REMOVED from SLA: starts a period where the task is not
     expected to be UP due to a user-initiated action or failure. We ignore
     this period for downtime calculation purposes.



This metric is recalculated over the last sampling period (last minute) to
account for any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not
immediately adjacent to the sampling interval as well as adjacent REMOVED
events.
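
As a rough illustration of how the three meanings turn into an uptime
percentage, here is a simplified Python sketch. It is not the SlaAlgorithm.java
logic: it simply walks one instance's full transition history and clips each
period to the sampling window. The function name, the event tuple layout and
the choice to report 100% when nothing is counted are all assumptions.

```python
UP, DOWN, REMOVED = "UP", "DOWN", "REMOVED"


def platform_uptime_percent(events, window_start, window_end):
    """Uptime percentage for one instance over [window_start, window_end).

    events: list of (timestamp, meaning) sorted by timestamp, where each
    entry starts a period that lasts until the next entry (or window_end).
    """
    up = down = 0
    for i, (ts, meaning) in enumerate(events):
        next_ts = events[i + 1][0] if i + 1 < len(events) else window_end
        # Clip the period to the sampling window.
        start = max(ts, window_start)
        end = min(next_ts, window_end)
        if end <= start:
            continue
        if meaning == UP:
            up += end - start
        elif meaning == DOWN:
            down += end - start
        # REMOVED periods count as neither uptime nor downtime.
    counted = up + down
    # Assumption: an instance with nothing to count is reported as fully up.
    return 100.0 if counted == 0 else 100.0 * up / counted


trace = [(0, UP), (30, DOWN), (40, REMOVED), (50, UP)]
percent = platform_uptime_percent(trace, 0, 60)
```

In this trace the instance is up for 40 seconds, down for 10 and removed
from SLA for 10, so only 50 seconds are counted and the result is 80%.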




*Median Time To Assigned (MTTA)* - the median time a job waits for its
tasks to reach ASSIGNED state. This reflects how long it takes Aurora/Mesos
to place a task on a host.

*Median Time To Running (MTTR)* - the median time a job waits for its
tasks to reach RUNNING state. This is a comprehensive metric reflecting
the overall time it takes Aurora/Mesos to start executing user content.

Collection scope for both:

   - Per job - sla_<job_key>_[mtta|mttr]_ms
   - Per cluster - sla_cluster_[mtta|mttr]_ms
   - Per instance size (small, medium, large, x-large, xx-large)
      - By CPU:
         - sla_cpu_small_[mtta|mttr]_ms
         - sla_cpu_medium_[mtta|mttr]_ms
         - sla_cpu_large_[mtta|mttr]_ms
         - sla_cpu_xlarge_[mtta|mttr]_ms
         - sla_cpu_xxlarge_[mtta|mttr]_ms
      - By RAM:
         - sla_ram_small_[mtta|mttr]_ms
         - sla_ram_medium_[mtta|mttr]_ms
         - sla_ram_large_[mtta|mttr]_ms
         - sla_ram_xlarge_[mtta|mttr]_ms
         - sla_ram_xxlarge_[mtta|mttr]_ms
      - By DISK:
         - sla_disk_small_[mtta|mttr]_ms
         - sla_disk_medium_[mtta|mttr]_ms
         - sla_disk_large_[mtta|mttr]_ms
         - sla_disk_xlarge_[mtta|mttr]_ms
         - sla_disk_xxlarge_[mtta|mttr]_ms


See SlaGroup.java for more details on instance size mapping.
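
For a feel of how an instance lands in a size group, here is a small Python
sketch. The CPU bounds below are placeholders I picked for illustration and
are NOT the real ranges; SlaGroup.java remains the source of truth. The
helper names are hypothetical as well.

```python
# Hypothetical upper bounds (in CPU cores) for each size bucket.
# Placeholder values only: the real ranges live in SlaGroup.java.
CPU_BUCKETS = [(1.0, "small"), (4.0, "medium"), (8.0, "large"), (16.0, "xlarge")]


def size_bucket(value, buckets):
    """Return the first bucket whose upper bound covers the value."""
    for upper, name in buckets:
        if value <= upper:
            return name
    # Anything above the last bound falls into the largest bucket.
    return "xxlarge"


def metric_name(resource, value, buckets, metric):
    """Build a stat name such as sla_cpu_small_mtta_ms."""
    return f"sla_{resource}_{size_bucket(value, buckets)}_{metric}_ms"


name = metric_name("cpu", 0.5, CPU_BUCKETS, "mtta")
```

RAM and DISK would use the same lookup with their own bound tables.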

These metrics use all instances in non-terminal states at the moment of
calculation. This approach ensures that the sliding MTTA and MTTR stats are
sensitive enough to reflect newly scheduled tasks and ignore terminal
(e.g. KILLED) instances.

The MTTA only considers instances that have already reached ASSIGNED state
and ignores those that are still PENDING. Similarly, MTTR only considers
instances in RUNNING state. This ensures straggler instances (e.g. with
unreasonable resource constraints) do not affect metric curves.
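
The filtering rules above can be sketched in Python as follows. This is only
an approximation of the idea: the task dict layout, the median_time_to()
helper, the set of terminal states and the choice to measure the wait from
the first PENDING event are assumptions on my part.

```python
from statistics import median

# Assumed set of terminal states for this sketch.
TERMINAL = {"FINISHED", "FAILED", "KILLED", "LOST"}


def median_time_to(target_state, tasks):
    """Median wait (ms) from PENDING until target_state.

    tasks: list of dicts with a current 'status' and an 'events' list of
    (timestamp_ms, state) tuples in chronological order.
    """
    waits = []
    for task in tasks:
        if task["status"] in TERMINAL:
            continue  # terminal instances are ignored
        first_seen = {}
        for ts, state in task["events"]:
            first_seen.setdefault(state, ts)
        # Instances that have not reached the target state yet are skipped,
        # so stragglers stuck in PENDING do not skew the median.
        if "PENDING" in first_seen and target_state in first_seen:
            waits.append(first_seen[target_state] - first_seen["PENDING"])
    return median(waits) if waits else None


tasks = [
    {"status": "RUNNING",
     "events": [(0, "PENDING"), (100, "ASSIGNED"), (400, "RUNNING")]},
    {"status": "RUNNING",
     "events": [(0, "PENDING"), (300, "ASSIGNED"), (900, "RUNNING")]},
    {"status": "PENDING", "events": [(0, "PENDING")]},
    {"status": "KILLED",
     "events": [(0, "PENDING"), (5000, "ASSIGNED"), (6000, "RUNNING"),
                (7000, "KILLED")]},
]
mtta_ms = median_time_to("ASSIGNED", tasks)
mttr_ms = median_time_to("RUNNING", tasks)
```

Here the PENDING straggler and the KILLED instance are both left out, so only
the two running instances contribute to either median.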


*Job Uptime* - percentage of the job instances considered to be in running
state for the specified duration relative to request time. This is a
decaying metric, meaning the percentage usually drops as the duration
increases.

Collection scope - per job at pre-defined percentiles:

   - sla_<job_key>_job_uptime_50_00_sec
   - sla_<job_key>_job_uptime_75_00_sec
   - sla_<job_key>_job_uptime_90_00_sec
   - sla_<job_key>_job_uptime_95_00_sec
   - sla_<job_key>_job_uptime_99_00_sec


This is a scheduler version of the algorithm implemented in
AURORA-205 <https://issues.apache.org/jira/browse/AURORA-205>.
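
One way to read the job uptime stats is "the longest duration that at least
N percent of instances have been up for". The Python sketch below follows
that reading; it is my simplified interpretation rather than a copy of either
implementation, and job_uptime_sec() is a hypothetical name.

```python
import math


def job_uptime_sec(uptimes_sec, percentile):
    """Longest duration that at least `percentile` percent of instances
    have been continuously up for, or 0 if there are no instances.
    """
    if not uptimes_sec:
        return 0
    # Longest-running instances first.
    ordered = sorted(uptimes_sec, reverse=True)
    # Number of instances needed to cover the requested percentage.
    count = math.ceil(percentile * len(ordered) / 100)
    return ordered[max(count, 1) - 1]


uptimes = [100, 200, 300, 400]
p50 = job_uptime_sec(uptimes, 50)
p75 = job_uptime_sec(uptimes, 75)
p99 = job_uptime_sec(uptimes, 99)
```

With four instances up for 100 to 400 seconds, half of them have been up
for at least 300 seconds, but covering 99% means including the youngest
instance, so the value decays to 100 seconds.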

Feedback, comments and contributions are more than welcome!

Thanks, Maxim
