I wanted to summarize a few details about the new SLA stats feature in the scheduler, as I realized that the code comments and AURORA-290 <https://issues.apache.org/jira/browse/AURORA-290> may not tell the whole story clearly.

The primary goal of the feature is the collection and monitoring of Aurora job SLA (Service Level Agreement) metrics that define the contractual relationship between the Aurora/Mesos platform and hosted services.

The feature is implemented as a background worker thread that periodically computes job instance counters from the existing scheduler TaskEvents. The individual instance core metrics are refreshed every minute (configurable). The core instance counters are then aggregated by a relevant grouping type (e.g. Job, Cluster, Instance Size) before being exported to the scheduler /vars endpoint.

Exposed metrics and their meaning:

*Aggregate Regrettable Downtime (ARD)* - the aggregate amount of time a job spends in a non-runnable state due to platform unavailability or scheduling delays.

Collection scope:
- Per job - sla_<job_key>_platform_uptime_percent
- Per cluster - sla_cluster_platform_uptime_percent

To accurately calculate ARD, we must separate platform-incurred downtime from user actions that put a service instance in a non-operational state. It is simpler to isolate user-incurred downtime and treat all other downtime as platform-incurred.

Currently, a user can cause downtime for an existing service (task) in only two ways: via the killTasks and restartShards RPCs. In both cases, the affected tasks leave an audit trail of state transitions relevant to ARD calculations. By applying a special “SLA meaning” to the task state transition records exposed via TaskEvents, we can build a deterministic downtime trace for every service instance.

A task going through a state transition carries one of three possible ARD meanings (see SlaAlgorithm.java for the sla-to-task-state mapping):
- Task is UP: starts a period where the task is considered to be up and running from the Aurora platform ARD standpoint.
- Task is DOWN: starts a period where the task cannot reach the UP state for some non-user-related reason (see below). Counts towards instance downtime. Directly affects ARD.
- Task is REMOVED from SLA: starts a period where the task is not expected to be UP due to a user-initiated action or failure. We ignore this period for downtime calculation purposes.

This metric is recalculated over the last sampling period (last minute) to account for any UP/DOWN/REMOVED events. It ignores any UP/DOWN events not immediately adjacent to the sampling interval, as well as adjacent REMOVED events.

*Median Time To Assigned (MTTA)* - the median time a job waits for its tasks to reach ASSIGNED state, i.e. to be placed on a host.

*Median Time To Running (MTTR)* - the median time a job waits for its tasks to reach RUNNING state. This is a comprehensive metric reflecting the overall time it takes Aurora/Mesos to start executing user content.

Collection scope for both:
- Per job - sla_<job_key>_[mtta|mttr]_ms
- Per cluster - sla_cluster_[mtta|mttr]_ms
- Per instance size (small, medium, large, x-large, xx-large)
  - By CPU:
    - sla_cpu_small_[mtta|mttr]_ms
    - sla_cpu_medium_[mtta|mttr]_ms
    - sla_cpu_large_[mtta|mttr]_ms
    - sla_cpu_xlarge_[mtta|mttr]_ms
    - sla_cpu_xxlarge_[mtta|mttr]_ms
  - By RAM:
    - sla_ram_small_[mtta|mttr]_ms
    - sla_ram_medium_[mtta|mttr]_ms
    - sla_ram_large_[mtta|mttr]_ms
    - sla_ram_xlarge_[mtta|mttr]_ms
    - sla_ram_xxlarge_[mtta|mttr]_ms
  - By DISK:
    - sla_disk_small_[mtta|mttr]_ms
    - sla_disk_medium_[mtta|mttr]_ms
    - sla_disk_large_[mtta|mttr]_ms
    - sla_disk_xlarge_[mtta|mttr]_ms
    - sla_disk_xxlarge_[mtta|mttr]_ms

See SlaGroup.java for more details on the instance size mapping.

These metrics use all instances in non-terminal states at the moment of calculation. This approach ensures that the sliding MTTA and MTTR stats are sensitive enough to reflect newly scheduled tasks and ignore terminal (e.g. KILLED) instances. MTTA only considers instances that have already reached ASSIGNED state and ignores those that are still PENDING.
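
To make the instance selection for MTTA and MTTR concrete, here is a rough Python sketch. It is not the scheduler code (that lives in SlaAlgorithm.java). The `Instance` record, its field names, the helper functions and the exact set of terminal states are all hypothetical, and it assumes the wait is measured from the moment a task became PENDING.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Optional

# Assumption: the set of terminal states; the text above only names KILLED as an example.
TERMINAL_STATES = {"KILLED", "FINISHED", "FAILED", "LOST"}


@dataclass
class Instance:
    """Hypothetical record for one job instance."""
    state: str              # current task state, e.g. "PENDING", "ASSIGNED", "RUNNING"
    events: Dict[str, int]  # state name -> timestamp (ms) when that state was entered


def _median_wait_ms(instances: List[Instance], target: str) -> Optional[float]:
    # Assumption: the wait runs from the PENDING timestamp to the target state.
    waits = [
        inst.events[target] - inst.events["PENDING"]
        for inst in instances
        if target in inst.events and "PENDING" in inst.events
    ]
    return median(waits) if waits else None


def mtta_ms(instances: List[Instance]) -> Optional[float]:
    """Median time to ASSIGNED over non-terminal instances that reached ASSIGNED."""
    live = [i for i in instances if i.state not in TERMINAL_STATES]
    return _median_wait_ms(live, "ASSIGNED")


def mttr_ms(instances: List[Instance]) -> Optional[float]:
    """Median time to RUNNING over instances currently in RUNNING state."""
    running = [i for i in instances if i.state == "RUNNING"]
    return _median_wait_ms(running, "RUNNING")


sample = [
    Instance("RUNNING", {"PENDING": 0, "ASSIGNED": 1000, "RUNNING": 4000}),
    Instance("RUNNING", {"PENDING": 0, "ASSIGNED": 3000, "RUNNING": 5000}),
    Instance("ASSIGNED", {"PENDING": 0, "ASSIGNED": 2000}),  # counts for MTTA only
    Instance("PENDING", {"PENDING": 0}),                     # ignored by both metrics
    Instance("KILLED", {"PENDING": 0, "ASSIGNED": 100, "RUNNING": 200}),  # terminal, ignored
]

print(mtta_ms(sample))  # 2000
print(mttr_ms(sample))  # 4500.0
```

In this sample the PENDING and KILLED instances do not contribute to either value, so a task stuck waiting for resources cannot drag the medians up.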
Similarly, MTTR only considers instances in RUNNING state. This ensures straggler instances (e.g. with unreasonable resource constraints) do not affect metric curves.

*Job Uptime* - percentage of the job instances considered to be in running state for the specified duration relative to request time. This is a decaying metric, meaning the percentage usually drops as the duration increases.

Collection scope - per job at pre-defined percentiles:
- sla_<job_key>_job_uptime_50_00_sec
- sla_<job_key>_job_uptime_75_00_sec
- sla_<job_key>_job_uptime_90_00_sec
- sla_<job_key>_job_uptime_95_00_sec
- sla_<job_key>_job_uptime_99_00_sec

This is a scheduler version of the algorithm implemented in AURORA-205 <https://issues.apache.org/jira/browse/AURORA-205>.

Feedback, comments and contributions are more than welcome!

Thanks,
Maxim