Firstly, please chime in on AURORA-634 to nudge us to formally document this.
There's a wealth of instrumentation exposed at /vars on the scheduler. To
rattle off a few that are a good fit for monitoring:

task_store_LOST
If this value is increasing at a high rate, it's a sign of trouble. Note:
this one is not monotonically increasing; it will decrease when old
terminated tasks are GCed.

scheduler_resource_offers
Must be increasing; the rate will depend on cluster size and the behavior
of other frameworks.

jvm_uptime_secs
Detecting resets on this value will tell you that the scheduler is failing
to stay alive.

framework_registered
If no schedulers have a '1' on this, then Aurora is not registered with
Mesos.

rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)
This gives you a moving window of log append latency. A hike in this value
suggests disk IOP contention.

timed_out_tasks
An increase in this value indicates that Aurora is moving tasks into
transient states (e.g. ASSIGNED, KILLING) but not hearing back from Mesos
promptly.

system_load_avg
A high sustained value here suggests that the machine may be
over-utilized.

http_500_responses_events
An increase here indicates internal server errors while responding to RPCs
and loading the web UI.

I'd love to know more about the specific issue you encountered. Do the
scheduler logs indicate anything unusual during the period of downtime?

-=Bill

On Tue, Sep 30, 2014 at 1:59 PM, Isaac Councill <is...@hioscar.com> wrote:
> I've been having a bad time with the great AWS Xen reboot, and thought it
> would be a good time to revamp monitoring among other things.
>
> Do you have any recommendations for monitoring scheduler health? I've got
> my own ideas, but am more interested in learning about twitter prod
> monitoring.
>
>
> For context, last night's failure:
>
> Running aurora-scheduler from head, cut last week. Could find the exact
> commit if interesting. Triple scheduler replication.
>
> 1) All cluster machines (mesos, aurora, zk) rebooted at once. Single AZ
> for this cluster.
> 2) mesos, zk came back online ok but aurora did not.
> 3) scheduler process and UI started but scheduler was unhealthy. Current
> monitoring cleared the down event because the processes were alive and
> answering 8081.
> 4) recovery was not possible until I downgraded to 0.5.0-incubating, at
> which point full recovery was made.
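
For anyone wiring the variables above into a check, here is a minimal
Python 3 sketch of a poller. The scheduler address, 60-second interval,
and alert thresholds are illustrative assumptions, not Aurora defaults,
and it assumes the plain-text "name value" lines served at /vars.

#!/usr/bin/env python3
"""Poll the scheduler's /vars endpoint and flag a few of the variables
discussed above. Address, interval, and thresholds are assumptions."""
import time
import urllib.request

SCHEDULER_VARS = "http://localhost:8081/vars"  # assumed scheduler address

def fetch_vars(url=SCHEDULER_VARS):
    # Assumes the plain-text "name value" format exposed at /vars.
    out = {}
    with urllib.request.urlopen(url, timeout=5) as resp:
        for line in resp.read().decode().splitlines():
            name, _, value = line.partition(" ")
            try:
                out[name] = float(value)
            except ValueError:
                out[name] = value
    return out

def check(prev, curr):
    alerts = []
    # framework_registered: the leading scheduler should report 1.
    if curr.get("framework_registered") != 1:
        alerts.append("framework_registered != 1 (not registered with mesos?)")
    # jvm_uptime_secs resetting means the scheduler process restarted.
    if curr.get("jvm_uptime_secs", 0) < prev.get("jvm_uptime_secs", 0):
        alerts.append("jvm_uptime_secs reset: scheduler restarted")
    # scheduler_resource_offers must keep increasing.
    if curr.get("scheduler_resource_offers", 0) <= prev.get("scheduler_resource_offers", 0):
        alerts.append("no new resource offers since last poll")
    # task_store_LOST growing quickly is a sign of trouble (threshold arbitrary).
    if curr.get("task_store_LOST", 0) - prev.get("task_store_LOST", 0) > 10:
        alerts.append("task_store_LOST increasing at a high rate")
    # Moving window of log append latency:
    # rate(scheduler_log_native_append_nanos_total) / rate(scheduler_log_native_append_events)
    d_nanos = (curr.get("scheduler_log_native_append_nanos_total", 0)
               - prev.get("scheduler_log_native_append_nanos_total", 0))
    d_events = (curr.get("scheduler_log_native_append_events", 0)
                - prev.get("scheduler_log_native_append_events", 0))
    if d_events:
        latency_ms = (d_nanos / d_events) / 1e6
        if latency_ms > 50:  # arbitrary threshold; a hike suggests disk IOP contention
            alerts.append("log append latency %.1f ms per event" % latency_ms)
    return alerts

if __name__ == "__main__":
    prev = fetch_vars()
    while True:
        time.sleep(60)  # assumed polling interval
        curr = fetch_vars()
        for msg in check(prev, curr):
            print("ALERT: " + msg)
        prev = curr

In practice you would feed these into whatever alerting system you already
run; the point is just that the rate-style checks (resource offers, log
append latency, task_store_LOST) need two samples, while gauges like
framework_registered and jvm_uptime_secs can be checked on each poll.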