monitoring aurora scheduler

Isaac Councill Tue, 30 Sep 2014 14:00:45 -0700

I've been having a bad time with the great AWS Xen reboot, and thought it
would be a good time to revamp monitoring among other things.


Do you have any recommendations for monitoring scheduler health? I've got
my own ideas, but am more interested in learning about twitter prod
monitoring.


For context, last night's failure:

Running aurora-scheduler from head, cut last week. Could find the exact
commit if interesting. Triple scheduler replication.

1) All cluster machines (mesos, aurora, zk) rebooted at once. Single AZ for
this cluster.
2) mesos, zk came back online ok but aurora did not.
3) scheduler process and UI started but scheduler was unhealthy. Current
monitoring cleared the down event because the processes were alive and
answering 8081.
4) recovery was not possible until I downgraded to 0.5.0-incubating, at
which point full recovery was made.

monitoring aurora scheduler

Reply via email to