I've been having a bad time with the great AWS Xen reboot, and thought it would be a good time to revamp monitoring among other things.
Do you have any recommendations for monitoring scheduler health? I've got my own ideas, but I'm more interested in learning how Twitter monitors it in production.

For context, here is last night's failure. I was running aurora-scheduler built from HEAD, cut last week (I can find the exact commit if that would help), with triple scheduler replication.

1) All cluster machines (mesos, aurora, zk) rebooted at once. This cluster lives in a single AZ.
2) mesos and zk came back online OK, but aurora did not.
3) The scheduler process and UI started, but the scheduler was unhealthy. My current monitoring cleared the down event because the processes were alive and answering on 8081.
4) Recovery was not possible until I downgraded to 0.5.0-incubating, at which point the cluster recovered fully.
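To make the gap in step 3 concrete, here is a rough sketch of the kind of check I have in mind: instead of treating "the port answers" as healthy, fetch the scheduler's exported stats and require specific values. This is only a sketch under assumptions. I'm assuming the scheduler serves plain-text "name value" stats at /vars on 8081, and that a stat like framework_registered is a useful signal; the hostname is made up, and both the path and the stat name should be verified against a running scheduler.

```python
import urllib.request


def parse_vars(text):
    """Parse plain-text 'name value' lines into a dict of strings.

    Lines that do not split into exactly a name and a value are skipped.
    """
    stats = {}
    for line in text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            stats[parts[0]] = parts[1].strip()
    return stats


def is_healthy(stats, required):
    """True only if every required stat is present with the expected value."""
    return all(stats.get(name) == value for name, value in required.items())


def check_scheduler(url, required, timeout=5.0):
    """Fetch the stats page and evaluate it.

    Connection failures, HTTP errors, and timeouts count as unhealthy.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        return False
    return is_healthy(parse_vars(body), required)


if __name__ == "__main__":
    # Hypothetical host; the /vars path and stat name are assumptions.
    ok = check_scheduler(
        "http://scheduler.example.com:8081/vars",
        {"framework_registered": "1"},
    )
    print("healthy" if ok else "UNHEALTHY")
```

With three replicas, presumably only the leader would report as registered, so the alert would need to fire when no replica passes rather than when any single one fails.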