Re: monitoring aurora scheduler

2014-10-01 Thread Isaac Councill
Much appreciated.

On Wed, Oct 1, 2014 at 2:11 PM, Bill Farner wrote:
> Ok, when you have bandwidth to upgrade again feel free to let us know if
> you would like somebody standing by in IRC to assist.
>
> -=Bill
>
> On Wed, Oct 1, 2014 at 11:04 AM, Isaac Councill wrote:
>
> > Thanks! Comment dro

Re: monitoring aurora scheduler

2014-10-01 Thread Bill Farner
Ok, when you have bandwidth to upgrade again feel free to let us know if you would like somebody standing by in IRC to assist.

-=Bill

On Wed, Oct 1, 2014 at 11:04 AM, Isaac Councill wrote:
> Thanks! Comment dropped on AURORA-634.
>
> As for the error I encountered, I saw "Storage is not READY"

Re: monitoring aurora scheduler

2014-10-01 Thread Isaac Councill
Thanks! Comment dropped on AURORA-634. As for the error I encountered, I saw "Storage is not READY" exceptions on all scheduler instances, and no leader was elected. Nothing other than that jumped out as unusual in the logs - no ZK_* warnings/errors etc. Aurora came up before zookeeper, but auror

Re: monitoring aurora scheduler

2014-09-30 Thread Bill Farner
Firstly, please chime in on AURORA-634 to nudge us to formally document this.

There's a wealth of instrumentation exposed at /vars on the scheduler. To rattle off a few that are a good fit for monitoring:

task_store_LOST
If this value is increasing at a high rate, it's a sign of trouble.

Note:
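
As an illustration of the task_store_LOST check mentioned in this message, here is a minimal Python sketch of turning two snapshots of /vars into a rate. It assumes the endpoint returns plain text with one "name value" pair per line. The host and port, the helper names (fetch_vars, parse_vars, rate_per_second) and any alert threshold are hypothetical and not taken from the thread.

```python
import urllib.request
from typing import Dict


def fetch_vars(url: str, timeout: float = 5.0) -> str:
    """Fetch the raw /vars body. The URL is supplied by the caller (hypothetical)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")


def parse_vars(text: str) -> Dict[str, float]:
    """Parse 'name value' lines, keeping only entries whose value is numeric.

    Assumption: one stat per line, name and value separated by a single space.
    """
    stats: Dict[str, float] = {}
    for line in text.splitlines():
        name, _, value = line.strip().partition(" ")
        try:
            stats[name] = float(value)
        except ValueError:
            continue  # skip blank lines and non-numeric stats
    return stats


def rate_per_second(prev: Dict[str, float], curr: Dict[str, float],
                    key: str, interval_secs: float) -> float:
    """Increase of a counter per second between two snapshots."""
    return (curr.get(key, 0.0) - prev.get(key, 0.0)) / interval_secs
```

In use, one would call fetch_vars against the scheduler (for example a URL of the form http://SCHEDULER_HOST:PORT/vars, a placeholder), wait a fixed interval, fetch again, and alert if rate_per_second(prev, curr, "task_store_LOST", interval) exceeds a threshold chosen for the cluster.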

monitoring aurora scheduler

2014-09-30 Thread Isaac Councill
I've been having a bad time with the great AWS Xen reboot, and thought it would be a good time to revamp monitoring among other things. Do you have any recommendations for monitoring scheduler health? I've got my own ideas, but am more interested in learning about Twitter prod monitoring. For co