Much appreciated.

On Wed, Oct 1, 2014 at 2:11 PM, Bill Farner <wfar...@apache.org> wrote:
> Ok, when you have bandwidth to upgrade again, feel free to let us know if
> you would like somebody standing by in IRC to assist.
>
> -=Bill
>
> On Wed, Oct 1, 2014 at 11:04 AM, Isaac Councill <is...@hioscar.com> wrote:
>
> > Thanks! Comment dropped on AURORA-634.
> >
> > As for the error I encountered, I saw "Storage is not READY" exceptions
> > on all scheduler instances, and no leader was elected. Nothing other than
> > that jumped out as unusual in the logs - no ZK_* warnings/errors etc.
> >
> > Aurora came up before zookeeper, but aurora polled until zk was available.
> > Aurora also came up before a mesos master was available and committed
> > suicide on registration failure. Monit restarted the service eventually,
> > so that shouldn't have been a problem.
> >
> > Sadly, I've had to abandon full diagnosis due to time constraints.
> >
> > On Tue, Sep 30, 2014 at 5:33 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > Firstly, please chime in on AURORA-634 to nudge us to formally document
> > > this.
> > >
> > > There's a wealth of instrumentation exposed at /vars on the scheduler.
> > > To rattle off a few that are a good fit for monitoring:
> > >
> > > task_store_LOST
> > > If this value is increasing at a high rate, it's a sign of trouble.
> > > Note: this one is not monotonically increasing; it will decrease when
> > > old terminated tasks are GCed.
> > >
> > > scheduler_resource_offers
> > > Must be increasing; the rate will depend on cluster size and the
> > > behavior of other frameworks.
> > >
> > > jvm_uptime_secs
> > > Detecting resets on this value will tell you that the scheduler is
> > > failing to stay alive.
> > >
> > > framework_registered
> > > If no scheduler has a '1' on this, then Aurora is not registered with
> > > mesos.
> > >
> > > rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)
> > > This gives you a moving window of log append latency. A hike in this
> > > value suggests disk IOP contention.
> > >
> > > timed_out_tasks
> > > An increase in this value indicates that Aurora is moving tasks into
> > > transient states (e.g. ASSIGNED, KILLING) but not hearing back from
> > > mesos promptly.
> > >
> > > system_load_avg
> > > A high sustained value here suggests that the machine may be
> > > over-utilized.
> > >
> > > http_500_responses_events
> > > An increase here indicates internal server errors while responding to
> > > RPCs and loading the web UI.
> > >
> > > I'd love to know more about the specific issue you encountered. Do the
> > > scheduler logs indicate anything unusual during the period of downtime?
> > >
> > > -=Bill
> > >
> > > On Tue, Sep 30, 2014 at 1:59 PM, Isaac Councill <is...@hioscar.com> wrote:
> > >
> > > > I've been having a bad time with the great AWS Xen reboot, and thought
> > > > it would be a good time to revamp monitoring among other things.
> > > >
> > > > Do you have any recommendations for monitoring scheduler health? I've
> > > > got my own ideas, but am more interested in learning about twitter
> > > > prod monitoring.
> > > >
> > > > For context, last night's failure:
> > > >
> > > > Running aurora-scheduler from head, cut last week. Could find the
> > > > exact commit if interesting. Triple scheduler replication.
> > > >
> > > > 1) All cluster machines (mesos, aurora, zk) rebooted at once. Single
> > > > AZ for this cluster.
> > > > 2) mesos and zk came back online ok, but aurora did not.
> > > > 3) The scheduler process and UI started, but the scheduler was
> > > > unhealthy. Current monitoring cleared the down event because the
> > > > processes were alive and answering on 8081.
> > > > 4) Recovery was not possible until I downgraded to 0.5.0-incubating,
> > > > at which point full recovery was made.
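A rough sketch of how the /vars checks Bill lists above might be wired into a
simple poller. This is illustrative only, not from the thread or the Aurora
distribution: the localhost:8081 address, the poll interval, the alert wording,
and the plain-text "name value" format of /vars are all assumptions to verify
against your own cluster.

#!/usr/bin/env python3
"""Poll an Aurora scheduler's /vars endpoint and derive a few of the health
signals discussed above. Sketch only; host/port and formats are assumptions."""

import time
import urllib.request

VARS_URL = "http://localhost:8081/vars"  # assumed scheduler address
POLL_INTERVAL_SECS = 60                  # arbitrary window size


def fetch_vars():
    """Return /vars as a dict of metric name -> float; non-numeric vars are skipped."""
    metrics = {}
    body = urllib.request.urlopen(VARS_URL).read().decode("utf-8", "replace")
    for line in body.splitlines():
        name, _, value = line.partition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass
    return metrics


def check(prev, cur):
    # A reset of jvm_uptime_secs means the scheduler process restarted.
    if cur.get("jvm_uptime_secs", 0) < prev.get("jvm_uptime_secs", 0):
        print("ALERT: jvm_uptime_secs reset -- scheduler restarted")

    # framework_registered should be 1 on whichever scheduler is leading;
    # only alert if no scheduler in the ensemble reports 1.
    if cur.get("framework_registered", 0) < 1:
        print("INFO: this scheduler instance is not registered with mesos")

    # Offers must keep arriving; a flat counter over the window is a bad sign.
    if cur.get("scheduler_resource_offers", 0) <= prev.get("scheduler_resource_offers", 0):
        print("WARN: no new resource offers in the last window")

    # Moving-window log append latency, per the formula in the thread:
    # rate(scheduler_log_native_append_nanos_total) / rate(scheduler_log_native_append_events)
    d_nanos = (cur.get("scheduler_log_native_append_nanos_total", 0)
               - prev.get("scheduler_log_native_append_nanos_total", 0))
    d_events = (cur.get("scheduler_log_native_append_events", 0)
                - prev.get("scheduler_log_native_append_events", 0))
    if d_events > 0:
        print("log append latency: %.2f ms/event" % (d_nanos / d_events / 1e6))


def main():
    prev = fetch_vars()
    while True:
        time.sleep(POLL_INTERVAL_SECS)
        cur = fetch_vars()
        check(prev, cur)
        prev = cur


if __name__ == "__main__":
    main()

In practice you would feed these deltas into whatever alerting system cleared
the false "up" event in the outage described above, rather than printing them,
so that a live process answering on 8081 is no longer treated as healthy on
its own.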