Much appreciated.

On Wed, Oct 1, 2014 at 2:11 PM, Bill Farner <wfar...@apache.org> wrote:

> Ok, when you have bandwidth to upgrade again, feel free to let us know if
> you would like somebody standing by in IRC to assist.
>
> -=Bill
>
> On Wed, Oct 1, 2014 at 11:04 AM, Isaac Councill <is...@hioscar.com> wrote:
>
> > Thanks! Comment dropped on AURORA-634.
> >
> > As for the error I encountered, I saw "Storage is not READY" exceptions
> > on all scheduler instances, and no leader was elected. Nothing other
> > than that jumped out as unusual in the logs - no ZK_* warnings/errors etc.
> >
> > Aurora came up before ZooKeeper, but Aurora polled until ZK was
> > available. Aurora also came up before a Mesos master was available and
> > committed suicide on registration failure. Monit restarted the service
> > eventually, so that shouldn't have been a problem.
> >
> > Sadly, I've had to abandon full diagnosis due to time constraints.
> >
> > On Tue, Sep 30, 2014 at 5:33 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > Firstly, please chime in on AURORA-634 to nudge us to formally document
> > > this.
> > >
> > > There's a wealth of instrumentation exposed at /vars on the scheduler.
> > > To rattle off a few that are a good fit for monitoring (a small polling
> > > sketch tying these together follows the list):
> > >
> > > task_store_LOST
> > > If this value is increasing at a high rate, it's a sign of trouble.
> > > Note: this one is not monotonically increasing; it will decrease when
> > > old terminated tasks are GCed.
> > >
> > > scheduler_resource_offers
> > > Must be increasing; the rate will depend on cluster size and the
> > > behavior of other frameworks.
> > >
> > > jvm_uptime_secs
> > > Detecting resets on this value will tell you that the scheduler is
> > > failing to stay alive.
> > >
> > > framework_registered
> > > If no scheduler reports a '1' for this, then Aurora is not registered
> > > with Mesos.
> > >
> > > rate(scheduler_log_native_append_nanos_total)/rate(scheduler_log_native_append_events)
> > > This gives you a moving window of log append latency. A hike in this
> > > value suggests disk IOP contention.
> > >
> > > timed_out_tasks
> > > An increase in this value indicates that Aurora is moving tasks into
> > > transient states (e.g. ASSIGNED, KILLING), but not hearing back from
> > > Mesos promptly.
> > >
> > > system_load_avg
> > > A high sustained value here suggests that the machine may be
> > > over-utilized.
> > >
> > > http_500_responses_events
> > > An increase here indicates internal server errors while responding to
> > > RPCs and serving the web UI.
> > >
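> > > To tie a few of these together, here's a minimal polling sketch in
> > > Python (an untested illustration, not production code: it assumes
> > > /vars serves plain-text "name value" lines, that the scheduler
> > > listens on port 8081, and the 50 ms latency threshold is a made-up
> > > placeholder you'd tune for your disks):
> > >
> > > import time
> > > import urllib.request
> > >
> > > VARS_URL = "http://localhost:8081/vars"  # assumed scheduler endpoint
> > >
> > > def fetch_vars():
> > >     # Parse the assumed "name value" plain-text format into a dict.
> > >     out = {}
> > >     with urllib.request.urlopen(VARS_URL) as resp:
> > >         for line in resp.read().decode().splitlines():
> > >             name, _, value = line.partition(" ")
> > >             try:
> > >                 out[name] = float(value)
> > >             except ValueError:
> > >                 pass  # skip non-numeric vars
> > >     return out
> > >
> > > prev = fetch_vars()
> > > while True:
> > >     time.sleep(60)
> > >     cur = fetch_vars()
> > >     # rate(append_nanos_total)/rate(append_events) over the 60s window.
> > >     d_nanos = (cur["scheduler_log_native_append_nanos_total"]
> > >                - prev["scheduler_log_native_append_nanos_total"])
> > >     d_events = (cur["scheduler_log_native_append_events"]
> > >                 - prev["scheduler_log_native_append_events"])
> > >     if d_events > 0 and d_nanos / d_events > 50e6:  # > 50 ms per append
> > >         print("WARN: log append latency high; check disk IOPs")
> > >     # jvm_uptime_secs going backwards means a restart since last poll.
> > >     if cur["jvm_uptime_secs"] < prev["jvm_uptime_secs"]:
> > >         print("WARN: scheduler restarted")
> > >     if cur.get("framework_registered", 0) != 1:
> > >         print("WARN: this scheduler is not registered with mesos")
> > >     prev = cur
> > >
> > > In practice you'd feed these checks into whatever alerting system you
> > > already run rather than printing, but the delta arithmetic is the same.
> > >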
> > > I'd love to know more about the specific issue you encountered.  Do the
> > > scheduler logs indicate anything unusual during the period of downtime?
> > >
> > >
> > > -=Bill
> > >
> > > On Tue, Sep 30, 2014 at 1:59 PM, Isaac Councill <is...@hioscar.com> wrote:
> > >
> > > > I've been having a bad time with the great AWS Xen reboot, and thought
> > > > it would be a good time to revamp monitoring, among other things.
> > > >
> > > > Do you have any recommendations for monitoring scheduler health? I've
> > > > got my own ideas, but am more interested in learning about Twitter
> > > > prod monitoring.
> > > >
> > > >
> > > > For context, last night's failure:
> > > >
> > > > Running aurora-scheduler from head, cut last week. I can find the
> > > > exact commit if it's of interest. Triple scheduler replication.
> > > >
> > > > 1) All cluster machines (mesos, aurora, zk) rebooted at once. Single
> > > > AZ for this cluster.
> > > > 2) mesos and zk came back online ok, but aurora did not.
> > > > 3) The scheduler process and UI started, but the scheduler was
> > > > unhealthy. Current monitoring cleared the down event because the
> > > > processes were alive and answering on port 8081.
> > > > 4) Recovery was not possible until I downgraded to 0.5.0-incubating,
> > > > at which point full recovery was made.
> > > >
> > >
> >
>
