I don't have any comment on Greg's specific concerns, but I agree that,
conceptually, distinguishing between states that are likely to resolve
themselves and ones that require intervention would be a nice addition.
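
To make the idea concrete, here is a minimal sketch of the proposed
mapping. The level names, the DISASTER level, and the input format are
assumptions for illustration, not an existing Ceph API; the state counts
could come from something like `ceph -s -f json` (pgmap), though the
exact JSON layout isn't guaranteed here.

```python
# Sketch of Wido's proposed mapping from PG states to health levels.
# Assumptions: the level names (Ceph itself only reports OK/WARN/ERR
# today) and the input dict format. DISASTER would cover whatever
# currently triggers ERR (e.g. mon quorum loss) and cannot be derived
# from PG states alone, so it never comes out of this function.
OK, WARN, ERR, DISASTER = "OK", "WARN", "ERR", "DISASTER"

def health_from_pg_states(pg_state_counts):
    """pg_state_counts: dict like {"active+clean": 100, "peering": 2}."""
    states = [s for s, n in pg_state_counts.items() if n > 0]
    if all(s == "active+clean" for s in states):
        return OK
    if all("active" in s.split("+") for s in states):
        # degraded / recovery_wait / backfilling etc.: still serving I/O
        return WARN
    # One or more PGs are not active, so I/O is blocked on them.
    return ERR
```
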

QH


On Wed, Nov 25, 2015 at 2:46 PM, Gregory Farnum <gfar...@redhat.com> wrote:

> On Wed, Nov 25, 2015 at 11:09 AM, Wido den Hollander <w...@42on.com>
> wrote:
> > Hi,
> >
> > Currently we have OK, WARN and ERR as states for a Ceph cluster.
> >
> > Now, it can happen that while a Ceph cluster is in the WARN state,
> > certain PGs are unavailable because they are peering or in some other
> > non-active+? state.
> >
> > When monitoring a Ceph cluster you usually want to see OK and not worry
> > when a cluster is in WARN.
> >
> > However, with the current situation you need to check whether any PGs
> > are in a non-active state, since that means they are currently not
> > serving any I/O.
> >
> > For example, size is set to 3 and min_size is set to 2. One OSD
> > fails and the cluster starts to recover/backfill. A second OSD fails,
> > which causes certain PGs to become undersized and stop serving I/O.
> >
> > I've seen such situations happen multiple times: VMs were running,
> > a few PGs became non-active, and that effectively brought almost all
> > I/O to a halt.
> >
> > The health stays in WARN, but part of the cluster is not serving I/O.
> >
> > My suggestion would be:
> >
> > OK: All PGs are active+clean and no other issues
> > WARN: All PGs are active+? (degraded, recovery_wait, backfilling, etc)
> > ERR: One or more PGs are not active
> > DISASTER: Anything which currently triggers ERR
> >
> > This way you can monitor for ERR. If the cluster goes into >= ERR you
> > know you have to take action. <= WARN is just something you might
> > want to look into, but not at 03:00 on a Sunday morning.
> >
> > Does this sound reasonable?
>
> It sounds like basically you want a way of distinguishing between
> states that require manual intervention and bad states that are going
> to be repaired on their own. That sounds like a good idea to me, but
> I'm not sure how feasible the specific scheme here is. How long does a
> PG need to be in a non-active state before you shift into alert mode?
> PGs can go through peering for a second or so when a node dies, and
> that will block I/O but probably shouldn't trigger alerts.
> -Greg
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
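
Greg's grace-period point could be handled with a small amount of state
in the monitoring check itself. This is a rough sketch under assumed
names and an assumed 30-second default; nothing here is existing Ceph
behaviour.

```python
import time

class NonActiveAlerter:
    """Only alert once PGs have stayed non-active past a grace period,
    so brief peering after a node death does not page anyone. The class
    name and the 30 s default are illustrative assumptions."""

    def __init__(self, grace_seconds=30.0):
        self.grace = grace_seconds
        self._since = None  # when PGs first went non-active

    def should_alert(self, any_pg_non_active, now=None):
        now = time.monotonic() if now is None else now
        if not any_pg_non_active:
            self._since = None  # recovered; reset the timer
            return False
        if self._since is None:
            self._since = now  # start of the non-active window
        return now - self._since >= self.grace
```

Each polling cycle, the check feeds in whether any PG is currently
non-active; only a condition that persists past the grace window
escalates to an alert.
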