We monitor a few things:
- cluster health (errors only, ignoring warnings, since we have separate checks for the conditions we actually care about)
- if all PGs are active (number of active replicas >= min_size)
- if there are any blocked requests (it's a good indicator, in our case, that some disk is going to fail soon)
- if all monitors are up and in quorum (checking via admin socket)
- if there are any unfound objects
- if there are scrub/deep-scrub errors
- monitor clock skew
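A few of the checks above can be scripted against `ceph status --format json`. Below is a minimal sketch: the field names (`quorum`, `pgmap`, `pgs_by_state`, `unfound_objects`) match Jewel-era output but vary between Ceph releases, and the embedded sample JSON is illustrative, standing in for a real `subprocess` call to `ceph`:

```python
import json

def check_cluster(status_json, expected_mons=3):
    """Evaluate a few basic checks against `ceph status -f json` output."""
    status = json.loads(status_json)

    # All monitors up and in quorum?
    quorum_ok = len(status.get("quorum", [])) == expected_mons

    # All PGs active? (sum PG states whose name contains "active")
    pgmap = status["pgmap"]
    active = sum(s["count"] for s in pgmap["pgs_by_state"]
                 if "active" in s["state_name"])
    pgs_ok = active == pgmap["num_pgs"]

    # Any unfound objects?
    unfound_ok = pgmap.get("unfound_objects", 0) == 0

    return {"quorum": quorum_ok, "pgs_active": pgs_ok, "no_unfound": unfound_ok}

# Illustrative sample, standing in for:
#   raw = subprocess.check_output(["ceph", "status", "--format", "json"])
sample = json.dumps({
    "quorum": [0, 1, 2],
    "pgmap": {
        "num_pgs": 2048,
        "pgs_by_state": [{"state_name": "active+clean", "count": 2048}],
        "unfound_objects": 0,
    },
})

print(check_cluster(sample))  # → {'quorum': True, 'pgs_active': True, 'no_unfound': True}
```

Each boolean maps naturally onto a Nagios-style OK/CRITICAL check.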


On 13.01.2017 21:35, David Turner wrote:
We don't currently monitor that, but my todo list has an item to monitor for blocked requests longer than 500 seconds and to alert critical on them. You can see how long they've been blocked from `ceph health detail`. Our cluster doesn't need to be super fast at any given point, but it does need to be progressing.
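On Jewel-era clusters the per-OSD detail lines look like `N ops are blocked > 536.871 sec on osd.X`, so a 500-second check could be a small parser over that text. A sketch (the sample excerpt below is illustrative, not real cluster output, and the exact wording of the health lines varies between Ceph releases):

```python
import re

# Matches lines like: "100 ops are blocked > 536.871 sec on osd.9"
BLOCKED_RE = re.compile(r"(\d+) ops are blocked > ([\d.]+) sec on (osd\.\d+)")

def blocked_over(health_detail, threshold=500.0):
    """Return (osd, op_count, seconds) for ops blocked longer than threshold."""
    hits = []
    for count, secs, osd in BLOCKED_RE.findall(health_detail):
        if float(secs) > threshold:
            hits.append((osd, int(count), float(secs)))
    return hits

# Illustrative `ceph health detail` excerpt
sample = """\
HEALTH_WARN 130 requests are blocked > 32 sec; 2 osds have slow requests
100 ops are blocked > 536.871 sec on osd.9
30 ops are blocked > 65.536 sec on osd.21
"""

critical = blocked_over(sample, threshold=500.0)
if critical:
    print("CRITICAL:", critical)  # only osd.9 exceeds 500 sec
```

Anything the parser returns would page; the 32-second warnings are left to a lower-severity check.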
------------------------------------------------------------------------

David Turner | Cloud Operations Engineer | StorageCraft Technology Corporation <https://storagecraft.com>
380 Data Drive Suite 300 | Draper | Utah | 84020
Office: 801.871.2760| Mobile: 385.224.2943

------------------------------------------------------------------------

If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited.

------------------------------------------------------------------------
------------------------------------------------------------------------
*From:* Chris Jones [cjo...@cloudm2.com]
*Sent:* Friday, January 13, 2017 1:31 PM
*To:* David Turner
*Cc:* ceph-us...@ceph.com
*Subject:* Re: [ceph-users] Ceph Monitoring

Thanks.

What about 'NN ops > 32 sec' (blocked ops) type alerts? Does anyone monitor for those, and if so, what criteria do you use?

Thanks again!

On Fri, Jan 13, 2017 at 3:28 PM, David Turner <david.tur...@storagecraft.com <mailto:david.tur...@storagecraft.com>> wrote:

    We don't use many critical alerts (ones that will have our NOC wake
    up an engineer), but the main one we do have is a check that tells
    us if there are 2 or more hosts with osds that are down.  We have
    clusters with 60 servers in them, so having an osd die and be
    backfilled off of isn't something to wake up for in the middle of
    the night, but having osds down on 2 servers is 1 osd away from
    data loss.  A quick reference for how to do this check in bash is
    below.

    # Count hosts in `ceph osd tree` that have at least one osd marked down.
    hosts_with_down_osds=$(ceph osd tree | grep 'host\|down' | grep -B1 down | grep -c host)
    if [ "$hosts_with_down_osds" -ge 2 ]
    then
        echo critical
    elif [ "$hosts_with_down_osds" -eq 1 ]
    then
        echo warning
    elif [ "$hosts_with_down_osds" -eq 0 ]
    then
        echo ok
    else
        echo unknown
    fi
    ------------------------------------------------------------------------
    ------------------------------------------------------------------------
    *From:* ceph-users [ceph-users-boun...@lists.ceph.com
    <mailto:ceph-users-boun...@lists.ceph.com>] on behalf of Chris
    Jones [cjo...@cloudm2.com <mailto:cjo...@cloudm2.com>]
    *Sent:* Friday, January 13, 2017 1:15 PM
    *To:* ceph-us...@ceph.com <mailto:ceph-us...@ceph.com>
    *Subject:* [ceph-users] Ceph Monitoring

    General question/survey:

    For those who have larger clusters: how are you doing
    alerting/monitoring? Do you trigger off of 'HEALTH_WARN', etc.? I'm
    not really asking about collectd-style metrics, but about initial
    alerts of an issue or potential issue. What thresholds do you use,
    basically? Just trying to get a pulse of what others are doing.

    Thanks in advance.

    --
    Best Regards,
    Chris Jones
    Bloomberg






--
Best Regards,
Chris Jones

cjo...@cloudm2.com <mailto:cjo...@cloudm2.com>
(p) 770.655.0770



_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
PS
