Hi Xavier,

We have had OSDs backed by Samsung SSD 960 PRO 512GB NVMes which started
generating slow requests.

After running:

ceph osd tree up | grep nvme | awk '{print $4}' \
  | xargs -P 10 -I _OSD sh -c '
      BPS=$(ceph tell _OSD bench | jq -r .bytes_per_sec);
      MBPS=$(echo "scale=2; $BPS/1000000" | bc -l);
      echo _OSD $MBPS MB/s' \
  | sort -n -k 2 | column -t
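
For a quick spot check of a single suspect OSD, the same bench call can be
run directly (osd.12 below is just a placeholder):

  ceph tell osd.12 bench | jq -r .bytes_per_sec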

I noticed that the data rate had dropped significantly on some of my NVMes
(some were down from ~1000 MB/s to ~300 MB/s). This pointed me to the fact
that the NVMes were not behaving as expected.
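
In case it is useful for comparing notes, drive wear and error counters can
be checked with something like the following (assuming nvme-cli is
installed; the device path is only an example). percentage_used gives a
rough idea of drive wear:

  nvme smart-log /dev/nvme0n1 | grep -iE 'critical_warning|percentage_used|media_errors'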

I thought it might be worth asking if you are perhaps seeing something similar.

Cheers,
Tom

On Wed, Jul 24, 2019 at 6:39 PM Xavier Trilla <xavier.tri...@clouding.io>
wrote:

> Hi,
>
>
>
> We had a strange issue while adding a new OSD to our Ceph Luminous 12.2.8
> cluster. Our cluster has > 300 OSDs based on SSDs and NVMes.
>
>
>
> After adding a new OSD to the Ceph cluster, one of the already running
> OSDs started to give us slow query warnings.
>
>
>
> We checked the OSD and it was working properly: nothing strange in the
> logs, and it also had disk activity. It looks like it stopped serving
> requests for just one PG.
>
>
>
> Requests were just piling up, and the number of slow queries kept growing
> constantly until we restarted the OSD (all our OSDs are running BlueStore).
>
>
>
> We’ve been checking out everything in our setup, and everything is
> properly configured (this cluster has been running for >5 years and hosts
> several thousand VMs).
>
>
>
> Beyond finding the real source of the issue (I guess I’ll have to add
> more OSDs, and if it happens again I can just dump the stats of the OSD
> with "ceph daemon osd.X dump_historic_slow_ops"), what I would like to
> find is a way to protect the cluster from this kind of issue.
>
>
>
> I mean, in some scenarios OSDs just suicide (actually, I fixed the issue
> just by restarting the offending OSD), but how can we deal with this kind
> of situation? I’ve been checking around, but I could not find anything
> (obviously we could set our monitoring software to restart any OSD which
> has more than N slow queries, but I find that a little bit too aggressive).
>
>
>
> Is there anything built into Ceph to deal with these situations? An OSD
> blocking queries in an RBD scenario is a big deal, as plenty of VMs will
> have disk timeouts, which can lead to the VMs just panicking.
>
>
>
> Thanks!
>
> Xavier
>
>


-- 
Thomas Bennett

Storage Engineer at SARAO
