Hi Xavier,

We have had OSDs backed by Samsung SSD 960 PRO 512GB NVMes which started generating slow requests.
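As a side note, one quick way to rule out drive-level degradation first is to look at the drive's SMART counters. A minimal sketch, assuming nvme-cli is installed on the OSD host (the device name below is just a placeholder):

  # Placeholder device name; use the NVMe backing the slow OSD.
  nvme smart-log /dev/nvme0

High media_errors or percentage_used values there would point at the hardware itself rather than at Ceph.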
After running the following to benchmark every NVMe-backed OSD in parallel (wrapped here for readability):

  ceph osd tree up | grep nvme | awk '{print $4}' |
    xargs -P 10 -I _OSD sh -c 'BPS=$(ceph tell _OSD bench | jq -r .bytes_per_sec); MBPS=$(echo "scale=2; $BPS/1000000" | bc -l); echo _OSD $MBPS MB/s' |
    sort -n -k 2 | column -t

I noticed that the data rate had dropped significantly on some of my NVMes (some were down from ~1000 MB/s to ~300 MB/s). This pointed me to the fact that the NVMes were not behaving as expected.

I thought it might be worth asking if you are perhaps seeing something similar.

Cheers,
Tom

On Wed, Jul 24, 2019 at 6:39 PM Xavier Trilla <xavier.tri...@clouding.io> wrote:
> Hi,
>
> We had a strange issue while adding a new OSD to our Ceph Luminous 12.2.8
> cluster. Our cluster has >300 OSDs based on SSDs and NVMe.
>
> After adding a new OSD to the Ceph cluster, one of the already running
> OSDs started to give us slow query warnings.
>
> We checked the OSD and it was working properly: nothing strange in the
> logs, and it also had disk activity. It looks like it stopped serving
> requests for just one PG.
>
> Requests were just piling up, and the number of slow queries kept growing
> until we restarted the OSD (all our OSDs are running BlueStore).
>
> We've been checking everything in our setup, and everything is properly
> configured (this cluster has been running for >5 years and it hosts
> several thousand VMs).
>
> Beyond finding the real source of the issue (I guess I'll have to add
> more OSDs, and if it happens again I can dump the stats of the OSD with
> ceph daemon osd.X dump_historic_slow_ops), what I would like to find is a
> way to protect the cluster from this kind of issue.
>
> I mean, in some scenarios OSDs just commit suicide (I actually fixed the
> issue just by restarting the offending OSD), but how can we deal with
> this kind of situation? I've been checking around, but I could not find
> anything. (Obviously we could set our monitoring software to restart any
> OSD which has more than N slow queries, but I find that a little bit too
> aggressive.)
>
> Is there anything built into Ceph to deal with these situations? An OSD
> blocking queries in an RBD scenario is a big deal, as plenty of VMs will
> have disk timeouts, which can lead to the VMs just panicking.
>
> Thanks!
> Xavier

--
Thomas Bennett
Storage Engineer at SARAO
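P.S. Regarding ceph daemon osd.X dump_historic_slow_ops: a rough sketch of how its output could be narrowed down, assuming jq is available (osd.12 is just a placeholder ID):

  # Run on the host that carries the OSD; the admin socket is local to it.
  ceph daemon osd.12 dump_historic_slow_ops | jq '.ops[] | {description, duration}'

That lists the slowest recent ops with their descriptions and durations, which should show quickly whether they all land on a single PG. And as a possibly gentler alternative to a full restart, ceph osd down osd.12 marks the OSD down and forces re-peering without killing the process, which might be enough to unstick a single PG.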