Hi,

We had a strange issue while adding a new OSD to our Ceph Luminous 12.2.8 
cluster. The cluster has > 300 OSDs based on SSDs and NVMe.

After adding a new OSD to the cluster, one of the already-running OSDs 
started giving us slow query warnings.

We checked the OSD and it was working properly: nothing strange in the logs, 
and it still showed disk activity. It looks like it stopped serving requests 
for just one PG.

Requests were just piling up, and the number of slow queries kept growing 
constantly until we restarted the OSD (all our OSDs are running BlueStore).

We've checked everything in our setup, and everything is properly 
configured (this cluster has been running for >5 years and hosts several 
thousand VMs).

Beyond finding the real source of the issue - I guess I'll have to add more 
OSDs, and if it happens again I can just dump the stats of the OSD (ceph 
daemon osd.X dump_historic_slow_ops) - what I would like to find is a way to 
protect the cluster from this kind of issue.
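For reference, this is roughly what I plan to do to capture the data next 
time it happens - just a quick Python sketch around the same admin socket 
command; the OSD id and output directory are placeholders, not anything I'm 
running in production:

    # Rough sketch: save the historic slow ops of one OSD to a timestamped
    # JSON file so the data survives the restart of the daemon.
    import json
    import subprocess
    import sys
    import time

    def dump_slow_ops(osd_id, out_dir="/var/log/ceph"):
        # Same command as above, run through the ceph CLI admin socket wrapper.
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "dump_historic_slow_ops"])
        ops = json.loads(out)
        path = "%s/osd.%d-slow-ops-%d.json" % (out_dir, osd_id, int(time.time()))
        with open(path, "w") as f:
            json.dump(ops, f, indent=2)
        return path

    if __name__ == "__main__":
        print(dump_slow_ops(int(sys.argv[1])))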

I mean, in some scenarios OSDs just suicide - here I actually fixed the issue 
just by restarting the offending OSD - but how can we deal with this kind of 
situation? I've been looking around, but I could not find anything. (Obviously 
we could set our monitoring software to restart any OSD that has more than N 
slow queries, but I find that a little bit too aggressive.)
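To be clear about what "less aggressive" would mean for us: something like the 
rough watchdog sketch below, which only restarts an OSD after its in-flight 
ops have been stuck for several consecutive checks. The thresholds, check 
interval and restart command are placeholders, not a recommendation:

    # Rough watchdog sketch: restart an OSD only if it keeps ops stuck in
    # flight across several consecutive checks. All values are placeholders.
    import json
    import subprocess
    import time

    OSD_ID = 42            # placeholder OSD id
    STUCK_AGE = 60.0       # seconds an op must be in flight to count as stuck
    BAD_CHECKS = 5         # consecutive bad checks before restarting
    INTERVAL = 30          # seconds between checks

    def stuck_ops(osd_id):
        out = subprocess.check_output(
            ["ceph", "daemon", "osd.%d" % osd_id, "dump_ops_in_flight"])
        ops = json.loads(out).get("ops", [])
        return [op for op in ops if float(op.get("age", 0)) > STUCK_AGE]

    def main():
        bad = 0
        while True:
            bad = bad + 1 if stuck_ops(OSD_ID) else 0
            if bad >= BAD_CHECKS:
                # Last resort: bounce the daemon, since a manual restart
                # is what fixed it for us this time.
                subprocess.call(["systemctl", "restart", "ceph-osd@%d" % OSD_ID])
                bad = 0
            time.sleep(INTERVAL)

    if __name__ == "__main__":
        main()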

Is there anything built into Ceph to deal with these situations? An OSD 
blocking queries in an RBD scenario is a big deal, as plenty of VMs will hit 
disk timeouts, which can lead to the VMs just panicking.

Thanks!
Xavier
