Hi all

I have a smallish test cluster (14 servers, 84 OSDs) running 14.2.4.  Monthly 
OS patching, and the reboots that go along with it, has left the cluster 
very unwell.

Many of the servers in the cluster are OOM-killing the ceph-osd processes when 
they try to start (6 OSDs per server, all on Filestore).  Strace shows the 
ceph-osd processes spending hours reading through the 220k osdmap files after 
being started.
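
For reference, this is roughly how one can count the osdmap objects a Filestore 
OSD is holding.  It's only a sketch: the OSD id and the 
/var/lib/ceph/osd/ceph-<id>/current/meta path are assumptions about the usual 
Filestore layout, so adjust for your environment.

#!/usr/bin/env python3
# Rough count of osdmap objects held by a Filestore OSD.
# OSD_ID and META_DIR are placeholders for the usual Filestore layout;
# adjust them for your deployment.
import os

OSD_ID = 0  # hypothetical OSD id; change as needed
META_DIR = f"/var/lib/ceph/osd/ceph-{OSD_ID}/current/meta"

count = 0
for _root, _dirs, files in os.walk(META_DIR):
    # Full osdmaps and incrementals both carry "osdmap" in the object name.
    count += sum(1 for name in files if "osdmap" in name)

print(f"osd.{OSD_ID}: {count} osdmap objects under {META_DIR}")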

This behavior started after we recently filled the cluster to about 72% to see 
how things behaved.  We also upgraded it to Nautilus 14.2.2 at about the same 
time.

I’ve tried starting just one OSD per server at a time in hopes of avoiding the 
OOM killer.  I’ve also tried setting noin, rebooting the whole cluster, waiting 
a day, and then marking each of the OSDs in manually.  The end result is the 
same either way: about 60% of PGs are still down, 30% are peering, and the rest 
are in worse shape.
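
In case it matters, this is roughly the noin / manual mark-in sequence I mean.  
It's only a sketch using the stock ceph CLI; the OSD id range and the pause 
between OSDs are placeholders, not what we literally ran.

#!/usr/bin/env python3
# Sketch of the "set noin, then mark OSDs in one at a time" sequence.
# Uses the stock ceph CLI via subprocess; id range and pause are placeholders.
import subprocess
import time

OSD_IDS = range(0, 84)   # all 84 OSDs; adjust to your cluster
PAUSE_SECONDS = 60       # arbitrary pause to let peering settle between OSDs

# Keep restarted OSDs from being marked in automatically.
subprocess.run(["ceph", "osd", "set", "noin"], check=True)

for osd_id in OSD_IDS:
    subprocess.run(["ceph", "osd", "in", str(osd_id)], check=True)
    time.sleep(PAUSE_SECONDS)

# Clear the flag once everything is back in.
subprocess.run(["ceph", "osd", "unset", "noin"], check=True)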

Anyone out there have suggestions about how I should go about getting this 
cluster healthy again?  Any ideas appreciated.

Thanks!

- Aaron