Hi,

We recently had a problem that took 3 out of 32 OSD hosts offline for 
about 10 minutes.  The hosts are now back in the cluster as expected and 
backfilling is in progress.  However, we are seeing a couple of problems.

We are seeing:

1. Ceph is flagging a handful of PGs as backfill_toofull when they aren't 
   (the sketch below shows how I list the flagged PGs and their OSDs): 
   https://tracker.ceph.com/issues/61839
2. Periodically the cluster generates an OSD_FULL error.
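
For reference, this is roughly how I enumerate the flagged PGs and the OSDs 
they map to.  It is only a sketch: it shells out to the ceph CLI and assumes 
the Reef-era JSON layout of "ceph pg ls" (field names such as pg_stats, pgid, 
up and acting are from memory and may differ between releases).

#!/usr/bin/env python3
# Sketch: list PGs currently flagged backfill_toofull and the OSDs in
# their up/acting sets, so the "toofull" claim can be checked by hand.
import json
import subprocess

def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

# Field names below assume the Reef JSON layout and may vary by release.
for pg in ceph_json("pg", "ls", "backfill_toofull").get("pg_stats", []):
    print(f"{pg['pgid']}: up={pg['up']} acting={pg['acting']}")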

Are there any plans to look at resolving bug #61839?

When an OSD_FULL error has been reported, the OSD in question has been at less 
than 75% usage, and that OSD wasn't used by any of the PGs reporting 
backfill_toofull.  I currently have full_ratio set to 0.97 and nearfull_ratio 
set to 0.87, so the OSDs are nowhere near those levels.  The raw usage of 
individual OSDs ranges from roughly 60% to 80%, and the raw usage of the 
cluster as a whole is about 75%.
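
This is a minimal sketch of the check I run, assuming "ceph osd df" reports a 
"nodes" list with a "utilization" percentage and that "ceph osd dump" exposes 
"nearfull_ratio" and "full_ratio"; those field names are from memory and may 
vary slightly between releases.

#!/usr/bin/env python3
# Sketch: flag any OSD whose utilization is at or above the configured
# nearfull/full ratios.  In our cluster nothing is flagged, yet OSD_FULL
# is still raised during backfill after a host failure.
import json
import subprocess

def ceph_json(*args):
    """Run a ceph CLI command and parse its JSON output."""
    out = subprocess.check_output(["ceph", *args, "--format", "json"])
    return json.loads(out)

osdmap = ceph_json("osd", "dump")
nearfull = osdmap["nearfull_ratio"] * 100    # 87.0 in our case
full = osdmap["full_ratio"] * 100            # 97.0 in our case

for osd in ceph_json("osd", "df")["nodes"]:
    util = osd["utilization"]                # percent of raw space used
    if util >= nearfull:
        level = "FULL" if util >= full else "nearfull"
        print(f"osd.{osd['id']}: {util:.1f}% ({level})")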

We do not get any "near full" warnings before OSD_FULL is set.  Having a 
production system instantly go offline without warning isn't ideal, and these 
things always seem to happen at the least convenient moment.

These problems only happen after a host failure.  Each time we have added 
additional OSD hosts into the cluster, the backfilling has finished without 
problems. 

We are currently running Reef 18.2.4, but I have experienced the same problems 
on Pacific 16.2.10.

Has anyone else seen this behaviour?

Thanks
Gerard