[ceph-users] OSD failed: still recovering

Alan Murrell Sun, 23 Mar 2025 21:58:10 -0700

Hello,

We had a drive (OSD) fail in our 5-node cluster three days ago (late 
afternoon of Mar 20).  The PGs have sorted themselves out, but the cluster 
has been recovering with backfill since then.  Every time I run 'ceph -s' it 
shows a little over 5% misplaced objects, with several PGs in backfill_wait and 
some scrubbing.


What is sort of weird is that if I run 'ceph -s' a few times in a row, I 
can see the percentage of misplaced objects go down a bit, but if I then 
leave it for a while and run 'ceph -s' again, it is still just over 5% 
misplaced objects and has typically increased slightly.

For example, it might be 5.364% when I check it, and after checking it 
several times in a row it might go down to 5.276%, but if I check it again 
after a few hours, it might be something like 5.478% (so still in the 5% range 
but slightly increased from the last check).
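
To put numbers on that, here is a minimal Python sketch of how the readings 
could be compared over time rather than by eye.  It is only an illustration: 
the function names are hypothetical, the sample times (0, 0.1 and 3 hours) are 
assumed stand-ins for "several times in a row" and "a few hours", and it 
assumes that 'ceph -s --format json' reports the misplaced fraction as 
pgmap.misplaced_ratio, which should be checked against the actual output on 
your release.

```python
import json


def misplaced_percent(status_json: str) -> float:
    """Extract the misplaced-object percentage from a JSON status dump.

    Assumption: the dump has a "pgmap" object with a "misplaced_ratio"
    field holding a fraction (e.g. 0.05364 for 5.364%). If the field is
    absent, this returns 0.0.
    """
    pgmap = json.loads(status_json).get("pgmap", {})
    return 100.0 * float(pgmap.get("misplaced_ratio", 0.0))


def net_change_per_hour(samples):
    """Percentage-point change per hour between the first and last sample.

    `samples` is a list of (hours_since_first_check, misplaced_percent).
    A positive result means the misplaced percentage grew overall.
    """
    (t0, p0), (t1, p1) = samples[0], samples[-1]
    return (p1 - p0) / (t1 - t0)


# Percentages from the example above; the times are assumed, not measured.
samples = [(0.0, 5.364), (0.1, 5.276), (3.0, 5.478)]
rate = net_change_per_hour(samples)
```

With these assumed times the net change is positive, which matches what I am 
seeing: short-term dips, but no overall progress across hours.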

The cluster is on 10Gbit, and I have increased the max_backfills to 4 while the 
recovery runs, but it just doesn't seem to be making much progress.

I know the failed drive needs to be replaced, but I think it is recommended to 
wait until the cluster is finished recovering?

Your thoughts/advice (as usual) are greatly appreciated.

Sent from my mobile device.  Please excuse brevity and ttpos.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
