Hello, We had a drive (OSD) fail in our 5-node cluster three days ago (late afternoon of Mar 20). The PGs have sorted themselves out, but the cluster has been recovering with backfill since then. Every time I run 'ceph -s' it shows a little over 5% misplaced objects, with several PGs in backfill_wait and some scrubbing.
What is odd is that if I run 'ceph -s' a few times in a row, I can see the percentage of misplaced objects go down a bit, but if I leave it for a while and run 'ceph -s' again, it is still just over 5% and has typically increased slightly. For example, it might be 5.364% when I first check, drop to 5.276% after I check several times in a row, and then be something like 5.478% when I check again a few hours later (still in the 5% range, but slightly higher than the last check).

The cluster is on 10Gbit, and I have increased max_backfills to 4 while the recovery runs, but it just doesn't seem to be making much progress.

I know the failed drive needs to be replaced, but I think it is recommended to wait until the cluster has finished recovering?

Your thoughts/advice (as usual) are greatly appreciated.
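To tell whether the misplaced figure is really drifting up or down, the readings could be logged instead of eyeballed. Below is a minimal Python sketch of that idea. It assumes the plain-text 'ceph -s' output contains a line of the form "N/M objects misplaced (X%)", which may differ between Ceph releases. The function names and the object counts in the sample are made up for illustration; only the percentages come from the checks described above.

```python
import re
from typing import List, Optional

# Assumed format of the relevant line in plain-text `ceph -s` output, e.g.
#   "52341/975532 objects misplaced (5.364%)"
# The exact wording may vary by Ceph release; adjust the pattern if needed.
_MISPLACED = re.compile(r"(\d+)/(\d+) objects misplaced \((\d+(?:\.\d+)?)%\)")


def parse_misplaced(status_text: str) -> Optional[float]:
    """Return the misplaced percentage from `ceph -s` text, or None if absent."""
    match = _MISPLACED.search(status_text)
    return float(match.group(3)) if match else None


def net_change(readings: List[float]) -> Optional[float]:
    """Difference between the last and first reading, in percentage points."""
    if len(readings) < 2:
        return None
    return readings[-1] - readings[0]


if __name__ == "__main__":
    # Hypothetical sample line; the object counts are invented.
    sample = "    52341/975532 objects misplaced (5.364%)\n"
    print(parse_misplaced(sample))

    # The three percentages quoted in this message, in the order observed.
    observed = [5.364, 5.276, 5.478]
    print(f"net change: {net_change(observed):+.3f} points")
```

Feeding this the captured output of 'ceph -s' at a fixed interval (for example from cron) would give a timestamped series, which shows the overall direction more reliably than a handful of manual checks.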