Hi Alan,

Just to share our experience: we have a similar cluster that was in a very similar state after a disk failure: 6 nodes, 168 OSDs, 12 inconsistent PGs, and a little over 5% misplaced objects. It took a good while to sort itself out, but eventually it did. Repairing the PGs with `ceph pg repair <pgid>` seemed to do nothing at first, but the repairs did eventually go through.

Changing the mClock profile to high_recovery_ops had a visible impact on recovery speed, though client I/O performance was very poor during recovery (we didn't mind). YMMV.
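In case it helps, the rough sequence we used was something like the following (the pool name "rbd" is just an example; substitute your own pool names and PG IDs):

# list the inconsistent PGs in a pool
rados list-inconsistent-pg rbd

# see which objects in a PG are affected, and why
rados list-inconsistent-obj <pgid> --format=json-pretty

# queue a repair for a PG
ceph pg repair <pgid>

# prioritise recovery over client I/O while backfill runs
ceph config set osd osd_mclock_profile high_recovery_ops

# switch back to the default once the cluster is healthy again
ceph config set osd osd_mclock_profile balanced

Keep in mind that `ceph pg repair` is queued like a scrub, so it can sit for a while before it actually runs; that is probably why it looked to us like nothing was happening at first.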
- Gustavo

________________________________
From: Alan Murrell
Sent: Monday, March 24, 2025 8:12 AM
To: 'ceph-users'
Subject: [ceph-users] Re: OSD failed: still recovering

Hello,

Thanks for the response.

OK, good to know about the 5% misplaced objects report 😊 I just checked 'ceph -s' and the misplaced objects figure is showing 1.948%, but I suspect I will see it go up to 5% or so later on 😊

It does finally look like there is progress being made, as my "active+clean" count is currently 292 (out of a total of 321 PGs), whereas before it wasn't seeming to progress beyond 287 or so. This is the result:

--- START ---
root@cephnode01:/# ceph -s
  cluster:
    id:     474264fe-b00e-11ee-b586-ac1f6b0ff21a
    health: HEALTH_ERR
            1 failed cephadm daemon(s)
            622 scrub errors
            Possible data damage: 8 pgs inconsistent

  services:
    mon:         5 daemons, quorum cephnode01,cephnode03,cephnode02,cephnode04,cephnode05 (age 6w)
    mgr:         cephnode01.kefvmh(active, since 6w), standbys: cephnode03.clxwlu
    osd:         40 osds: 39 up (since 3d), 39 in (since 3d); 19 remapped pgs
    tcmu-runner: 1 portal active (1 hosts)

  data:
    pools:   4 pools, 321 pgs
    objects: 9.70M objects, 37 TiB
    usage:   117 TiB used, 234 TiB / 350 TiB avail
    pgs:     566646/29086680 objects misplaced (1.948%)
             292 active+clean
             13  active+remapped+backfilling
             7   active+clean+inconsistent
             5   active+remapped+backfill_wait
             2   active+clean+scrubbing
             1   active+clean+scrubbing+deep
             1   active+remapped+inconsistent+backfilling

  io:
    client:   30 MiB/s rd, 2.5 MiB/s wr, 187 op/s rd, 240 op/s wr
    recovery: 189 MiB/s, 48 objects/s
--- END ---

I think the main numbers I need to keep an eye on are the ones that are "backfilling"? The "scrub" ones are just the normal scrubs that are going on?

I am waiting for a full "active+clean" before dealing with the scrub errors (which are a result of the failed OSD).

My "TiB used" keeps going up as well; is that because of my lost HDD (the HDDs are 16TB)?
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io