Hi Alan,

Just to share our experience: we have a similar cluster that was in a very similar state after a disk failure: 6 nodes, 168 OSDs, 12 inconsistent PGs, and a little over 5% misplaced objects. It took a good while to sort itself out, but eventually it did. Repairing the PGs with `ceph pg repair <pgid>` seemed to do nothing at first, but the repairs did eventually go through.

Changing the mClock profile to high_recovery_ops had a visible impact on recovery speed, though client I/O performance was very poor during recovery (we didn't mind). YMMV.
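In case it helps, the rough sequence we used was something like the following (the pool name "rbd" is just an example; substitute your own pool names and PG IDs):

# list the inconsistent PGs in a pool
rados list-inconsistent-pg rbd

# see which objects in a PG are affected, and why
rados list-inconsistent-obj <pgid> --format=json-pretty

# queue a repair for a PG
ceph pg repair <pgid>

# prioritise recovery over client I/O while backfill runs
ceph config set osd osd_mclock_profile high_recovery_ops

# switch back to the default once the cluster is healthy again
ceph config set osd osd_mclock_profile balanced

Keep in mind that `ceph pg repair` is queued like a scrub, so it can sit for a while before it actually runs; that is probably why it looked to us like nothing was happening at first.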
- Gustavo

________________________________
From: Alan Murrell
Sent: Monday, March 24, 2025 8:12 AM
To: 'ceph-users'
Subject: [ceph-users] Re: OSD failed: still recovering

Hello,

Thanks for the response.

OK, good to know about the 5% misplaced objects report 😊 I just checked 'ceph -s' and the misplaced objects figure is showing 1.948%, but I suspect I will see it go up to 5% or so later on 😊

It does finally look like there is progress being made, as my "active+clean" count is currently 292 (out of a total of 321 PGs), whereas before it wasn't seeming to progress beyond 287 or so. This is the result:

--- START ---
root@cephnode01:/# ceph -s
  cluster:
    id:     474264fe-b00e-11ee-b586-ac1f6b0ff21a
    health: HEALTH_ERR
            1 failed cephadm daemon(s)
            622 scrub errors
            Possible data damage: 8 pgs inconsistent

  services:
    mon:         5 daemons, quorum cephnode01,cephnode03,cephnode02,cephnode04,cephnode05 (age 6w)
    mgr:         cephnode01.kefvmh(active, since 6w), standbys: cephnode03.clxwlu
    osd:         40 osds: 39 up (since 3d), 39 in (since 3d); 19 remapped pgs
    tcmu-runner: 1 portal active (1 hosts)

  data:
    pools:   4 pools, 321 pgs
    objects: 9.70M objects, 37 TiB
    usage:   117 TiB used, 234 TiB / 350 TiB avail
    pgs:     566646/29086680 objects misplaced (1.948%)
             292 active+clean
             13  active+remapped+backfilling
             7   active+clean+inconsistent
             5   active+remapped+backfill_wait
             2   active+clean+scrubbing
             1   active+clean+scrubbing+deep
             1   active+remapped+inconsistent+backfilling

  io:
    client:   30 MiB/s rd, 2.5 MiB/s wr, 187 op/s rd, 240 op/s wr
    recovery: 189 MiB/s, 48 objects/s
--- END ---

I think the main numbers I need to keep an eye on are the ones that are "backfilling"? The "scrub" ones are just the normal scrubs that are going on?

I am waiting for a full "active+clean" before dealing with the scrub errors (which are a result of the failed OSD).

My "TiB used" keeps going up as well; is that because of my lost HDD (the HDDs are 16TB)?
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io