Hi,

can you share more details? Which OSD are you trying to take out, the primary osd.3?
Can you also share the output of 'ceph osd df'?
It looks like a replicated pool with size 3; can you confirm with 'ceph osd pool ls detail'?
Do you have logs from the crashing OSDs from when you take osd.3 out?
Which Ceph version is this?
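
Something like the following should cover most of that (just a
sketch, adjust it to your deployment; where the OSD logs live depends
on whether it's a package-based or a cephadm/containerized install):

  ceph osd df tree
  ceph osd pool ls detail
  ceph versions
  # OSD log on the node hosting osd.3, package-based install:
  journalctl -u ceph-osd@3 --no-pager
  # or on a cephadm / containerized deployment:
  cephadm logs --name osd.3
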
Thanks,
Eugen

Quoting Romain Lebbadi-Breteau <romain.lebbadi-bret...@polymtl.ca>:

Hi,

We're a student club from Montréal hosting an OpenStack cloud with a
Ceph backend for storage of virtual machines and volumes using RBD.
Two weeks ago we received an email from our Ceph cluster saying that
some PGs were damaged. We ran "sudo ceph pg repair <pg-id>", but then
there was an I/O error on the disk during the recovery ("An
unrecoverable disk media error occurred on Disk 4 in Backplane 1 of
Integrated RAID Controller 1." and "Bad block medium error is
detected at block 0x1377e2ad on Virtual Disk 3 on Integrated RAID
Controller 1." messages on iDRAC).
After that, the PG we tried to repair was in the state
"active+recovery_unfound+degraded". After a week, we ran "sudo ceph
pg 2.1b mark_unfound_lost revert" to try to recover the damaged PG.
We tried to boot the virtual machine that had crashed because of this
incident, but the volume seemed to have been completely erased; the
"mount" command said there was no filesystem on it, so we recreated
the VM from a backup.
A few days later, the same PG was damaged again, and since we knew
the physical disk behind the OSD hosting one copy of the PG had
problems, we tried to "out" that OSD from the cluster. That caused
the two other OSDs hosting copies of the problematic PG to go down,
which caused timeouts on our virtual machines, so we put the OSD back
in.
We then tried to repair the PG again, but that failed and the PG is
now "active+clean+inconsistent+failed_repair". Whenever that OSD goes
down, two other OSDs on two other hosts go down too after a few
minutes, so it's impossible to replace the disk right now, even
though we have new ones available.
We have backups for most of our services, but it would be very
disruptive to delete the whole cluster, and we don't know what to do
with the broken PG and the OSD that can't be shut down.
Any help would be really appreciated. We're not experts with Ceph and
OpenStack, and it's likely we handled things wrong at some point, but
we really want to get back to a healthy Ceph.
Here is some information about our cluster:

romain:step@alpha-cen ~  $ sudo ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
[ERR] OSD_SCRUB_ERRORS: 1 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg inconsistent
    pg 2.1b is active+clean+inconsistent+failed_repair, acting [3,11,0]

romain:step@alpha-cen ~  $ sudo ceph osd tree
ID  CLASS  WEIGHT    TYPE NAME           STATUS  REWEIGHT  PRI-AFF
-1         70.94226  root default
-7         20.00792      host alpha-cen
 3    hdd   1.81879          osd.3           up   1.00000  1.00000
 6    hdd   1.81879          osd.6           up   1.00000  1.00000
12    hdd   1.81879          osd.12          up   1.00000  1.00000
13    hdd   1.81879          osd.13          up   1.00000  1.00000
15    hdd   1.81879          osd.15          up   1.00000  1.00000
16    hdd   9.09520          osd.16          up   1.00000  1.00000
17    hdd   1.81879          osd.17          up   1.00000  1.00000
-5         23.64874      host beta-cen
 1    hdd   5.45749          osd.1           up   1.00000  1.00000
 4    hdd   5.45749          osd.4           up   1.00000  1.00000
 8    hdd   5.45749          osd.8           up   1.00000  1.00000
11    hdd   5.45749          osd.11          up   1.00000  1.00000
14    hdd   1.81879          osd.14          up   1.00000  1.00000
-3         27.28560      host gamma-cen
 0    hdd   9.09520          osd.0           up   1.00000  1.00000
 5    hdd   9.09520          osd.5           up   1.00000  1.00000
 9    hdd   9.09520          osd.9           up   1.00000  1.00000

romain:step@alpha-cen ~  $ sudo rados list-inconsistent-obj 2.1b
{"epoch":9787,"inconsistents":[]}

romain:step@alpha-cen ~  $ sudo ceph pg 2.1b query

https://pastebin.com/gsKCPCjr

Best regards,

Romain Lebbadi-Breteau
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
