[ceph-users] How to recover from active+clean+inconsistent+failed_repair?
Hi all,

I have a Ceph cluster (Nautilus 14.2.11) with 3 Ceph nodes. A crash happened and all 3 Ceph nodes went down. One PG turned "active+clean+inconsistent", so I tried to repair it. After the repair, the PG in question shows "active+clean+inconsistent+failed_repair" and I cannot bring the cluster back to "active+clean".

How do I rescue the cluster? Is this a false positive?

Here are the details. All three Ceph nodes run ceph-mon, ceph-mgr, ceph-osd and ceph-mds.

1. ceph -s

   health: HEALTH_ERR
           3 scrub errors
           Possible data damage: 1 pg inconsistent
   pgs:    191 active+clean
           1   active+clean+inconsistent

2. ceph health detail

   HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
   OSD_SCRUB_ERRORS 3 scrub errors
   PG_DAMAGED Possible data damage: 1 pg inconsistent
       pg 3.b is active+clean+inconsistent, acting [0,1,2]

3. rados list-inconsistent-pg rbd

   []

4. ceph pg deep-scrub 3.b

5. ceph pg repair 3.b

6. ceph health detail

   HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
   OSD_SCRUB_ERRORS 3 scrub errors
   PG_DAMAGED Possible data damage: 1 pg inconsistent
       pg 3.b is active+clean+inconsistent+failed_repair, acting [0,1,2]

7. rados list-inconsistent-obj 3.b --format=json-pretty

   {
       "epoch": 4769,
       "inconsistents": []
   }

8. ceph pg 3.b list_unfound

   {
       "num_missing": 0,
       "num_unfound": 0,
       "objects": [],
       "more": false
   }

Appreciate your help.

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
I think this happens when a PG has 3 different copies and cannot decide which one is correct. You might have hit a very rare case. You should start with the scrub errors and check which PGs and which copies (OSDs) are affected. It sounds almost like all 3 scrub errors are on the same PG.

You might have had a combination of a crash and an OSD failure, so your situation is probably not covered by "single point of failure".

In case you have a PG with scrub errors on 2 copies, you should be able to reconstruct the PG from the third with the PG export/PG import commands.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 01 November 2020 13:16:08
To: ceph-users@ceph.io
Subject: [ceph-users] How to recover from active+clean+inconsistent+failed_repair?
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Hi Frank,

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which
> one is correct. You might have hit a very rare case. You should start with
> the scrub errors, check which PGs and which copies (OSDs) are affected. It
> sounds almost like all 3 scrub errors are on the same PG.

Yes, all 3 errors are for the same PG and on the same OSD:

2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 3:d577e975:::123675e.:head : candidate had a missing snapset key, candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is
> probably not covered by "single point of failure".

Yes, it was a complex crash; all nodes went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to
> reconstruct the PG from the third with PG export/PG import commands.

I have not done a PG export/import before. Could you send the instructions or a link for it?

Thanks
Sagara
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Hi Sagara,

looks like your situation is more complex. Before doing anything potentially destructive, you need to investigate some more. A possible interpretation (numbering just for the example):

OSD 0: PG at version 1
OSD 1: PG at version 2
OSD 2: PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward (OSD 2 PG at version 2), or OSD 1 needs to roll back (OSD 2 PG at version 1). Part of the relevant information on OSD 2 seems to be unreadable, therefore pg repair bails out.

You need to find out if you are in this situation or some other case. If you are, you need to find out somehow if you need to roll back or forward. I'm afraid in your current situation, even taking the OSD with the scrub errors down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state (has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue
  * if ddrescue manages to copy everything, copy back to a new disk and add to ceph
  * if ddrescue fails to copy everything, you could try if badblocks manages to get the disk back; ddrescue can force remappings of broken sectors (non-destructive read-write check) and it can happen that data becomes readable again, exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD

The tool for low-level operations is ceph-bluestore-tool. I never used it, so you need to look at the documentation.

If everything fails, I guess your last option is to decide on one of the copies, export it from one OSD and inject it into another one (but not any of 0,1,2!). This will establish 2 identical copies and the third one will be changed to this one automatically. Note that this may lead to data loss on objects that were in the undefined state. As far as I can see, it's only 1 object and probably possible to recover from (backup, snapshot).
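To make the roll-forward / roll-back reasoning above concrete, here is a minimal Python sketch. It only mirrors the argument in this message (two matching copies decide, an unreadable copy leaves things undecided); it is not Ceph's actual repair algorithm. The function names `pick_version` and `plan` and the version numbers are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Optional


def pick_version(versions: Dict[int, Optional[int]]) -> Optional[int]:
    """Return the version held by at least two readable copies, else None.

    `versions` maps OSD id -> PG version; None marks an unreadable copy.
    """
    readable = [v for v in versions.values() if v is not None]
    if not readable:
        return None
    version, count = Counter(readable).most_common(1)[0]
    return version if count >= 2 else None


def plan(versions: Dict[int, Optional[int]]) -> Dict[int, str]:
    """Describe what each disagreeing OSD would have to do (empty if undecided)."""
    target = pick_version(versions)
    if target is None:
        return {}
    actions = {}
    for osd, version in versions.items():
        if version is None:
            actions[osd] = "rebuild"
        elif version < target:
            actions[osd] = "roll forward"
        elif version > target:
            actions[osd] = "roll back"
    return actions


# The example numbering from the message: OSD 0 at version 1, OSD 1 at version 2.
print(plan({0: 1, 1: 2, 2: 2}))     # OSD 2 agrees with OSD 1
print(plan({0: 1, 1: 2, 2: 1}))     # OSD 2 agrees with OSD 0
print(plan({0: 1, 1: 2, 2: None}))  # OSD 2 unreadable: nothing can be decided
```

The third call is the situation suspected here: with OSD 2 unreadable there is no second matching copy, which is consistent with pg repair giving up.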
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Sagara Wijetunga
Sent: 01 November 2020 14:05:36
To: ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
Sorry, a correction to my previous message: *badblocks* (not ddrescue) can force remappings of broken sectors (non-destructive read-write check).

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Frank Schilder
Sent: 01 November 2020 14:35:35
To: Sagara Wijetunga; ceph-users@ceph.io
Subject: [ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?
[ceph-users] read latency
Hi,

AFAIK, the read latency primarily depends on HW latency, and not much can be tuned in SW. Is that right?

I ran a fio random read with iodepth 1 within a VM backed by Ceph with HDD OSDs, and here is what I got.

=================
   read: IOPS=282, BW=1130KiB/s (1157kB/s)(33.1MiB/30001msec)
    slat (usec): min=4, max=181, avg=14.04, stdev=10.16
    clat (usec): min=178, max=393831, avg=3521.86, stdev=5771.35
     lat (usec): min=188, max=393858, avg=3536.38, stdev=5771.51
=================

I checked, and the HDD average latency is 2.9 ms. The test result makes perfect sense, doesn't it?

If I want to get shorter latency (more IOPS), I will have to go for better disks, e.g. SSDs. Right?

Thanks!
Tony
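The fio numbers above can be cross-checked with a little arithmetic: at iodepth 1 only one request is in flight, so IOPS is roughly the inverse of the average latency. A minimal Python sketch follows; the function name `expected_iops` is made up for illustration, and the inputs are the `lat` average and the 2.9 ms HDD figure quoted in the message.

```python
def expected_iops(avg_latency_usec: float, iodepth: int = 1) -> float:
    """Little's law: throughput = requests in flight / average latency."""
    return iodepth / (avg_latency_usec / 1_000_000)


fio_avg_lat_usec = 3536.38  # "lat (usec) ... avg" from the fio output
hdd_avg_lat_usec = 2900.0   # 2.9 ms HDD average latency from the message

# IOPS implied by the measured end-to-end latency (fio reported IOPS=282).
print(f"from fio latency: {expected_iops(fio_avg_lat_usec):.1f} IOPS")

# Upper bound if the HDD latency were the only cost.
print(f"from HDD latency: {expected_iops(hdd_avg_lat_usec):.1f} IOPS")

# What is left over for network, software and queueing on this setup.
print(f"non-disk share:   {fio_avg_lat_usec - hdd_avg_lat_usec:.0f} usec per read")
```

Under these numbers the disk accounts for most of each read, which fits the expectation that a faster disk gives the largest gain; the remaining few hundred microseconds are what network and software tuning could address.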
[ceph-users] Re: read latency
Not exactly. You can also tune network/software.

Network - go for lower-latency interfaces. If you have 10G, go to 25G or 100G. 40G will not do though; AFAIK it is just 4x10G, so its latency is the same as 10G.

Software - it's closely tied to your network card queues and processor cores. In short, tune affinity so that the packet receive queues and the OSD processes run on the same corresponding cores. Disabling processor power-saving features helps a lot. Also watch out for NUMA interference.

But overall, all these tricks will save you less than switching from HDD to SSD.

Mon, 2 Nov 2020 at 02:45, Tony Liu wrote:
> AFAIK, the read latency primarily depends on HW latency,
> not much can be tuned in SW. Is that right?
[ceph-users] Re: read latency
Another point of confusion: read vs. random read. My understanding is that when fio does a read, it reads from the test file sequentially. When it does a random read, it reads from the test file randomly. That file read inside the VM comes down to a volume read handled by the RBD client, which distributes the read to PGs and eventually to OSDs. So a sequential file read inside the VM won't be a sequential read on the OSD disk. Is that right?

Then what difference do seq. and rand. reads make on the OSD disk? Is it a rand. read on the OSD disk in both cases? Then how to explain the performance difference between seq. and rand. read inside the VM? (Seq. read IOPS is 20x that of rand. read; Ceph has 21 HDDs on 3 nodes, 7 on each.)

Thanks!
Tony

> -----Original Message-----
> From: Vladimir Prokofev
> Sent: Sunday, November 1, 2020 5:58 PM
> Cc: ceph-users
> Subject: [ceph-users] Re: read latency
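On the sequential vs. random question, one piece that can be worked out on paper is how volume offsets map to backing objects. The sketch below assumes the RBD default object size of 4 MiB with no custom striping, a 4 KiB fio block size (inferred from 1130 KiB/s at 282 IOPS), and a hypothetical 10 GiB volume; none of these were stated in the thread. It only counts objects touched and does not model the OSD or the disk.

```python
import random

OBJECT_SIZE = 4 * 1024 * 1024  # assumed: RBD default object size (4 MiB)
BLOCK_SIZE = 4 * 1024          # assumed: fio block size, about 1130 KiB/s / 282 IOPS
VOLUME_SIZE = 10 * 1024 ** 3   # hypothetical 10 GiB volume


def object_index(offset: int, object_size: int = OBJECT_SIZE) -> int:
    """Index of the backing object that holds a given byte offset of the volume."""
    return offset // object_size


def objects_touched(offsets) -> set:
    """Distinct backing objects hit by a sequence of read offsets."""
    return {object_index(offset) for offset in offsets}


n_reads = 1024

# Sequential: block 0, 1, 2, ... from the start of the volume.
sequential = [i * BLOCK_SIZE for i in range(n_reads)]

# Random: block-aligned offsets anywhere in the volume (fixed seed for repeatability).
rng = random.Random(0)
blocks_in_volume = VOLUME_SIZE // BLOCK_SIZE
randomized = [rng.randrange(blocks_in_volume) * BLOCK_SIZE for _ in range(n_reads)]

print("sequential reads touch", len(objects_touched(sequential)), "object(s)")
print("random reads touch", len(objects_touched(randomized)), "object(s)")
```

Under these assumptions, 1024 sequential 4 KiB reads stay inside a single object (and hence a single PG and primary OSD), while the same number of random reads is spread over hundreds of objects. That locality is one plausible contributor to the gap between seq. and rand. read, together with readahead and caching along the path, but nothing in this thread confirms which effect dominates.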