[ceph-users] How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Sagara Wijetunga
Hi all

I have a Ceph cluster (Nautilus 14.2.11) with 3 Ceph nodes.
A crash happened and all 3 Ceph nodes went down.
One PG turned "active+clean+inconsistent", so I tried to repair it. After the
repair, the PG in question now shows "active+clean+inconsistent+failed_repair"
and I cannot bring the cluster back to "active+clean".
How do I rescue the cluster? Is this a false positive?
Here are the details:
All three Ceph nodes run ceph-mon, ceph-mgr, ceph-osd and ceph-mds.

1. ceph -s
health: HEALTH_ERR
        3 scrub errors
        Possible data damage: 1 pg inconsistent
pgs:    191 active+clean
        1   active+clean+inconsistent

2. ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.b is active+clean+inconsistent, acting [0,1,2]

3. rados list-inconsistent-pg rbd
[]

4. ceph pg deep-scrub 3.b

5. ceph pg repair 3.b

6. ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
    pg 3.b is active+clean+inconsistent+failed_repair, acting [0,1,2]

7. rados list-inconsistent-obj 3.b --format=json-pretty
{
    "epoch": 4769,
    "inconsistents": []
}

8. ceph pg 3.b list_unfound
{
    "num_missing": 0,
    "num_unfound": 0,
    "objects": [],
    "more": false
}
Appreciate your help.
Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Frank Schilder
I think this happens when a PG has 3 different copies and cannot decide which
one is correct. You might have hit a very rare case. You should start with the
scrub errors and check which PGs and which copies (OSDs) are affected (a couple
of commands that might help are below). It sounds almost like all 3 scrub
errors are on the same PG.
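
A couple of commands that might help locate the bad copy (PG id taken from your
output; log paths may differ on your installation):

ceph health detail
rados list-inconsistent-obj 3.b --format=json-pretty
# the scrub/repair errors also end up in the cluster log and in the primary
# OSD's log:
grep '\[ERR\]' /var/log/ceph/ceph.log
grep '\[ERR\]' /var/log/ceph/ceph-osd.0.log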

You might have had a combination of a crash and an OSD failure; your situation
is probably not covered by "single point of failure".

In case you have a PG with scrub errors on 2 copies, you should be able to 
reconstruct the PG from the third with PG export/PG import commands.

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Sagara Wijetunga
Hi Frank

Thanks for the reply.

> I think this happens when a PG has 3 different copies and cannot decide which 
> one is correct. You might have hit a very rare case. You should start with 
> the scrub errors, check which PGs and which copies (OSDs) are affected. It 
> sounds almost like all 3 scrub errors are on the same PG.
Yes, all 3 errors are for the same PG and on the same OSD:
2020-11-01 18:25:09.39 osd.0 [ERR] 3.b shard 2 soid 
3:d577e975:::123675e.:head : candidate had a missing snapset key, 
candidate had a missing info key
2020-11-01 18:25:09.42 osd.0 [ERR] 3.b soid 
3:d577e975:::123675e.:head : failed to pick suitable object info
2020-11-01 18:26:33.496255 osd.0 [ERR] 3.b repair 3 errors, 0 fixed

> You might have had a combination of crash and OSD fail, your situation is 
> probably not covered by "single point of failure".
Yes, it was a complex crash; all nodes went down.

> In case you have a PG with scrub errors on 2 copies, you should be able to 
> reconstruct the PG from the third with PG export/PG import commands.
I have not done a PG export/import before. Would you mind sending the
instructions or a link for them?

Thanks
Sagara
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Frank Schilder
Hi Sagara,

looks like your situation is more complex. Before doing anything potentially 
destructive, you need to investigate some more. A possible interpretation 
(numbering just for the example):

OSD 0 PG at version 1
OSD 1 PG at version 2
OSD 2 PG has scrub error

Depending on the version of the PG on OSD 2, either OSD 0 needs to roll forward
(if OSD 2 has the PG at version 2), or OSD 1 needs to roll back (if OSD 2 has
it at version 1). Part of the relevant information on OSD 2 seems to be
unreadable, which is why pg repair bails out.

You need to find out if you are in this situation or some other case. If you 
are, you need to find out somehow if you need to roll back or forward. I'm 
afraid in your current situation, even taking the OSD with the scrub errors 
down will not rebuild the PG.

I would probably try:

- find out with smartctl if the OSD with scrub errors is in a pre-fail state 
(has remapped sectors)
- if it is:
  * take it down and try to make a full copy with ddrescue (a rough command
sketch follows below this list)
  * if ddrescue manages to copy everything, copy it back to a new disk and add
that to ceph
  * if ddrescue fails to copy everything, you could try whether badblocks
manages to get the disk back; ddrescue can force remappings of broken sectors
(non-destructive read-write check) and it can happen that data becomes readable
again; exchange the disk as soon as possible thereafter
- if the disk is healthy:
  * try to find out if you can deduce the state of the copies on every OSD
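
For the smartctl/ddrescue part, roughly something like this (device names are
examples only; clone onto a spare disk of at least the same size):

# look for reallocated / pending sectors on the suspect drive
smartctl -a /dev/sdX | grep -Ei 'reallocated|pending|uncorrect'
# clone the whole device onto a spare, keeping a map file so the copy can resume
ddrescue -f /dev/sdX /dev/sdY /root/sdX.map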

The tool for these low-level export/import operations is ceph-objectstore-tool.
I never used it, so you need to look at the documentation.

If everything fails, I guess your last option is to decide on one of the
copies, export it from one OSD and inject it into another one (but not any of
0,1,2!). This will establish 2 identical copies, and the third one will be
changed to match automatically. Note that this may lead to data loss on
objects that were in the undefined state. As far as I can see, it's only 1
object and probably possible to recover from (backup, snapshot).
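
A very rough sketch of the export/import you asked about (from memory and
untested; OSD ids, paths and the file name are placeholders, the OSDs involved
must be stopped while you run it, and please check it against the
ceph-objectstore-tool documentation before doing any of this):

systemctl stop ceph-osd@<good-osd>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<good-osd> \
    --pgid 3.b --op export --file /root/pg3.b.export
systemctl stop ceph-osd@<target-osd>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<target-osd> \
    --op import --file /root/pg3.b.export
systemctl start ceph-osd@<good-osd>
systemctl start ceph-osd@<target-osd>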

Best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: How to recover from active+clean+inconsistent+failed_repair?

2020-11-01 Thread Frank Schilder
sorry: *badblocks* can force remappings of broken sectors (non-destructive 
read-write check)

=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] read latency

2020-11-01 Thread Tony Liu
Hi,

AFAIK, the read latency primarily depends on HW latency;
not much can be tuned in SW. Is that right?

I ran a fio random read test with iodepth 1 within a VM backed by
Ceph with HDD OSDs, and here is what I got.
=
   read: IOPS=282, BW=1130KiB/s (1157kB/s)(33.1MiB/30001msec)
slat (usec): min=4, max=181, avg=14.04, stdev=10.16
clat (usec): min=178, max=393831, avg=3521.86, stdev=5771.35
 lat (usec): min=188, max=393858, avg=3536.38, stdev=5771.51
=
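
The fio invocation was roughly along these lines (the target file, size and
runtime are just examples):

fio --name=randread --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=1 --numjobs=1 --runtime=30 --time_based \
    --filename=/data/fio.test --size=1G
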
I checked that the HDD average latency is 2.9 ms, so the test
result seems to make perfect sense, doesn't it?

If I want to get lower latency (more IOPS), I will have to go
for better disks, e.g. SSDs. Right?


Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: read latency

2020-11-01 Thread Vladimir Prokofev
Not exactly. You can also tune the network/software.
Network - go for lower-latency interfaces. If you have 10G, go to 25G or
100G. 40G will not do though; AFAIK it's just 4x10G, so its latency is the
same as 10G.
Software - it's closely tied to your network card queues and processor
cores. In short, tune affinity so that the packet receive queues and the OSD
processes run on the same corresponding cores. Disabling processor
power-saving features helps a lot. Also watch out for NUMA interference.
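
For example, something along these lines on each OSD node (interface name and
IRQ/core numbers are examples only; adjust them to your hardware):

# keep the CPUs at full clock instead of letting them scale down or sleep
cpupower frequency-set -g performance   # or: tuned-adm profile latency-performance
# check which NUMA node the NIC sits on, then pin its receive-queue IRQs
# (and the OSD processes) to cores on that node
cat /sys/class/net/eth0/device/numa_node
echo 4 > /proc/irq/120/smp_affinity_list
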
But overall all these tricks will save you less than switching from HDD to
SSD.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: read latency

2020-11-01 Thread Tony Liu
Another point of confusion: read vs. random read. My understanding is that
when fio does a read test, it reads the test file sequentially, and when it
does a random read test, it reads the file at random offsets.
A file read inside the VM comes down to a volume read handled by the RBD
client, which distributes the reads across PGs and eventually OSDs. So a
sequential file read inside the VM won't be a sequential read on the OSD disk.
Is that right?
Then what difference do sequential and random reads make on the OSD disk?
Is it random reads on the OSD disk in both cases?
Then how to explain the performance difference between sequential and random
reads inside the VM? (Sequential read IOPS is 20x random read; the Ceph
cluster has 21 HDD OSDs on 3 nodes, 7 on each.)
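
(Could the RBD object size be part of the explanation? For example, checking it
with something like this, where the pool/image name is just an example:

rbd info rbd/myvolume | grep -E 'order|object'

With the default 4 MiB objects, about 1024 consecutive 4K sequential reads stay
inside the same object, i.e. on the same primary OSD and close together on its
disk, while 4K random reads jump to a different object and OSD almost every
time.)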

Thanks!
Tony
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io