The PGs are stale, down, inactive *because* the OSDs don't start.
Your main effort should be to bring the OSDs back up, without purging
or zapping or anything like that.
(Currently your cluster is down, but there is hope of recovering it.
If you start purging things, you risk permanent data loss.)

More below.

On Fri, Apr 1, 2022 at 9:38 AM Fulvio Galeazzi <fulvio.galea...@garr.it> wrote:
>
> Ciao Dan,
>      thanks for your time!
>
> So you are suggesting that my problems with PG 85.25 may somehow resolve
> if I manage to bring up the three OSDs currently "down" (possibly due to
> PG 85.12, and other PGs)?

Yes, that's exactly what I'm suggesting.

> Looking for the string 'start interval does not contain the required
> bound' I found similar errors in the three OSDs:
> osd.158: 85.12s0
> osd.145: 85.33s0
> osd.121: 85.11s0

Is that log also for PG 85.12 on the other OSDs?
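
If it helps to cross-check, something like the following (just a
sketch, assuming the default log location and naming for a cluster
called cephpa1; adjust the path if yours differs) lists which OSD logs
contain that abort and which shards they mention:

  grep -l 'start interval does not contain the required bound' \
      /var/log/ceph/cephpa1-osd.*.log
  grep -h 'start interval does not contain the required bound' \
      /var/log/ceph/cephpa1-osd.*.log | grep -o '85\.[0-9a-f]\+s[0-9]\+' | sort -u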

> Here is the output of "pg 85.12 query":
>         https://pastebin.ubuntu.com/p/ww3JdwDXVd/
>   and its status (also showing the other 85.XX, for reference):

This is very weird:

    "up": [
        2147483647,
        2147483647,
        2147483647,
        2147483647,
        2147483647
    ],
    "acting": [
        67,
        91,
        82,
        2147483647,
        112
    ],

1. Right now, do the following:
  ceph osd set norebalance
That will prevent PGs from moving from one OSD to another *unless*
they are degraded.
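
You can double-check that the flag took effect, and remember to clear
it once recovery is done; roughly:

  ceph osd dump | grep flags      # should now include 'norebalance'
  # much later, once the cluster is healthy again:
  ceph osd unset norebalance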

2. My theory about what happened here: your crush rule change ("osd ->
host", below) basically asked for every PG in the pool to be moved.
Some glitch happened and broken parts of PG 85.12 ended up on some
OSDs, and those broken shards are now making those OSDs crash.
85.12 is "fine", I mean active, right now because enough complete
shards of it exist on other OSDs.
The fact that "up" above shows '2147483647' (CRUSH's value for "none")
in every slot means your new crush rule currently cannot map this PG
to any OSDs at all. Let's deal with fixing that later.

3. Question: what is the output of `ceph osd pool ls detail | grep
csd-dataonly-ec-pool`? If it shows `min_size 3`, then that is part of
the root cause of this outage. At the end of this thread, *only after
everything is recovered and no PGs are undersized/degraded*, you will
need to raise it with `ceph osd pool set csd-dataonly-ec-pool min_size 4`.
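
For context (this is an assumption on my side that the pool is k=3,
m=2, which the 5-shard PGs and `max_size 5` in your rule suggest):
with min_size equal to k, PGs stay active with zero surviving
redundancy, and anything written in that state can be lost if one more
shard fails; the usual recommendation is min_size = k+1. Roughly:

  ceph osd pool ls detail | grep csd-dataonly-ec-pool
  # to confirm k/m, look up the pool's EC profile:
  ceph osd pool get csd-dataonly-ec-pool erasure_code_profile
  ceph osd erasure-code-profile get <profile-name-from-previous-command>
  # later, only once every PG is active+clean again:
  ceph osd pool set csd-dataonly-ec-pool min_size 4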

4. The immediate goal should be to try to get osd.158 to start up, by
"removing" the corrupted part of PG 85.12 from it.
IF we can get osd.158 started, then the same approach should work for
the other OSDs.
From your previous log, osd.158 has a broken piece of PG 85.12. Let's
export-remove it:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/cephpa1-158/ \
    --op export-remove --pgid 85.12s0 > osd.158-85.12s0.bin

Please do that, then try to start osd.158, and report back here.
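
For completeness, the restart/verify steps would look roughly like
this (the exact systemd unit and log path can differ for a non-default
cluster name like cephpa1, so adjust to whatever you normally use on
that host):

  ls -lh osd.158-85.12s0.bin                  # keep this export safe, it is your copy of that shard
  systemctl start ceph-osd@158
  ceph osd stat                               # wait for osd.158 to be reported up
  tail -f /var/log/ceph/cephpa1-osd.158.log   # confirm it no longer aborts on 85.12s0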

Two more questions below...

>
> 85.11  39501 objects, 0 degraded, 0 misplaced, 0 unfound, 165479411712 bytes, omap 0/0, log 3000
>        stale+active+clean (3d), version 606021'532631, reported 617659:1827554
>        up [124,157,68,72,102]p124, acting [124,157,68,72,102]p124
>        scrub/deep-scrub 2022-03-28 07:21:00.566032
> 85.12  39704 objects, 39704 degraded, 158816 misplaced, 0 unfound, 166350008320 bytes, omap 0/0, log 3028
>        active+undersized+degraded+remapped (3d), version 606021'573200, reported 620336:1839924
>        up [2147483647,2147483647,2147483647,2147483647,2147483647]p-1
>        acting [67,91,82,2147483647,112]p67
>        scrub 2022-03-15 03:25:28.478280, deep-scrub 2022-03-12 19:10:45.866650
> 85.25  39402 objects, 0 degraded, 0 misplaced, 0 unfound, 165108592640 bytes, omap 0/0, log 3098
>        stale+down+remapped (3d), version 606021'521273, reported 618930:1734492
>        up [2147483647,2147483647,2147483647,2147483647,2147483647]p-1
>        acting [2147483647,2147483647,96,2147483647,2147483647]p96
>        scrub 2022-03-15 04:08:42.561720, deep-scrub 2022-03-09 17:05:34.205121
> 85.33  39319 objects, 0 degraded, 0 misplaced, 0 unfound, 164740796416 bytes, omap 0/0, log 3000
>        stale+active+clean (3d), version 606021'513259, reported 617659:2125167
>        up [174,112,85,102,124]p174, acting [174,112,85,102,124]p174
>        scrub/deep-scrub 2022-03-28 07:21:12.097873
>
> So 85.11 and 85.33 do not look bad, after all: why are the relevant OSDs
> complaining? Is there a way to force them (OSDs) to forget about the
> chunks they possess, as apparently those have already safely migrated
> elsewhere?
>
> Indeed 85.12 is not really healthy...
> As for chunks of 85.12 and 85.25, the 3 down OSDs have:
> osd.121
>         85.12s3
>         85.25s3
> osd.158
>         85.12s0
> osd.145
>         none
> I guess I can safely purge osd.145 and re-create it, then.

No!!! It contains crucial data for *other* PGs!

>
>
> As for the history of the pool, this is an EC pool with metadata in a
> SSD-backed replicated pool. At some point I realized I had made a
> mistake in the allocation rule for the "data" part, so I changed the
> relevant rule to:
>
> ~]$ ceph --cluster cephpa1 osd lspools | grep 85
> 85 csd-dataonly-ec-pool
> ~]$ ceph --cluster cephpa1 osd pool get csd-dataonly-ec-pool crush_rule
> crush_rule: csd-data-pool
>
> rule csd-data-pool {
>          id 5
>          type erasure
>          min_size 3
>          max_size 5
>          step set_chooseleaf_tries 5
>          step set_choose_tries 100
>          step take default class big
>          step choose indep 0 type host  <--- this was "osd", before
>          step emit
> }

Can you please share the output of `ceph osd tree` ?

We need to understand why crush is not working any more for your pool.
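
In the meantime you can also test the rule offline. This is just a
sketch, assuming rule id 5 as shown in your rule above and the 5
shards this pool uses:

  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt     # human-readable map; check the 'big' device class and host buckets
  crushtool -i crushmap.bin --test --rule 5 --num-rep 5 \
      --show-mappings --show-bad-mappings       # bad mappings mean crush cannot find 5 hosts with class 'big'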

>
> At the time I changed the rule, there was no 'down' PG, all PGs in the
> cluster were 'active' plus possibly some other state (remapped,
> degraded, whatever) as I had added some new disk servers few days before.

Never make crush rule changes while any PG is degraded, remapped, or
otherwise not clean! All PGs must be active+clean before you consider
a big change like injecting a new crush rule!
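
A quick way to verify that before any future rule change, for example:

  ceph pg stat          # should show only active+clean
  ceph health detail    # and no degraded/remapped/undersized warnings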

Cheers, Dan

> The rule change, of course, caused some data movement and after a while
> I found those three OSDs down.
>
>    Thanks!
>
>                         Fulvio
>
>
> On 3/30/22 16:48, Dan van der Ster wrote:
> > Hi Fulvio,
> >
> > I'm not sure why that PG doesn't register.
> > But let's look into your log. The relevant lines are:
> >
> >    -635> 2022-03-30 14:49:57.810 7ff904970700 -1 log_channel(cluster)
> > log [ERR] : 85.12s0 past_intervals [616435,616454) start interval does
> > not contain the required bound [605868,616454) start
> >
> >    -628> 2022-03-30 14:49:57.810 7ff904970700 -1 osd.158 pg_epoch:
> > 616454 pg[85.12s0( empty local-lis/les=0/0 n=0 ec=616435/616435 lis/c
> > 605866/605866 les/c/f 605867/605868/0 616453/616454/616454)
> > [158,168,64,102,156]/[67,91,82,121,112]p67(0) r=-1 lpr=616454
> > pi=[616435,616454)/0 crt=0'0 remapped NOTIFY mbc={}] 85.12s0
> > past_intervals [616435,616454) start interval does not contain the
> > required bound [605868,616454) start
> >
> >    -355> 2022-03-30 14:49:57.816 7ff904970700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
> > In function 'void PG::check_past_interval_bounds() const' thread
> > 7ff904970700 time 2022-03-30 14:49:57.811165
> >
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/gigantic/release/14.2.22/rpm/el7/BUILD/ceph-14.2.22/src/osd/PG.cc:
> > 956: ceph_abort_msg("past_interval start interval mismatch")
> >
> >
> > What is the output of `ceph pg 85.12 query` ?
> >
> > What's the history of that PG? was it moved around recently prior to this 
> > crash?
> > Are the other down osds also hosting broken parts of PG 85.12 ?
> >
> > Cheers, Dan
> >