Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

Gregory Farnum Wed, 18 Oct 2017 14:14:40 -0700

I've created a ticket http://tracker.ceph.com/issues/21833


Hopefully we can work this out.

On Mon, Oct 16, 2017 at 6:03 PM Dejan Lesjak <dejan.les...@ijs.si> wrote:

>
> > On 17. okt. 2017, at 00:59, Gregory Farnum <gfar...@redhat.com> wrote:
> >
> > On Mon, Oct 16, 2017 at 3:49 PM Dejan Lesjak <dejan.les...@ijs.si> wrote:
> >
> > > On 17. okt. 2017, at 00:23, Gregory Farnum <gfar...@redhat.com> wrote:
> > >
> > > On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak <dejan.les...@ijs.si> wrote:
> > > On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> > > > Hi,
> > > >
> > > > During rather high load and rebalancing, a couple of our OSDs crashed
> > > > and they fail to start. This is from the log:
> > > >
> > > >     -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs opened 370 pgs
> > > >     -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123 build_past_intervals_parallel over 439159-439159
> > > >      0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1 /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc: In function 'void OSD::build_past_intervals_parallel()' thread 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> > > > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc: 4177: FAILED assert(p.same_interval_since)
> > > >
> > > >  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable)
> > > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x102) [0x55e4caa18592]
> > > >  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
> > > >  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
> > > >  4: (OSD::init()+0x2227) [0x55e4ca467327]
> > > >  5: (main()+0x2d5a) [0x55e4ca379b1a]
> > > >  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
> > > >  7: (_start()+0x2a) [0x55e4ca4039aa]
> > > >  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > >
> > > > Does anybody know how to fix or further debug this?
> > >
> > > Bumped logging to 10 and posted the log to
> > > https://pastebin.com/raw/StTeYWRt
> > > From "10.1fces2 needs 439159-0" it seems the OSD (osd.1) gets stuck at
> > > PG 10.1fce. Yet the pg map doesn't show osd.1 for this PG:
> > >
> > > # ceph pg map 10.1fce
> > > osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting [110,213,132,182]
> > >
> > > Hmm, this is odd. What caused your rebalancing exactly? Can you turn
> > > on the OSD with debugging set to 20, and then upload the log file using
> > > ceph-post-file?
> > >
> > > The specific assert you're hitting here is supposed to cope with PGs
> > > that have been imported (via the ceph-objectstore-tool). But obviously
> > > something has gone wrong here.
> >
> > It started when we bumped the number of PGs for a pool (from 2048 to 8192).
> > I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba
> >
> > It actually seems similar to http://tracker.ceph.com/issues/21142 in
> > that the PG found in the log seems empty when checked with
> > ceph-objectstore-tool, and removing it allows the OSD to start. At least
> > on one OSD; I’ve not tried that yet on all of the failed ones.
> >
> > Ah. I bet we are default-constructing the "child" PGs from split with
> > that value set to zero, so it's incorrectly being flagged for later use.
> > David, does that make sense to you? Do you think it's reasonable to fix it
> > by just checking for other default-initialized values as part of that
> > branch check?
> > (I note that this code got removed once Luminous branched, so hopefully
> > there's a simple fix we can apply!)
> >
> > Dejan, did you make sure the OSD you tried that on has re-created the
> > removed PG and populated it with data? If so I think you ought to be fine
> > removing any empty PGs which are causing this assert.
>
> Well, after a while the PG apparently does get recreated on the OSD, but
> unfortunately the assert happens again.
>
>
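
The `ceph pg map` check quoted in the thread (does osd.1 appear in the up or
acting set of 10.1fce?) can be scripted when several OSDs and PGs need the
same check. The following is only a minimal sketch, assuming the output keeps
the one-line shape shown in the thread; `parse_pg_map` and `osd_serves_pg` are
hypothetical helper names, not part of Ceph.

```python
import re

# Assumed shape of `ceph pg map <pgid>` output, copied from the thread:
#   osdmap e443665 pg 10.1fce (10.1fce) -> up [110,...] acting [110,...]
PG_MAP_RE = re.compile(
    r"osdmap\s+e(?P<epoch>\d+)\s+pg\s+(?P<pgid>\S+)\s+\(\S+\)\s+->\s+"
    r"up\s+\[(?P<up>[\d,]*)\]\s+acting\s+\[(?P<acting>[\d,]*)\]"
)


def parse_pg_map(text):
    """Return (epoch, pgid, up, acting) parsed from pg map output text."""
    # Mail clients and terminals may wrap the line, so collapse whitespace.
    flat = " ".join(text.split())
    match = PG_MAP_RE.search(flat)
    if match is None:
        raise ValueError("unrecognised pg map output: %r" % flat)

    def to_ids(field):
        return [int(item) for item in field.split(",") if item]

    return (
        int(match.group("epoch")),
        match.group("pgid"),
        to_ids(match.group("up")),
        to_ids(match.group("acting")),
    )


def osd_serves_pg(osd_id, text):
    """True if osd_id is in the up or acting set of the mapped PG."""
    _, _, up, acting = parse_pg_map(text)
    return osd_id in up or osd_id in acting


# Sample taken verbatim from the thread, including the wrapped line.
sample = """osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
[110,213,132,182]"""

print(osd_serves_pg(1, sample))    # False: osd.1 is not mapped to this PG
print(osd_serves_pg(110, sample))  # True
```

A PG that still exists on an OSD's disk while that OSD is absent from both sets
matches the situation described above; this only reports the mapping and says
nothing about whether removing such a PG is safe.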
