When I had PGs stuck with down_osds_we_would_probe, there was no way I
could convince Ceph to give up on the data while those OSDs were down.

I tried ceph osd lost, ceph pg mark_unfound_lost, ceph pg force_create_pg.
None of them would do anything.
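
For reference, these were roughly the command forms I was trying. The pg and osd ids below (2.525 and 85) are only placeholders borrowed from Lincoln's output, so substitute your own, and double-check the exact syntax against your Ceph version:

# ceph osd lost 85 --yes-i-really-mean-it
# ceph pg 2.525 mark_unfound_lost revert
# ceph pg force_create_pg 2.525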

I eventually re-formatted the down OSD and brought it back online. It
started backfilling, and the down_osds_we_would_probe list emptied out.
Once that happened, ceph pg force_create_pg finally worked. It didn't work
right away, though; if I recall correctly, the PGs went into the creating
state and stayed there for many hours. They finally finished creating when
another OSD restarted.
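
In case it helps, a rough sketch of how I watched for the probe list to
clear and then retried the create (again, 2.525 is just an example pg id):

# ceph pg 2.525 query | grep -A 2 down_osds_we_would_probe
# ceph pg force_create_pg 2.525
# ceph -w     # then watch for the pg to leave 'creating'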

On Tue, Oct 21, 2014 at 9:59 AM, Lincoln Bryant <linco...@uchicago.edu>
wrote:

> A small update on this: I rebooted all of the Ceph nodes and was then
> able to query one of the misbehaving pgs.
>
> I've attached the query for pg 2.525.
>
> There are some things like this in the peer info:
>
>               "up": [],
>               "acting": [],
>               "up_primary": -1,
>               "acting_primary": -1},
>
>
> I also see things like:
>           "down_osds_we_would_probe": [
>                 85],
>
> But I don't have an OSD 85:
>         85      3.64                    osd.85  DNE
>
> # ceph osd rm osd.85
> osd.85 does not exist.
> # ceph osd lost 85 --yes-i-really-mean-it
> osd.85 is not down or doesn't exist
>
> Any help would be greatly appreciated.
>
> Thanks,
> Lincoln
>
> On Oct 21, 2014, at 9:39 AM, Lincoln Bryant wrote:
>
> > Hi cephers,
> >
> > We have two pgs that are stuck in 'incomplete' state across two different pools:
> > pg 2.525 is stuck inactive since forever, current state incomplete, last acting [55,89]
> > pg 0.527 is stuck inactive since forever, current state incomplete, last acting [55,89]
> > pg 0.527 is stuck unclean since forever, current state incomplete, last acting [55,89]
> > pg 2.525 is stuck unclean since forever, current state incomplete, last acting [55,89]
> > pg 0.527 is incomplete, acting [55,89]
> > pg 2.525 is incomplete, acting [55,89]
> >
> > Basically, we ran into a problem where we had 2x replication and two
> > disks on different machines died near-simultaneously, leaving my pgs
> > stuck in 'down+peering'. I had to do some combination of declaring the
> > OSDs as lost and running 'force_create_pg'. I realize the data on those
> > pgs is now lost, but I'm stuck as to how to get the pgs out of
> > 'incomplete'.
> >
> > I also see many ops blocked on the primary OSD for these:
> > 100 ops are blocked > 67108.9 sec
> > 100 ops are blocked > 67108.9 sec on osd.55
> >
> > However, this is a new disk. If I 'ceph osd out osd.55', the pgs move
> > to another OSD and the new primary gets blocked ops. Restarting osd.55
> > does nothing. Other pgs on osd.55 seem okay.
> >
> > I would attach the result of a query, but if I run 'ceph pg 2.525
> > query', the command totally hangs until I ctrl-C:
> >
> > ceph pg 2.525 query
> > ^CError EINTR: problem getting command descriptions from pg.2.525
> >
> > I've also tried 'ceph pg repair 2.525', which does nothing.
> >
> > Any thoughts here? Are my pools totally sunk?
> >
> > Thanks,
> > Lincoln