When I had PGs stuck with down_osds_we_would_probe, there was no way I could convince Ceph to give up on the data while those OSDs were down. I tried ceph osd lost, ceph pg mark_unfound_lost, and ceph pg force_create_pg; none of them did anything. I eventually re-formatted the down OSD and brought it back online. It started backfilling, and down_osds_we_would_probe emptied out. Once that happened, ceph pg force_create_pg finally worked. It didn't work right away, though: if I recall correctly, the PGs went into the creating state and stayed there for many hours. They finally created when another OSD restarted.
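For the archives, the sequence looked roughly like this, borrowing Lincoln's ids (pg 2.525, osd.85) from the thread below as stand-ins. This is from memory, so treat it as a sketch rather than a recipe; in particular, I don't remember whether I gave mark_unfound_lost 'revert' or 'delete'. None of these had any effect while the dead OSD was down:

# ceph osd lost 85 --yes-i-really-mean-it
# ceph pg 2.525 mark_unfound_lost revert
# ceph pg force_create_pg 2.525

Only after re-formatting the dead disk and bringing the OSD back up under its old id did the probe list drain, and then the force-create went through:

# ceph pg 2.525 query
(wait until "down_osds_we_would_probe" is empty in the recovery_state section)
# ceph pg force_create_pg 2.525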
On Tue, Oct 21, 2014 at 9:59 AM, Lincoln Bryant <linco...@uchicago.edu> wrote:

> A small update on this: I rebooted all of the Ceph nodes and was able to
> then query one of the misbehaving pgs.
>
> I've attached the query for pg 2.525.
>
> There are some things like this in the peer info:
>
>     "up": [],
>     "acting": [],
>     "up_primary": -1,
>     "acting_primary": -1},
>
> I also see things like:
>
>     "down_osds_we_would_probe": [
>         85],
>
> But I don't have an OSD 85:
>
>     85    3.64    osd.85    DNE
>
> # ceph osd rm osd.85
> osd.85 does not exist.
> # ceph osd lost 85 --yes-i-really-mean-it
> osd.85 is not down or doesn't exist
>
> Any help would be greatly appreciated.
>
> Thanks,
> Lincoln
>
> On Oct 21, 2014, at 9:39 AM, Lincoln Bryant wrote:
>
> > Hi cephers,
> >
> > We have two pgs that are stuck in 'incomplete' state across two different pools:
> >
> > pg 2.525 is stuck inactive since forever, current state incomplete, last acting [55,89]
> > pg 0.527 is stuck inactive since forever, current state incomplete, last acting [55,89]
> > pg 0.527 is stuck unclean since forever, current state incomplete, last acting [55,89]
> > pg 2.525 is stuck unclean since forever, current state incomplete, last acting [55,89]
> > pg 0.527 is incomplete, acting [55,89]
> > pg 2.525 is incomplete, acting [55,89]
> >
> > Basically, we ran into a problem where we had 2x replication and 2 disks on different machines died near-simultaneously, and my pgs were stuck in 'down+peering'. I had to do some combination of declaring the OSDs as lost and running 'force_create_pg'. I realize the data on those pgs is now lost, but I'm stuck as to how to get the pgs out of 'incomplete'.
> >
> > I also see many ops blocked on the primary OSD for these:
> >
> > 100 ops are blocked > 67108.9 sec
> > 100 ops are blocked > 67108.9 sec on osd.55
> >
> > However, this is a new disk. If I 'ceph osd out osd.55', the pgs move to another OSD and the new primary gets blocked ops. Restarting osd.55 does nothing. Other pgs on osd.55 seem okay.
> >
> > I would attach the result of a query, but if I run a 'ceph pg 2.525 query', the command totally hangs until I ctrl-c:
> >
> > ceph pg 2.525 query
> > ^CError EINTR: problem getting command descriptions from pg.2.525
> >
> > I've also tried 'ceph pg repair 2.525', which does nothing.
> >
> > Any thoughts here? Are my pools totally sunk?
> >
> > Thanks,
> > Lincoln
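Lincoln: the extra wrinkle in your case is that osd.85 is DNE, i.e. already deleted from the map, which I'd guess is why ceph osd lost refuses to act on it. The equivalent of my workaround would be giving the cluster a live osd.85 to probe again. This is an untested guess on my part, but since OSD ids are handed out lowest-free, a bare create should give you 85 back if nothing below it is unused:

# ceph osd create
(should print 85; then build a fresh OSD with that id, start it, and let it backfill before retrying force_create_pg)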
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com