On Wed, May 11, 2016 at 6:53 PM, <george.vasilaka...@stfc.ac.uk> wrote:
> Hey Dan,
>
> This is on Hammer 0.94.5. osd.52 was always on a problematic machine and,
> when this happened, it had less data on its local disk than the other OSDs.
> I've tried adapting that blog post's solution to this situation, to no avail.
Do you have a log of what you did and why it didn't work? I guess the
solution to your issue lies in a version of that procedure (a rough sketch
of the steps involved is appended at the bottom of this message).

-- dan

> I've tried things like looking at all the probing OSDs in the query output
> and importing the data from one copy to all of them to get it to be
> consistent. One of the major red flags here was that, when I looked at the
> original acting set's disks, I found each OSD had a different amount of
> data for the same PG; there is at least one PG here where 52 (the primary
> for all four) actually had about 1GB (~27%) less data. Everything has just
> been really inconsistent.
>
> Here's hoping Cunningham will come to the rescue.
>
> Cheers,
>
> George
>
> ________________________________________
> From: Dan van der Ster [d...@vanderster.com]
> Sent: 11 May 2016 17:28
> To: Vasilakakos, George (STFC,RAL,SC)
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Incomplete PGs, how do I get them back without data loss?
>
> Hi George,
>
> Which version of Ceph is this? I've never had incomplete PGs stuck like
> this before. AFAIK it means that osd.52 would need to be brought up before
> you can restore those PGs.
>
> Perhaps you'll need ceph-objectstore-tool to help dump osd.52 and bring up
> its data elsewhere. A quick check on this list pointed to
> https://ceph.com/community/incomplete-pgs-oh-my/ -- did you try that?
>
> Or perhaps I'm spewing enough nonsense here that Cunningham's Law will
> bring you the solution.
>
> Cheers, Dan
>
>
> On Thu, May 5, 2016 at 8:21 PM, <george.vasilaka...@stfc.ac.uk> wrote:
>> Hi folks,
>>
>> I've got a serious issue with a Ceph cluster that's used for RBD.
>>
>> There are 4 PGs stuck in an incomplete state and I'm trying to repair
>> this problem, to no avail.
>>
>> Here's ceph status:
>>      health HEALTH_WARN
>>             4 pgs incomplete
>>             4 pgs stuck inactive
>>             4 pgs stuck unclean
>>             100 requests are blocked > 32 sec
>>      monmap e13: 3 mons at ...
>>             election epoch 2084, quorum 0,1,2 mon4,mon5,mon3
>>      osdmap e154083: 203 osds: 197 up, 197 in
>>       pgmap v37369382: 9856 pgs, 5 pools, 20932 GB data, 22321 kobjects
>>             64871 GB used, 653 TB / 716 TB avail
>>                 9851 active+clean
>>                    4 incomplete
>>                    1 active+clean+scrubbing
>>
>> The 4 PGs all have the same primary OSD, which is on a host that had its
>> OSDs turned off as it was quite flaky.
>>
>> 1.1bdb  incomplete  [52,100,130]  52  [52,100,130]  52
>> 1.5c2   incomplete  [52,191,109]  52  [52,191,109]  52
>> 1.f98   incomplete  [52,92,37]    52  [52,92,37]    52
>> 1.11dc  incomplete  [52,176,12]   52  [52,176,12]   52
>>
>> One thing that strikes me as odd is that once osd.52 is taken out, these
>> sets change completely.
>> The situation currently is that, for each of these PGs, the three OSDs
>> have different amounts of data. They all have similar but different
>> amounts, with osd.52 having the smallest amount (not by too much though)
>> in each case.
>>
>> Querying those PGs doesn't return a response after a few minutes, and
>> manually triggering scrubs or repairs on them does nothing.
>> I've lowered the min_size from 2 to 1 but I'm not seeing any activity to
>> fix this.
>>
>> Is there something that can be done to recover without losing that data
>> (losing it would mean each VM has a 75% chance of being destroyed)?