Just to update this issue: I stopped OSD.6, removed the PG from its disk, and restarted the OSD. Ceph rebuilt the object and the cluster went back to HEALTH_OK.
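For the archives, this is roughly what that boils down to on the command line. The filestore path and the init invocation below are assumptions based on a default firefly-era install (not copied from this cluster), so adjust the OSD id, PG id and paths for your own setup, and only do this while all the other replicas are healthy:

    # keep CRUSH from rebalancing while the OSD is briefly down
    ceph osd set noout

    # stop the OSD that holds the bad copy (sysvinit-style invocation assumed)
    sudo service ceph stop osd.6

    # move the PG's directory out of the way instead of deleting it outright
    # (default filestore layout assumed: /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head)
    sudo mv /var/lib/ceph/osd/ceph-6/current/9.180_head /root/9.180_head.bak

    # bring the OSD back; the PG should backfill from the remaining replicas
    sudo service ceph start osd.6
    ceph osd unset noout

    # once backfill completes, repair and re-verify the PG
    ceph pg repair 9.180
    ceph pg deep-scrub 9.180
    ceph health detail

Keeping the moved directory around until the deep-scrub comes back clean makes it easy to roll back if anything goes wrong.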
During the weekend the disk for OSD.6 started giving SMART errors and will be replaced.

Thanks for your help Greg. I've opened a bug report in the tracker.

On Fri, Dec 12, 2014 at 9:53 PM, Gregory Farnum <g...@gregs42.com> wrote:
>
> [Re-adding the list]
>
> Yeah, so "shard 6" means that it's osd.6 which has the bad data.
> Apparently pg repair doesn't recover from this class of failures; if
> you could file a bug that would be appreciated.
> But anyway, if you delete the object in question from OSD 6 and run a
> repair on the pg again it should recover just fine.
> -Greg
>
> On Fri, Dec 12, 2014 at 1:45 PM, Luis Periquito <periqu...@gmail.com> wrote:
> > Running firefly 0.80.7 with replicated pools, with 4 copies.
> >
> > On 12 Dec 2014 19:20, "Gregory Farnum" <g...@gregs42.com> wrote:
> >>
> >> What version of Ceph are you running? Is this a replicated or
> >> erasure-coded pool?
> >>
> >> On Fri, Dec 12, 2014 at 1:11 AM, Luis Periquito <periqu...@gmail.com> wrote:
> >> > Hi Greg,
> >> >
> >> > thanks for your help. It's always highly appreciated. :)
> >> >
> >> > On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum <g...@gregs42.com> wrote:
> >> >>
> >> >> On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito <periqu...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I've stopped OSD.16, removed the PG from the local filesystem and
> >> >> > started the OSD again. After ceph rebuilt the PG in the removed OSD
> >> >> > I ran a deep-scrub and the PG is still inconsistent.
> >> >>
> >> >> What led you to remove it from osd 16? Is that the one hosting the log
> >> >> you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was
> >> >> it the primary?
> >> >
> >> > OSD 16 is both the primary for this PG and the one that has the snipped log.
> >> > The other 3 OSDs don't have any mention of this PG in their logs, just some
> >> > messages about slow requests and the backfill when I removed the object.
> >> > Actually it came from OSD.6 - currently we don't have OSD.3.
> >> >
> >> > This is the output of the pg dump for this PG:
> >> > 9.180  25614  0  0  0  23306482348  3001  3001  active+clean+inconsistent
> >> > 2014-12-10 17:29:01.937929  40242'1108124  40242:23305321  [16,10,27,6]  16
> >> > [16,10,27,6]  16  40242'1071363  2014-12-10 17:29:01.937881
> >> > 40242'1071363  2014-12-10 17:29:01.937881
> >> >
> >> >>
> >> >> Anyway, the message means that shard 6 (which I think is the seventh
> >> >> OSD in the list) of PG 9.180 is missing a bunch of xattrs on object
> >> >> 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it
> >> >> didn't crash if it's missing the "_" attr....
> >> >> -Greg
> >> >
> >> > Any idea on how to fix it?
> >> >
> >> >> > I'm running out of ideas on trying to solve this. Does this mean that
> >> >> > all copies of the object should also be inconsistent? Should I just
> >> >> > try to figure out which object/bucket this belongs to and delete it /
> >> >> > copy it again to the ceph cluster?
> >> >> >
> >> >> > Also, do you know what the error message means? Is it just some sort
> >> >> > of metadata for this object that isn't correct, not the object itself?
> >> >> >
> >> >> > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito <periqu...@gmail.com> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> In the last few days this PG (pool is .rgw.buckets) has been in error
> >> >> >> after running the scrub process.
> >> >> >>
> >> >> >> After getting the error, and trying to see what may be the issue (and
> >> >> >> finding none), I've just issued a ceph repair followed by a ceph
> >> >> >> deep-scrub. However it doesn't seem to have fixed the issue and it
> >> >> >> still remains.
> >> >> >>
> >> >> >> The relevant log from the OSD is as follows.
> >> >> >>
> >> >> >> 2014-12-10 09:38:09.348110 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
> >> >> >> 2014-12-10 09:38:09.348116 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
> >> >> >> 2014-12-10 10:13:15.922065 7f8f618be700  0 log [INF] : 9.180 repair ok, 0 fixed
> >> >> >> 2014-12-10 10:55:27.556358 7f8f618be700  0 log [ERR] : 9.180 shard 6: soid 370cbf80/29145.4_xxx/head//9
> >> >> >> missing attr _, missing attr _user.rgw.acl, missing attr _user.rgw.content_type,
> >> >> >> missing attr _user.rgw.etag, missing attr _user.rgw.idtag, missing attr _user.rgw.manifest,
> >> >> >> missing attr _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat,
> >> >> >> missing attr snapset
> >> >> >> 2014-12-10 10:56:50.597952 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
> >> >> >> 2014-12-10 10:56:50.597957 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
> >> >> >>
> >> >> >> I'm running version firefly 0.80.7.
> >> >> >
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
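PS - for anyone else chasing a similar "missing attr" report: before deleting anything it is worth confirming on disk which replica actually lost its xattrs. The paths below assume the default filestore layout and are not taken from this cluster; the on-disk filename is an escaped form of the object name, and on some filesystems or configurations the attrs may live in the omap rather than in file xattrs, in which case getfattr will not show them all:

    # on each OSD in the acting set ([16,10,27,6] here), locate the on-disk
    # file for the object inside the PG directory; the filename is an escaped
    # form of "29145.4_xxx", but the "29145.4" part survives literally
    sudo find /var/lib/ceph/osd/ceph-6/current/9.180_head -name '*29145.4*'

    # dump its extended attributes; filestore prefixes the rados attrs with
    # "user.ceph.", so a healthy copy should list user.ceph._,
    # user.ceph._user.rgw.acl, user.ceph.snapset and friends
    # (<escaped-object-file> is a placeholder for the path found above)
    sudo getfattr -d -m '.*' /var/lib/ceph/osd/ceph-6/current/9.180_head/<escaped-object-file>

A copy whose listing comes back empty (or missing user.ceph._) is the one the scrub is complaining about.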
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com