Just to update this issue: I stopped OSD.6, removed the PG from its disk, and restarted the OSD. Ceph rebuilt the object and the cluster went back to HEALTH_OK.
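For the archives, this is roughly what that boils down to on the command line. The filestore path and the init invocation below are assumptions based on a default firefly-era install (not copied from this cluster), so adjust the OSD id, PG id and paths for your own setup, and only do this while all the other replicas are healthy:

    # keep CRUSH from rebalancing while the OSD is briefly down
    ceph osd set noout

    # stop the OSD that holds the bad copy (sysvinit-style invocation assumed)
    sudo service ceph stop osd.6

    # move the PG's directory out of the way instead of deleting it outright
    # (default filestore layout assumed: /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head)
    sudo mv /var/lib/ceph/osd/ceph-6/current/9.180_head /root/9.180_head.bak

    # bring the OSD back; the PG should backfill from the remaining replicas
    sudo service ceph start osd.6
    ceph osd unset noout

    # once backfill completes, repair and re-verify the PG
    ceph pg repair 9.180
    ceph pg deep-scrub 9.180
    ceph health detail

Keeping the moved directory around until the deep-scrub comes back clean makes it easy to roll back if anything goes wrong.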
During the weekend the disk for OSD.6 started giving SMART errors and will be replaced.

Thanks for your help Greg. I've opened a bug report in the tracker.

On Fri, Dec 12, 2014 at 9:53 PM, Gregory Farnum <g...@gregs42.com> wrote:
>
> [Re-adding the list]
>
> Yeah, so "shard 6" means that it's osd.6 which has the bad data.
> Apparently pg repair doesn't recover from this class of failures; if
> you could file a bug that would be appreciated.
> But anyway, if you delete the object in question from OSD 6 and run a
> repair on the pg again it should recover just fine.
> -Greg
>
> On Fri, Dec 12, 2014 at 1:45 PM, Luis Periquito <periqu...@gmail.com> wrote:
> > Running firefly 0.80.7 with replicated pools, with 4 copies.
> >
> > On 12 Dec 2014 19:20, "Gregory Farnum" <g...@gregs42.com> wrote:
> >>
> >> What version of Ceph are you running? Is this a replicated or
> >> erasure-coded pool?
> >>
> >> On Fri, Dec 12, 2014 at 1:11 AM, Luis Periquito <periqu...@gmail.com> wrote:
> >> > Hi Greg,
> >> >
> >> > thanks for your help. It's always highly appreciated. :)
> >> >
> >> > On Thu, Dec 11, 2014 at 6:41 PM, Gregory Farnum <g...@gregs42.com> wrote:
> >> >>
> >> >> On Thu, Dec 11, 2014 at 2:57 AM, Luis Periquito <periqu...@gmail.com> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > I've stopped OSD.16, removed the PG from the local filesystem and
> >> >> > started the OSD again. After ceph rebuilt the PG in the removed OSD
> >> >> > I ran a deep-scrub and the PG is still inconsistent.
> >> >>
> >> >> What led you to remove it from osd 16? Is that the one hosting the log
> >> >> you snipped from? Is osd 16 the one hosting shard 6 of that PG, or was
> >> >> it the primary?
> >> >
> >> > OSD 16 is both the primary for this PG and the one that has the snipped log.
> >> > The other 3 OSDs don't have any mention of this PG in their logs, just some
> >> > messages about slow requests and the backfill when I removed the object.
> >> > Actually it came from OSD.6 - currently we don't have OSD.3.
> >> >
> >> > This is the output of the pg dump for this PG:
> >> > 9.180  25614  0  0  0  23306482348  3001  3001  active+clean+inconsistent
> >> > 2014-12-10 17:29:01.937929  40242'1108124  40242:23305321  [16,10,27,6]  16
> >> > [16,10,27,6]  16  40242'1071363  2014-12-10 17:29:01.937881
> >> > 40242'1071363  2014-12-10 17:29:01.937881
> >> >
> >> >>
> >> >> Anyway, the message means that shard 6 (which I think is the seventh
> >> >> OSD in the list) of PG 9.180 is missing a bunch of xattrs on object
> >> >> 370cbf80/29145.4_xxx/head//9. I'm actually a little surprised it
> >> >> didn't crash if it's missing the "_" attr....
> >> >> -Greg
> >> >
> >> > Any idea on how to fix it?
> >> >
> >> >> > I'm running out of ideas on trying to solve this. Does this mean that
> >> >> > all copies of the object should also be inconsistent? Should I just
> >> >> > try to figure out which object/bucket this belongs to and delete it /
> >> >> > copy it again to the ceph cluster?
> >> >> >
> >> >> > Also, do you know what the error message means? Is it just some sort
> >> >> > of metadata for this object that isn't correct, not the object itself?
> >> >> >
> >> >> > On Wed, Dec 10, 2014 at 11:11 AM, Luis Periquito <periqu...@gmail.com> wrote:
> >> >> >>
> >> >> >> Hi,
> >> >> >>
> >> >> >> In the last few days this PG (pool is .rgw.buckets) has been in error
> >> >> >> after running the scrub process.
> >> >> >>
> >> >> >> After getting the error, and trying to see what may be the issue (and
> >> >> >> finding none), I've just issued a ceph repair followed by a ceph
> >> >> >> deep-scrub. However it doesn't seem to have fixed the issue and it
> >> >> >> still remains.
> >> >> >>
> >> >> >> The relevant log from the OSD is as follows.
> >> >> >>
> >> >> >> 2014-12-10 09:38:09.348110 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
> >> >> >> 2014-12-10 09:38:09.348116 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
> >> >> >> 2014-12-10 10:13:15.922065 7f8f618be700  0 log [INF] : 9.180 repair ok, 0 fixed
> >> >> >> 2014-12-10 10:55:27.556358 7f8f618be700  0 log [ERR] : 9.180 shard 6: soid 370cbf80/29145.4_xxx/head//9
> >> >> >> missing attr _, missing attr _user.rgw.acl, missing attr _user.rgw.content_type,
> >> >> >> missing attr _user.rgw.etag, missing attr _user.rgw.idtag, missing attr _user.rgw.manifest,
> >> >> >> missing attr _user.rgw.x-amz-meta-md5sum, missing attr _user.rgw.x-amz-meta-stat,
> >> >> >> missing attr snapset
> >> >> >> 2014-12-10 10:56:50.597952 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 0 missing, 1 inconsistent objects
> >> >> >> 2014-12-10 10:56:50.597957 7f8f618be700  0 log [ERR] : 9.180 deep-scrub 1 errors
> >> >> >>
> >> >> >> I'm running version firefly 0.80.7.
> >> >> >
> >> >> > _______________________________________________
> >> >> > ceph-users mailing list
> >> >> > ceph-users@lists.ceph.com
> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
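PS - for anyone else chasing a similar "missing attr" report: before deleting anything it is worth confirming on disk which replica actually lost its xattrs. The paths below assume the default filestore layout and are not taken from this cluster; the on-disk filename is an escaped form of the object name, and on some filesystems or configurations the attrs may live in the omap rather than in file xattrs, in which case getfattr will not show them all:

    # on each OSD in the acting set ([16,10,27,6] here), locate the on-disk
    # file for the object inside the PG directory; the filename is an escaped
    # form of "29145.4_xxx", but the "29145.4" part survives literally
    sudo find /var/lib/ceph/osd/ceph-6/current/9.180_head -name '*29145.4*'

    # dump its extended attributes; filestore prefixes the rados attrs with
    # "user.ceph.", so a healthy copy should list user.ceph._,
    # user.ceph._user.rgw.acl, user.ceph.snapset and friends
    # (<escaped-object-file> is a placeholder for the path found above)
    sudo getfattr -d -m '.*' /var/lib/ceph/osd/ceph-6/current/9.180_head/<escaped-object-file>

A copy whose listing comes back empty (or missing user.ceph._) is the one the scrub is complaining about.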
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com