Re: [ceph-users] objects degraded higher than 100%

Gregory Farnum Thu, 12 Oct 2017 10:23:29 -0700

On Thu, Oct 12, 2017 at 3:50 AM Florian Haas <flor...@hastexo.com> wrote:


> On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann <andr...@mx20.org>
> wrote:
> > Hi,
> >
> > how could this happen:
> >
> >         pgs: 197528/1524 objects degraded (12961.155%)
> >
> > I did some heavy failover tests, but a value higher than 100% looks strange
> > (ceph version 12.2.0). Recovery is quite slow.
> >
> >   cluster:
> >     health: HEALTH_WARN
> >             3/1524 objects misplaced (0.197%)
> >             Degraded data redundancy: 197528/1524 objects degraded (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
> >
> >   data:
> >     pools:   1 pools, 2048 pgs
> >     objects: 508 objects, 1467 MB
> >     usage:   127 GB used, 35639 GB / 35766 GB avail
> >     pgs:     197528/1524 objects degraded (12961.155%)
> >              3/1524 objects misplaced (0.197%)
> >              1042 active+recovery_wait+degraded
> >              991  active+clean
> >              8    active+recovering+degraded
> >              3    active+undersized+degraded+remapped+backfill_wait
> >              2    active+recovery_wait+degraded+remapped
> >              2    active+remapped+backfill_wait
> >
> >   io:
> >     recovery: 340 kB/s, 80 objects/s
>
> Did you ever get to the bottom of this? I'm seeing something very
> similar on a 12.2.1 reference system:
>
> https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c
>
> I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":
> https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85
>
> The odd thing in there is that the "bench" pool was empty when the
> recovery started (that pool had been wiped with "rados cleanup"), so
> the number of objects deemed to be missing from the primary really
> ought to be zero.
>
> It seems like it's considering these deleted objects to still require
> replication, but that sounds rather far-fetched, to be honest.
>

Actually, that makes some sense. This cluster had an OSD down while (some
of) the deletes were happening?

I haven't dug through the code, but I bet it is counting those as
degraded objects because the out-of-date OSD knows it doesn't have the
latest versions of them! :)
-Greg
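
For reference, the percentage in the status output above can be reproduced
with simple arithmetic. The sketch below is only an illustration, not Ceph's
actual accounting code: it assumes the denominator is the object count times
a replica count of 3 (508 x 3 = 1524, which matches the output but is
inferred, not confirmed), and the function name is made up for this example.

```python
def percent_of_copies(count: int, objects: int, replicas: int) -> float:
    """Return count as a percentage of the total number of object copies.

    Assumption: total copies = objects * replicas. This matches the
    1524 shown in the status output if the pool has 3 replicas.
    """
    total_copies = objects * replicas
    return 100.0 * count / total_copies


# Numbers taken from the quoted "ceph status" output.
objects = 508
replicas = 3  # assumed, inferred from 1524 / 508

degraded = percent_of_copies(197528, objects, replicas)
misplaced = percent_of_copies(3, objects, replicas)

print(f"degraded:  {degraded:.3f}%")
print(f"misplaced: {misplaced:.3f}%")
```

The ratio can only exceed 100% if the numerator counts entries that are not
among the 1524 current copies, which fits the guess that deletes missed by
the down OSD are being counted as degraded objects.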

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
