If you've really extracted all of the PGs from the down OSDs, you should be able to inject them into new OSDs and continue on from there with just rebalancing activity. The use of mark_unfound_lost revert complicates matters a bit, and I'm not sure what the behavior would be if you put the extracted copies in place now. Really, though, you'll need to find out why the incomplete PGs are marked that way; it might be that some of the live OSDs are aware of writes they missed which they think ought to still be in the cluster somewhere. Also, which version are you running? It might be that some of the PGs are failing to map onto enough OSDs, since a quarter of them are dead.
-Greg
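Roughly, the import side looks something like this with ceph-objectstore-tool (a sketch, not verified against your setup; the OSD ID, PG ID, and export file path are placeholders, and the destination OSD has to be stopped while you run it):

    # Stop the destination OSD so its object store isn't in use
    systemctl stop ceph-osd@40

    # Inject a PG that was previously exported from one of the dead OSDs
    # (OSD 40, pg 1.2a3 and the file path are placeholders)
    ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-40 \
        --op import --file /root/pg-exports/1.2a3.export

    systemctl start ceph-osd@40

    # Then look at why a PG is still incomplete; the recovery_state section
    # of the output generally shows what peering is blocked on
    ceph pg 1.2a3 query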
On Tue, Oct 2, 2018 at 11:14 AM Dylan Jones <dylanjones2...@gmail.com> wrote:
> Our ceph cluster stopped responding to requests two weeks ago, and I have
> been trying to fix it since then. After a semi-hard reboot, we had 11-ish
> OSDs "fail" spread across two hosts, with the pool size set to two. I was
> able to extract a copy of every PG that resided solely on the nonfunctional
> OSDs, but the cluster is refusing to let me read from it. I marked all the
> "failed" OSDs as lost and used ceph pg $pg mark_unfound_lost revert for
> all the PGs reporting unfound objects, but that didn't help either.
> ddrescue also breaks, because ceph will never admit that it has lost data
> and just blocks forever instead of returning a read error.
> Is there any way to tell ceph to cut its losses and just let me access my
> data again?
>
>   cluster:
>     id:     313be153-5e8a-4275-b3aa-caea1ce7bce2
>     health: HEALTH_ERR
>             noout,nobackfill,norebalance flag(s) set
>             2720183/6369036 objects misplaced (42.709%)
>             9/3184518 objects unfound (0.000%)
>             39 scrub errors
>             Reduced data availability: 131 pgs inactive, 16 pgs down, 114 pgs incomplete
>             Possible data damage: 7 pgs recovery_unfound, 1 pg inconsistent, 7 pgs snaptrim_error
>             Degraded data redundancy: 1710175/6369036 objects degraded (26.851%), 1069 pgs degraded, 1069 pgs undersized
>             Degraded data redundancy (low space): 82 pgs backfill_toofull
>
>   services:
>     mon: 1 daemons, quorum waitaha
>     mgr: waitaha(active)
>     osd: 43 osds: 34 up, 34 in; 1786 remapped pgs
>          flags noout,nobackfill,norebalance
>
>   data:
>     pools:   2 pools, 2048 pgs
>     objects: 3.18 M objects, 8.4 TiB
>     usage:   21 TiB used, 60 TiB / 82 TiB avail
>     pgs:     0.049% pgs unknown
>              6.348% pgs not active
>              1710175/6369036 objects degraded (26.851%)
>              2720183/6369036 objects misplaced (42.709%)
>              9/3184518 objects unfound (0.000%)
>              987  active+undersized+degraded+remapped+backfill_wait
>              695  active+remapped+backfill_wait
>              124  active+clean
>              114  incomplete
>              62   active+undersized+degraded+remapped+backfill_wait+backfill_toofull
>              20   active+remapped+backfill_wait+backfill_toofull
>              16   down
>              12   active+undersized+degraded+remapped+backfilling
>              7    active+recovery_unfound+undersized+degraded+remapped
>              7    active+clean+snaptrim_error
>              2    active+remapped+backfilling
>              1    unknown
>              1    active+undersized+degraded+remapped+inconsistent+backfill_wait
>
> Thanks,
> Dylan
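For the version and mapping questions above, something like the following would tell you (again a sketch; the PG ID is a placeholder, and ceph versions is only available on Luminous and newer):

    # Report which release each daemon is running (Luminous or newer)
    ceph versions

    # Show which OSDs the PG currently maps to (up set) versus which are
    # actually serving it (acting set); the PG ID is a placeholder
    ceph pg map 1.2a3

    # Pull out the incomplete / unfound details from the health report
    ceph health detail | grep -Ei 'incomplete|unfound'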