Our Ceph cluster stopped responding to requests two weeks ago, and I have been trying to fix it ever since. After a semi-hard reboot, we had roughly 11 OSDs "fail", spread across two hosts, with the pool size set to two. I was able to extract a copy of every PG that resided solely on the nonfunctional OSDs, but the cluster still refuses to let me read the data. I marked all the "failed" OSDs as lost and ran `ceph pg $pg mark_unfound_lost revert` for every PG reporting unfound objects, but that didn't help either. ddrescue also fails, because Ceph never admits that it has lost data; reads against the missing objects just block forever instead of returning a read error. Is there any way to tell Ceph to cut its losses and just let me access my data again?
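For context, this is roughly the sequence I ran (OSD id 7 and PG 1.2f below are placeholders, not the actual ids from our cluster):

    # list the PGs currently reporting unfound objects
    ceph health detail | grep unfound

    # mark each dead OSD as lost (repeated for all ~11 failed OSDs)
    ceph osd lost 7 --yes-i-really-mean-it

    # then, for every PG still reporting unfound objects:
    ceph pg 1.2f list_unfound
    ceph pg 1.2f mark_unfound_lost revert

Current status output below.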
  cluster:
    id:     313be153-5e8a-4275-b3aa-caea1ce7bce2
    health: HEALTH_ERR
            noout,nobackfill,norebalance flag(s) set
            2720183/6369036 objects misplaced (42.709%)
            9/3184518 objects unfound (0.000%)
            39 scrub errors
            Reduced data availability: 131 pgs inactive, 16 pgs down, 114 pgs incomplete
            Possible data damage: 7 pgs recovery_unfound, 1 pg inconsistent, 7 pgs snaptrim_error
            Degraded data redundancy: 1710175/6369036 objects degraded (26.851%), 1069 pgs degraded, 1069 pgs undersized
            Degraded data redundancy (low space): 82 pgs backfill_toofull

  services:
    mon: 1 daemons, quorum waitaha
    mgr: waitaha(active)
    osd: 43 osds: 34 up, 34 in; 1786 remapped pgs
         flags noout,nobackfill,norebalance

  data:
    pools:   2 pools, 2048 pgs
    objects: 3.18 M objects, 8.4 TiB
    usage:   21 TiB used, 60 TiB / 82 TiB avail
    pgs:     0.049% pgs unknown
             6.348% pgs not active
             1710175/6369036 objects degraded (26.851%)
             2720183/6369036 objects misplaced (42.709%)
             9/3184518 objects unfound (0.000%)
             987 active+undersized+degraded+remapped+backfill_wait
             695 active+remapped+backfill_wait
             124 active+clean
             114 incomplete
             62  active+undersized+degraded+remapped+backfill_wait+backfill_toofull
             20  active+remapped+backfill_wait+backfill_toofull
             16  down
             12  active+undersized+degraded+remapped+backfilling
             7   active+recovery_unfound+undersized+degraded+remapped
             7   active+clean+snaptrim_error
             2   active+remapped+backfilling
             1   unknown
             1   active+undersized+degraded+remapped+inconsistent+backfill_wait

Thanks,
Dylan