Hi Cephers,

I recently had a power problem and the entire cluster was brought down,
came up, went down, and came up again. Afterward, 3 OSDs were mostly dead
(HDD failures). Luckily (I think) the drives were alive enough that I could
copy the data off and leave the journal alone.

Since my pool "data" size is 3... of course a couple of placement groups
were only on those three drives.

Now I've added 4 new OSDs, and everything has recovered, except pg 0.f3.
When I query the pg, I see the cluster is looking for OSD 14 or 23 because
one of them maybe_went_rw. (OSDs 5, 14, and 23 are now kaput and have been
marked lost with "ceph osd lost --yes-i-really-mean-it".)

Ceph indicates OSD 29 is now the primary for pg 0.f3. I copied all the data
to the appropriate PG directory and started OSD.29 again.
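
Roughly, what I did looked like this (the rescue path below is just a
placeholder for wherever I parked the copied data; /var/lib/ceph/osd/ceph-29
is the default filestore layout):

  service ceph stop osd.29
  # drop the rescued PG directory into osd.29's filestore
  rsync -a /mnt/rescue/osd.29/current/0.f3_head/ \
        /var/lib/ceph/osd/ceph-29/current/0.f3_head/
  service ceph start osd.29

Here is where my question comes in: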

How do I convince the cluster that it's okay to bring 0.f3 'up' and
backfill to the other OSDs from 29? (I could even manually backfill 15 and
22, but I suspect the cluster will still think there's a problem)

'ceph health detail' shows this about 0.f3:
 pg 0.f3 is incomplete, acting [29,22,15] (reducing pool data min_size from
2 may help; search ceph.com/docs for 'incomplete')
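
(For what it's worth, I take it the min_size suggestion would amount to
something like

  ceph osd pool set data min_size 1

but I'm not convinced that addresses getting the cluster to accept OSD.29's
copy of the PG.)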

Thanks in advance!
-Aaron