Congrats Chris and nice "save" on that RBD! -- Paul
> On Apr 9, 2015, at 11:11 AM, Chris Kitzmiller <ckitzmil...@hampshire.edu> wrote:
>
> Success! Hopefully my notes from the process will help:
>
> In the event of multiple disk failures the cluster could lose PGs. Should this occur, it is best to attempt to restart the OSD process and have the drive marked as up+out. Marking the drive as out will cause data to flow off the drive to elsewhere in the cluster. If the ceph-osd process is unable to keep running, you could try using the ceph_objectstore_tool program to extract just the damaged PGs and import them into working OSDs.
>
> Fixing Journals
> In this particular scenario things were complicated by the fact that ceph_objectstore_tool came out in Giant but we were running Firefly. Since we didn't want to upgrade the cluster in a degraded state, the OSD drives had to be moved to a different physical machine for repair. This added a number of steps related to the journals, but it wasn't a big deal. That process looks like:
>
> On Storage1:
> stop ceph-osd id=15
> ceph-osd -i 15 --flush-journal
> ls -l /var/lib/ceph/osd/ceph-15/journal
>
> Note the journal device UUID, then pull the disk and move it to Ithome:
> rm /var/lib/ceph/osd/ceph-15/journal
> ceph-osd -i 15 --mkjournal
>
> That creates a colocated journal to use during the ceph_objectstore_tool commands. Once done:
> ceph-osd -i 15 --flush-journal
> rm /var/lib/ceph/osd/ceph-15/journal
>
> Pull the disk and bring it back to Storage1. Then:
> ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f /var/lib/ceph/osd/ceph-15/journal
> ceph-osd -i 15 --mkjournal
> start ceph-osd id=15
>
> None of this will be needed once the cluster is running Hammer, because then a local copy of ceph_objectstore_tool will be available and the journals can be kept in place throughout the process.
>
> Recovery Process
> We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and OSD.15, the two disks which failed out of Storage1. The disk for OSD.0 seemed to be a total loss, while the disk for OSD.15 was somewhat more cooperative but not in any shape to be up and running in the cluster. I took the dying OSD.15 drive and placed it into a new physical machine with a fresh install of Ceph Giant. Using Giant's ceph_objectstore_tool I was able to extract the PGs with a command like:
> for i in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-15 --journal-path /var/lib/ceph/osd/ceph-15/journal --op export --pgid $i --file ~/${i}.export ; done
>
> Once both PGs were successfully exported I attempted to import them into a new temporary OSD, following instructions from here. For some reason that didn't work. The OSD was up+in but wasn't backfilling the PGs into the cluster. If you find yourself in this situation I would try that first, just in case it provides a cleaner process.
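A small aside for anyone retracing these steps later: before exporting (and certainly before removing anything) it's worth confirming which PGs the pulled drive actually holds and that the export files look sane. A minimal sketch, assuming the OSD is stopped and the same paths as above; the --op list-pgs operation may not be in every ceph_objectstore_tool build, so check yours first:

# list every PG present in the object store on the pulled drive, filtered to the two we care about
ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-15 --journal-path /var/lib/ceph/osd/ceph-15/journal --op list-pgs | grep -E '^3\.(c7|102)$'

# after exporting, sanity-check that the dump files exist and are non-empty
for i in 3.c7 3.102 ; do ls -lh ~/${i}.export ; done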
> Considering the above didn't work, and we were looking at the possibility of losing the RBD volume (or perhaps worse, the potential of fruitlessly fscking 35TB), I took what I might describe as heroic measures.
>
> Running:
> ceph pg dump | grep incomplete
>
> 3.c7  0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.968841 0'0 15730:17 [15,0] 15 [15,0] 15 13985'54076 2015-03-31 19:14:22.721695 13985'54076 2015-03-31 19:14:22.721695
> 3.102 0 0 0 0 0 0 0 incomplete 2015-04-02 20:49:32.529594 0'0 15730:21 [0,15] 0  [0,15] 0  13985'53107 2015-03-29 21:17:15.568125 13985'49195 2015-03-24 18:38:08.244769
>
> Then I stopped all OSDs, which blocked all I/O to the cluster, with:
> stop ceph-osd-all
>
> Then I looked for all copies of the PGs on all OSDs with:
> for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name "${i}_head" ; done | sort -V
>
> /var/lib/ceph/osd/ceph-0/current/3.c7_head
> /var/lib/ceph/osd/ceph-0/current/3.102_head
> /var/lib/ceph/osd/ceph-3/current/3.c7_head
> /var/lib/ceph/osd/ceph-13/current/3.102_head
> /var/lib/ceph/osd/ceph-15/current/3.c7_head
> /var/lib/ceph/osd/ceph-15/current/3.102_head
>
> Then I flushed the journals for all of those OSDs with:
> for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done
>
> Then I removed all of those drives and moved them (using Journal Fixing above) to Ithome, where I used ceph_objectstore_tool to remove all traces of 3.102 and 3.c7:
> for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-$i --journal-path /var/lib/ceph/osd/ceph-$i/journal --op remove --pgid $j ; done ; done
>
> Then I imported the PGs onto OSD.0 and OSD.15 with:
> for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-$i --journal-path /var/lib/ceph/osd/ceph-$i/journal --op import --file ~/${j}.export ; done ; done
> for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm /var/lib/ceph/osd/ceph-$i/journal ; done
>
> Then I moved the disks back to Storage1 and started them all back up again. I think this should have worked, but in this case OSD.0 didn't start up for some reason. I initially thought that wouldn't matter, because OSD.15 did start and so we should have had everything, but a ceph pg query of the PGs showed something like:
>
> "blocked": "peering is blocked due to down osds",
> "down_osds_we_would_probe": [0],
> "peering_blocked_by": [{
>     "osd": 0,
>     "current_lost_at": 0,
>     "comment": "starting or marking this osd lost may let us proceed"
> }]
>
> So I then removed OSD.0 from the cluster and everything came back to life.
>
> Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!
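Since that last step is only described in passing above: "removed OSD.0 from the cluster" generally comes down to the standard removal sequence, roughly the sketch below. The exact commands Chris ran aren't in his notes, so treat this as the usual recipe rather than a transcript, and double-check the OSD id before running any of it:

# tell peering to stop waiting on osd.0 (the "marking this osd lost" hint from ceph pg query)
ceph osd lost 0 --yes-i-really-mean-it
# then take it out of the map entirely
stop ceph-osd id=0
ceph osd out 0
ceph osd crush remove osd.0
ceph auth del osd.0
ceph osd rm 0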