Congrats Chris and nice "save" on that RBD!

--
Paul 

> On Apr 9, 2015, at 11:11 AM, Chris Kitzmiller <ckitzmil...@hampshire.edu> 
> wrote:
> 
> Success! Hopefully my notes from the process will help:
> 
> In the event of multiple disk failures the cluster could lose PGs. Should 
> this occur, it is best to attempt to restart the OSD process and have the 
> drive marked as up+out. Marking the drive out will cause data to flow off 
> the drive to elsewhere in the cluster. If the ceph-osd process is unable 
> to keep running, you can try using the ceph_objectstore_tool program to 
> extract just the damaged PGs and import them into a working OSD.
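> 
> For example, a rough sketch of that first approach (assuming upstart-managed 
> OSDs as elsewhere in these notes, with osd.15 standing in for the failed 
> drive; not part of the original recovery, just the happy path):
> start ceph-osd id=15    # try to bring the OSD process back up
> ceph osd out 15         # mark it out so data drains off the drive
> ceph -w                 # watch recovery/backfill progress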
> 
> Fixing Journals
> In this particular scenario things were complicated by the fact that 
> ceph_objectstore_tool came out in Giant while we were running Firefly. Since 
> we didn't want to upgrade the cluster in a degraded state, the OSD drives 
> had to be moved to a different physical machine for repair. This added a 
> number of steps related to the journals, but it wasn't a big deal. That 
> process looks like:
> 
> On Storage1:
> stop ceph-osd id=15
> ceph-osd -i 15 --flush-journal
> ls -l /var/lib/ceph/osd/ceph-15/journal
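> 
> (A small aside, not in the original steps: readlink prints the symlink 
> target directly, which makes the journal UUID a bit easier to copy down.)
> readlink /var/lib/ceph/osd/ceph-15/journal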
> 
> Note the journal device UUID then pull the disk and move it to Ithome:
> rm /var/lib/ceph/osd/ceph-15/journal
> ceph-osd -i 15 --mkjournal
> 
> That creates a colocated journal to use during the ceph_objectstore_tool 
> commands. Once done:
> ceph-osd -i 15 --flush-journal
> rm /var/lib/ceph/osd/ceph-15/journal
> 
> Pull the disk and bring it back to Storage1. Then:
> ln -s /dev/disk/by-partuuid/b4f8d911-5ac9-4bf0-a06a-b8492e25a00f 
> /var/lib/ceph/osd/ceph-15/journal
> ceph-osd -i 15 --mkjournal
> start ceph-osd id=15
> 
> None of this journal shuffling will be needed once the cluster is running 
> Hammer, because ceph_objectstore_tool will be available on the local machine 
> and the journals can be kept in place throughout the process.
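> 
> On Hammer the export could then presumably be run in place, something like 
> this (a sketch only, assuming the tool keeps the same name and flags and 
> that the journal symlink is left alone):
> stop ceph-osd id=15
> ceph_objectstore_tool --data /var/lib/ceph/osd/ceph-15 --journal 
> /var/lib/ceph/osd/ceph-15/journal --op export --pgid 3.c7 --file ~/3.c7.export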
> 
> 
> Recovery Process
> We were missing two PGs, 3.c7 and 3.102. These PGs were hosted on OSD.0 and 
> OSD.15, the two disks that failed out of Storage1. The disk for OSD.0 seemed 
> to be a total loss, while the disk for OSD.15 was somewhat more cooperative 
> but in no shape to be up and running in the cluster. I took 
> the dying OSD.15 drive and placed it into a new physical machine with a fresh 
> install of Ceph Giant. Using Giant's ceph_objectstore_tool I was able to 
> extract the PGs with a command like:
> for i in 3.c7 3.102 ; do ceph_objectstore_tool --data 
> /var/lib/ceph/osd/ceph-15 --journal /var/lib/ceph/osd/ceph-15/journal --op 
> export --pgid $i --file ~/${i}.export ; done
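> 
> Before moving on it's worth a quick sanity check that both export files 
> actually landed and aren't empty (my own aside, not from the original run):
> ls -lh ~/3.c7.export ~/3.102.export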
> 
> Once both PGs were successfully exported I attempted to import them into a 
> new temporary OSD following instructions from here. For some reason that 
> didn't work: the OSD was up+in but wasn't backfilling the PGs into the 
> cluster. If you find yourself in this situation I would still try that 
> first, just in case it provides a cleaner path.
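> 
> If you do try the temporary-OSD route, a quick way to see whether the 
> imports are actually being picked up is something like this (a sketch; the 
> exact fields in the output vary a bit between releases):
> ceph pg 3.c7 query | less    # look at "state" and "recovery_state"
> ceph -s                      # overall backfill/recovery activity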
> Since the above didn't work, and we were looking at the possibility of 
> losing the RBD volume (or perhaps worse, fruitlessly fscking 35TB), I took 
> what I might describe as heroic measures:
> 
> Running
> ceph pg dump | grep incomplete
> 
> 3.c7   0  0  0  0  0  0  0  incomplete  2015-04-02  20:49:32.968841  0'0  
> 15730:17  [15,0]  15  [15,0]  15  13985'54076  2015-03-31  19:14:22.721695  
> 13985'54076  2015-03-31  19:14:22.721695
> 3.102  0  0  0  0  0  0  0  incomplete  2015-04-02  20:49:32.529594  0'0  
> 15730:21  [0,15]  0   [0,15]  0   13985'53107  2015-03-29  21:17:15.568125  
> 13985'49195  2015-03-24  18:38:08.244769
> 
> Then I stopped all OSDs, which blocked all I/O to the cluster, with:
> stop ceph-osd-all
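> 
> Before pulling any disks it's worth confirming that the cluster really does 
> see every OSD as down (a sanity check of my own; the monitors stay up, so 
> these still answer):
> ceph osd stat    # should report 0 osds up
> ceph osd tree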
> 
> Then I looked for all copies of the PG on all OSDs with:
> for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name 
> "$i" ; done | sort -V
> 
> /var/lib/ceph/osd/ceph-0/current/3.c7_head
> /var/lib/ceph/osd/ceph-0/current/3.102_head
> /var/lib/ceph/osd/ceph-3/current/3.c7_head
> /var/lib/ceph/osd/ceph-13/current/3.102_head
> /var/lib/ceph/osd/ceph-15/current/3.c7_head
> /var/lib/ceph/osd/ceph-15/current/3.102_head
> 
> Then I flushed the journals for all of those OSDs with:
> for i in 0 3 13 15 ; do ceph-osd -i $i --flush-journal ; done
> 
> Then I removed all of those drives and moved them (using Journal Fixing 
> above) to Ithome where I used ceph_objectstore_tool to remove all traces of 
> 3.102 and 3.c7:
> for i in 0 3 13 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data 
> /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op 
> remove --pgid $j ; done ; done
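> 
> At this point, re-running the earlier find should come back empty, 
> confirming that no trace of either PG is left on those four OSDs (just a 
> sanity check; it should print nothing):
> for i in 3.c7 3.102 ; do find /var/lib/ceph/osd/ -maxdepth 3 -type d -name 
> "$i" ; done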
> 
> Then I imported the PGs onto OSD.0 and OSD.15 with:
> for i in 0 15 ; do for j in 3.c7 3.102 ; do ceph_objectstore_tool --data 
> /var/lib/ceph/osd/ceph-$i --journal /var/lib/ceph/osd/ceph-$i/journal --op 
> import --file ~/${j}.export ; done ; done
> for i in 0 15 ; do ceph-osd -i $i --flush-journal && rm 
> /var/lib/ceph/osd/ceph-$i/journal ; done
> 
> Then I moved the disks back to Storage1 and started them all back up again. 
> I think that this should have worked, but in this case OSD.0 didn't start up 
> for some reason. I initially thought that wouldn't matter, since OSD.15 did 
> start and so we should have had everything, but a ceph pg query of the PGs 
> showed something like:
> "blocked": "peering is blocked due to down osds",
> "down_osds_we_would_probe": [0],
> "peering_blocked_by": [{
>     "osd": 0,
>     "current_lost_at": 0,
>     "comment": "starting or marking this osd lost may let us proceed"
> }]
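> 
> That comment points at the two standard ways forward: marking osd.0 lost, or 
> removing it from the cluster entirely. Roughly (the usual commands, not 
> necessarily verbatim what I ran):
> ceph osd lost 0 --yes-i-really-mean-it
> # or, to remove it outright:
> ceph osd out 0
> ceph osd crush remove osd.0
> ceph auth del osd.0
> ceph osd rm 0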
> 
> So I then removed OSD.0 from the cluster and everything came back to life. 
> Thanks to Jean-Charles Lopez, Craig Lewis, and Paul Evans!