On 5/29/14 01:09, Felix Lee wrote:
Dear experts,
Recently, a disk for one of our OSDs failed and took the OSD down. After I recovered the disk and filesystem, I noticed two problems:

1. Journal corruption, which prevents the OSD from starting:



2. I guess I can use ceph-osd with the "--mkjournal" option to fix the journal corruption, but another thing bothers me: the previous OSD daemon is stuck in "D" state, so it can't be terminated. Usually, once the filesystem has recovered, a process should be able to leave D state, so I am not sure what causes this or whether I can ignore it without bad consequences.
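
To confirm whether the old daemon is still in uninterruptible sleep, the state letter can be read from /proc/<pid>/stat on Linux. The following is only a minimal sketch: the function names are made up for illustration, and the PID has to be looked up separately.

```python
def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter process state from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')' characters, so split after the LAST ')'.
    after_comm = stat_line[stat_line.rindex(")") + 1:]
    # The first field after the command name is the state letter
    # (for example R, S, D, Z).
    return after_comm.split()[0]


def read_proc_state(pid: int) -> str:
    """Read and parse the state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_proc_state(f.read())
```

Calling read_proc_state() with the PID of the old ceph-osd process returns "D" for as long as it is blocked in the kernel. A process in that state does not act on signals, including SIGKILL, until the blocking I/O completes or fails.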

In any case, I would be very grateful if you experts could shed some light on this.

Our current Ceph version is ceph-0.72.2-0.el6.x86_64.
The filesystem backend is XFS on fiber direct-attached storage.


I can't speak to the specific errors you're seeing, but it looks like you have a failing or corrupted disk.

Things I would investigate:

1. Is the disk itself failing?  If this were a SATA disk, I'd check the
   SMART stats on the disk.  I haven't dealt with Fiber Channel disks
   since before SMART was standardized, so I can't tell you how to do that.
2. Get rid of the old ceph-osd process.  Reboot the node if you have
   to.  If things come up cleanly, then you're done.
3. Fsck the filesystem.  If the FS is clean, then you probably
   corrupted the OSD journal.
4. How quickly do you need this fixed?  At this point, I'm out of
   suggestions, so I'd remove the osd, zap it, and add it back in. If
   you can wait, somebody might have a better suggestion.


Fiber Channel hardware is much more complicated than SATA and SAS. There are a lot more parts involved, which leaves more room for bugs.

If you see this problem come back on the same disk, I'd replace the disk. If you see this happen again to other disks, I would get your Fiber Channel vendor involved. It wouldn't hurt to make sure you have the latest firmware on the disks, enclosure, and FC adapter.


--

*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com <mailto:cle...@centraldesktop.com>


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
