Hello all,

Would someone please help me recover from a recent failure of all cache tier 
pool OSDs?

My Ceph cluster has a standard replica-2 pool with a writeback cache tier over 
it (also replica 2), backed by two 500 GB SSD OSDs.

Both cache OSDs were created with the standard ceph-deploy tool and have two 
partitions (one journal and one XFS data partition).

The target_max_bytes parameter for this cache pool was set to 70% of the size 
of a single SSD to avoid overflow. This configuration worked fine for years.
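
For reference, the limit was set with something along these lines (the pool 
name and exact byte figure here are placeholders, not my real values):

    ceph osd pool set cache-pool target_max_bytes 350000000000   # roughly 70% of a 500 GB SSD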

But recently, for some unknown reason, while exporting a large 300 GB raw RBD 
image with the 'rbd export' command, both cache OSDs filled up to 100% and crashed.
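
The export itself was an ordinary invocation along these lines (pool, image 
and destination names are placeholders):

    rbd export rbd-pool/big-image /backup/big-image.raw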

In an attempt to flush all the data from the cache to the underlying pool and 
avoid further damage, I switched the cache pool into 'forward' mode and 
restarted both cache OSDs.
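
For completeness, the mode change was done with the usual tier command (pool 
name again a placeholder; depending on the Ceph release it may also require 
the --yes-i-really-mean-it flag):

    ceph osd tier cache-mode cache-pool forward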

Both ran for a few minutes, then segfaulted again, and now do not start at 
all. Debugging the crash errors, I found that the failure is related to 
decoding object attributes.

When I checked random object files and directories on the affected OSDs with 
'getfattr -d', I discovered that NO extended attributes exist at all anymore.
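
For example, on one of the object files (the OSD id and PG directory below are 
just placeholders for the paths I checked):

    getfattr -d /var/lib/ceph/osd/ceph-2/current/3.1f_head/<object file>
    # prints nothing at all, where the user.ceph.* xattrs would normally be listed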

So I suspect that, due to the filesystems getting 100% full and the OSD 
daemons being restarted several times, XFS was somehow corrupted and lost the 
extended attributes that Ceph requires to operate.


The question is: is it possible to somehow recover the attributes, or to flush 
the cached data back to the cold storage pool?
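
Under normal circumstances I would have tried to flush the tier with something 
like the following (pool name is a placeholder), but with both cache OSDs down 
it obviously cannot complete:

    rados -p cache-pool cache-flush-evict-all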

Would someone please advise or help me recover the data?

--
Regards,
Dmit

