Hello ceph community,
We need some immediate help: our cluster is in a very strange and bad state 
after an unexpected reboot of many OSD nodes within a very short time frame.

We have a cluster with 195 OSDs across 9 OSD nodes, originally running 
version 0.80.5.
After an issue in the datacenter, at least 5 OSD nodes rebooted. After the 
reboot not all OSDs came back up, which triggered a lot of recovery, and many 
PGs went into a dead / incomplete state.
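
For reference, this is roughly how we have been checking the cluster and PG 
state (standard ceph CLI run from a monitor node; exact output omitted):

    ceph -s                               # overall health and recovery progress
    ceph health detail | grep incomplete  # list the PGs reported as incomplete
    ceph pg dump_stuck inactive           # PGs that never became active again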

We then tried to restart the OSDs and found that they kept crashing with the 
error "FAILED assert(log.head >= olog.tail && olog.head >= log.tail)", so we 
upgraded to 0.80.7, which includes the fix for #9482. However, we still see 
the error, with different behavior between versions (restart commands below):
0.80.5: once an OSD crashes with this error, any attempt to restart it ends 
in the same crash.
0.80.7: the OSD can be restarted, but after some time another OSD will crash 
with the same error.
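
For completeness, this is roughly how we restart a single OSD and watch for 
the assert (assuming the default sysvinit layout on these Firefly nodes; 
osd.12 is only an example id):

    service ceph start osd.12              # run on the node hosting that OSD
    tail -f /var/log/ceph/ceph-osd.12.log  # watch for the FAILED assert line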

We also tried setting the nobackfill and norecover flags, but that doesn't 
help; the commands we used are shown below.
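
These are the flag commands we ran on a monitor node (unset reverses them):

    ceph osd set nobackfill     # pause backfill
    ceph osd set norecover      # pause recovery
    ceph osd unset nobackfill   # re-enable later
    ceph osd unset norecover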


So the cluster is stuck and we cannot bring more OSDs back.

Any suggestions on how we might have a chance to recover the cluster?
Many thanks,



Luke Kao

MYCOM-OSI
<http://www.mycom-osi.com>
