I am trying to understand why some OSDs (6 out of 21) went down in my cluster while 
running a CBT radosbench benchmark. From the logs below, does this look like a 
networking problem between the systems, or some kind of FileStore problem?

Looking at the log of one crashed OSD, I see the following error at the time of the crash:

2016-09-09 21:30:29.757792 7efc6f5f1700 -1 FileStore: sync_entry timed out 
after 600 seconds.
 ceph version 10.2.1-13.el7cp (f15ca93643fee5f7d32e62c3e8a7016c1fc1e6f4)

Just before that, I see entries like:

2016-09-09 21:18:07.391760 7efc755fd700 -1 osd.12 165 heartbeat_check: no reply 
from osd.6 since back 2016-09-09 21:17:47.261601 front 2016-09-09 
21:17:47.261601 (cutoff 2016-09-09 21:17:47.391758)

and also:

2016-09-09 19:03:45.788327 7efc53905700  0 -- 10.0.1.2:6826/58682 >> 
10.0.1.1:6832/19713 pipe(0x7efc8bfbc800 sd=65 :52000 s=1 pgs=12 cs=1 l=0 
c=0x7efc8bef5b00).connect got RESETSESSION

along with many warnings about slow requests.
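
In case it helps with the networking-vs-FileStore question, here is a minimal 
sketch (assuming the default log location /var/log/ceph/ceph-osd.*.log and the 
heartbeat_check message format quoted above) that tallies which peer OSDs appear 
in the "no reply" lines. Misses spread across many peers would point more toward 
the network; misses concentrated on the OSDs that later crashed would point more 
toward stalled disks/FileStore.

#!/usr/bin/env python
# Sketch: count which peer OSDs show up in heartbeat_check "no reply" lines.
# Assumes the default log path and the message format quoted above.
import glob
import re
from collections import Counter

NO_REPLY = re.compile(r'heartbeat_check: no reply from (osd\.\d+)')

peers = Counter()
for path in glob.glob('/var/log/ceph/ceph-osd.*.log'):
    with open(path) as log:
        for line in log:
            m = NO_REPLY.search(line)
            if m:
                peers[m.group(1)] += 1

# Many distinct peers -> more likely a network-wide problem;
# a handful of peers (the ones that later crashed) -> more likely stuck OSDs.
for osd, count in peers.most_common():
    print('%s  %d' % (osd, count))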


All of the other OSDs that died seem to have done so with:

2016-09-09 19:11:01.663262 7f2157e65700 -1 common/HeartbeatMap.cc: In function 
'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*, const char*, 
time_t)' thread 7f2157e65700 time 2016-09-09 19:11:01.660671
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
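
To compare the two failure modes, here is a companion sketch (same assumptions 
about the log location and line format as above) that pulls the first crash 
signature and its timestamp out of each OSD log, which should show whether the 
sync_entry timeouts and the suicide-timeout asserts cluster together in time 
across the six OSDs:

#!/usr/bin/env python
# Sketch: print the first crash signature seen in each OSD log, with its
# timestamp, so the two failure modes above can be compared across OSDs.
import glob

SIGNATURES = ('sync_entry timed out', 'hit suicide timeout')

for path in sorted(glob.glob('/var/log/ceph/ceph-osd.*.log')):
    with open(path) as log:
        for line in log:
            hits = [sig for sig in SIGNATURES if sig in line]
            if hits:
                # Log lines start with "YYYY-MM-DD HH:MM:SS.micros".
                stamp = ' '.join(line.split()[:2])
                print('%s  %s  %s' % (path, stamp, hits[0]))
                break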


-- Tom Deneau, AMD