I'm seeing one OSD spamming its log with:
2014-04-02 16:49:21.547339 7f5cc6c5d700 1 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f5cc3456700' had timed out after 15
This starts about 30 seconds after the OSD daemon is started, and it
continues until:
2014-04-02 16:48:57.526925 7f0e5a683700 1 heartbeat_map is_healthy
'OSD::op_tp thread 0x7f0e3c857700' had suicide timed out after 150
2014-04-02 16:48:57.528008 7f0e5a683700 -1 common/HeartbeatMap.cc: In
function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*,
const char*, time_t)' thread 7f0e5a683700 time 2014-04-02 16:48:57.526948
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
I tried bumping up the logging levels and didn't see anything
interesting. I also tried strace, and all I can really see is that the
OSD spends most of its time in FUTEX_WAIT.
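For reference, the strace invocation was roughly this (the PID is just
an example; I attached to the running ceph-osd and followed its threads):

    # Follow all threads and only show futex calls:
    strace -f -e trace=futex -p 12345

    # Or summarize time spent per syscall instead:
    strace -f -c -p 12345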
This OSD has been flapping for several days now. None of the other OSDs
are having this issue.
I thought it might be similar to Quenten Grasso's post about 'OSD
Restarts cause excessively high load average and "requests are blocked >
32 sec"'. At first it looked similar, but Quenten said his OSDs
eventually settle down. Mine never does.
Can I increase that 15-second timeout, to see if the OSD just needs
additional time? I don't see anything about it in the Ceph docs.
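If those 15 / 150 second values map to osd op thread timeout and osd op
thread suicide timeout (that's an assumption on my part, not something
I've confirmed in the docs), I'd guess something like this would raise
them for just this OSD:

    # Runtime change for only the problem OSD (osd.12 is a placeholder):
    ceph tell osd.12 injectargs '--osd-op-thread-timeout 60 --osd-op-thread-suicide-timeout 600'

    # Or persistently in ceph.conf under that OSD's section, then restart it:
    [osd.12]
        osd op thread timeout = 60
        osd op thread suicide timeout = 600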
Otherwise, I'm pretty close to removing the disk, zapping it, and adding
it back to the cluster (rough plan below). Any other suggestions?
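If it does come to that, my plan is roughly the usual out / remove /
zap / re-create sequence, something like this (osd.12 and /dev/sdb are
placeholders for the real OSD id and device):

    ceph osd out 12
    # (wait for recovery to finish, then stop the osd.12 daemon via your init system)
    ceph osd crush remove osd.12
    ceph auth del osd.12
    ceph osd rm 12
    # wipe the disk and re-create the OSD on it
    ceph-disk zap /dev/sdb
    ceph-disk prepare /dev/sdb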
--
*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com
*Central Desktop. Work together in ways you never thought possible.*
Connect with us Website <http://www.centraldesktop.com/> | Twitter
<http://www.twitter.com/centraldesktop> | Facebook
<http://www.facebook.com/CentralDesktop> | LinkedIn
<http://www.linkedin.com/groups?gid=147417> | Blog
<http://cdblog.centraldesktop.com/>