[ceph-users] Regarding loss of heartbeats

Trygve Vea Tue, 29 Nov 2016 06:07:21 -0800

Since upgrading to Jewel, we've seen quite a bit of odd behaviour in Ceph.
I've written about it a few times to the mailing list.


The two main symptoms are higher CPU utilization after the upgrade and loss of
heartbeats.  We've looked at our network setup, and we've tuned some potential
bottlenecks in a few places.

One interesting thing about the loss of heartbeats: we have observed OSDs
running on the same host losing heartbeats against each other.  I'm not sure
why they are connected at all (we have had some remapped/degraded placement
groups over the weekend, maybe that's why), but I have a hard time pointing the
finger at our network when the heartbeat is lost between two OSDs on the same
server.


I've been staring myself blind at this problem for a while, and just now 
noticed a pretty new bug report that I want to believe is related to what I am 
experiencing: http://tracker.ceph.com/issues/18042

We had one OSD hit a suicide timeout value and kill itself off last night, and
in its log one can see that several of the lost heartbeats are between OSDs on
the same node (zgrep '10.22.9.21.*10.22.9.21' ceph-osd.2.gz).

http://employee.tv.situla.bitbit.net/ceph-osd.2.gz
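
For anyone who wants to repeat that filter outside the shell, here is a minimal
Python sketch of the same idea.  It only assumes what the zgrep pattern assumes:
that a line mentioning the host's address twice describes traffic between two
OSDs on that host.  The function name same_host_lines is made up for this
example, and nothing here depends on the exact Ceph log format.

```python
import gzip
import re
from typing import Iterator


def same_host_lines(path: str, ip: str) -> Iterator[str]:
    """Yield lines of a gzipped log in which `ip` appears at least twice.

    Rough equivalent of: zgrep 'ip.*ip' path
    """
    # re.escape makes the dots literal, so this is slightly stricter than the
    # zgrep pattern, where each '.' matches any character.
    pattern = re.compile(re.escape(ip) + r".*" + re.escape(ip))
    # Open as text; undecodable bytes are replaced rather than raising.
    with gzip.open(path, "rt", errors="replace") as log:
        for line in log:
            if pattern.search(line):
                yield line.rstrip("\n")
```

Calling same_host_lines("ceph-osd.2.gz", "10.22.9.21") and counting or printing
the results should give the same lines as the zgrep command.  Like zgrep, it
would also match a longer address that merely starts with the given one.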


Does anyone have any thoughts about this?  Are we stumbling on a known or
unknown bug in Ceph?


Regards
-- 
Trygve Vea
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
