Oh, it's our fault. Both public_addr and cluster_addr use the same NIC (eth1). However, we found that heartbeats may time out during recovery because of heavy traffic. I *misunderstood* the meaning of the heartbeat and used the address of another NIC (eth0) for heartbeats to avoid the timeouts.
From your explanation, it's easy to understand. I also see that the code comments (src/ceph-osd.cc) describe this usage.

Best Wishes!

> On Jul 20, 2014, at 1:14, Gregory Farnum <g...@inktank.com> wrote:
>
> The heartbeat code is very careful to use the same physical interfaces as
> 1) the cluster network
> 2) the public network
>
> If the first breaks, the OSD can't talk with its peers. If the second
> breaks, it can't talk with the monitors or clients. Either way, the
> OSD can't do its job so it gets marked down.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>> On Sat, Jul 19, 2014 at 3:08 AM, Haomai Wang <haomaiw...@gmail.com> wrote:
>> Hi all,
>>
>> Each of our production Ceph nodes has two NICs: one used for heartbeats,
>> the other used by cluster_network.
>>
>> By accident, the heartbeat NIC broke while the cluster_network NIC
>> stayed healthy. But the OSDs reported the node with the broken NIC as
>> unavailable, so the monitor decided to kick out the node.
>>
>> I'm not sure whether what I describe matches the code logic. If so, would
>> it be more reasonable for the ceph-osd process to detect that
>> cluster_network is healthy, so that we don't kick out the broken node?
>>
>> --
>> Best Regards,
>>
>> Wheat
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
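
For anyone reading this thread later, the rule Greg describes can be sketched in a few lines of Python. This is only an illustration of the reasoning, not the actual ceph-osd logic: the names `HeartbeatStatus` and `should_mark_down` are hypothetical, and the two booleans stand in for whatever heartbeat checks the OSDs really perform over each interface.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatStatus:
    """Hypothetical summary of heartbeat results for one OSD."""
    cluster_ok: bool  # heartbeats over the cluster network interface succeed
    public_ok: bool   # heartbeats over the public network interface succeed


def should_mark_down(status: HeartbeatStatus) -> bool:
    """Return True if the OSD cannot do its job.

    Assumption taken from the thread: losing the cluster network cuts the
    OSD off from its peers, and losing the public network cuts it off from
    monitors and clients, so a failure on either one is enough.
    """
    return not (status.cluster_ok and status.public_ok)


def reason(status: HeartbeatStatus) -> str:
    """Describe in words why an OSD would (or would not) be marked down."""
    if not status.cluster_ok:
        return "cannot talk with its peers"
    if not status.public_ok:
        return "cannot talk with the monitors or clients"
    return "healthy"


if __name__ == "__main__":
    for s in (HeartbeatStatus(True, True),
              HeartbeatStatus(False, True),
              HeartbeatStatus(True, False)):
        print(s, should_mark_down(s), reason(s))
```

Under this model, pointing the heartbeat at a separate NIC (as we did with eth0) only adds one more interface whose failure gets the OSD marked down, without showing that the cluster or public path is actually usable.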