Oh, it's our fault.

public_addr and cluster_addr use the same NIC (eth1). But we found that during 
recovery the heartbeat could time out because of heavy traffic. I *misunderstood* the 
meaning of the heartbeat and used another NIC's (eth0) address for it to avoid 
timeouts.

With your explanation, it's easy to understand. I also see that the code 
comments (src/ceph-osd.cc) describe this usage.
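
To check my understanding, here is a rough sketch of the rule as I now read it.
It is only a toy model in Python, not Ceph's actual code; the function and
parameter names are made up for illustration.

```python
def osd_considered_healthy(cluster_hb_ok: bool, public_hb_ok: bool) -> bool:
    """Toy model (hypothetical names): heartbeats travel over the same
    interfaces as the cluster network and the public network, and the
    OSD is only useful if both of them work."""
    # Cluster network down: the OSD can't talk with its peers.
    # Public network down: it can't talk with the monitors or clients.
    return cluster_hb_ok and public_hb_ok


if __name__ == "__main__":
    for cluster_ok in (True, False):
        for public_ok in (True, False):
            state = "up" if osd_considered_healthy(cluster_ok, public_ok) else "marked down"
            print(f"cluster={cluster_ok} public={public_ok} -> {state}")
```

So moving the heartbeat to a separate NIC, as we did, only makes the check
test a path that carries no real OSD traffic.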

Best Wishes!

> On Jul 20, 2014, at 1:14, Gregory Farnum <g...@inktank.com> wrote:
> 
> The heartbeat code is very careful to use the same physical interfaces as
> 1) the cluster network
> 2) the public network
> 
> If the first breaks, the OSD can't talk with its peers. If the second
> breaks, it can't talk with the monitors or clients. Either way, the
> OSD can't do its job so it gets marked down.
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
> 
> 
>> On Sat, Jul 19, 2014 at 3:08 AM, Haomai Wang <haomaiw...@gmail.com> wrote:
>> Hi all,
>> 
>> Each of our production ceph nodes has two NICs, one used for heartbeat
>> and the other used for cluster_network.
>> 
>> By accident, the heartbeat NIC broke while the cluster_network NIC
>> stayed healthy. But the OSDs reported the node with the broken NIC as
>> unavailable, so the monitor decided to kick out the node.
>> 
>> I'm not sure whether what I describe matches the code logic. If so, would
>> it be more reasonable for the ceph-osd process to detect that
>> cluster_network is healthy, so that we don't kick out the affected node?
>> 
>> --
>> Best Regards,
>> 
>> Wheat
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
