[Linux-HA] corosync communication stops after link down

Matthias Ferdinand Wed, 24 Sep 2014 13:36:07 -0700

OS: Ubuntu 14.04 64bit
corosync: 2.3.3-1ubuntu1
2 nodes
2 rings (em1, bond0(p2p1,p1p1)) rrp_mode: active,
        all with crossover cables, no switches
transport: udpu



If the cluster has been up for some time (here: ~1 week) and one node is
rebooted, corosync on the surviving node (which sees no-carrier on all
corosync-related interfaces while the peer is down) does not resume
sending packets when the links come up again after the peer has finished
rebooting (3-4 minutes of link down; tcpdump on both nodes, on both em1
and bond0, shows no packets from the surviving node). The rebooted node
then cannot see any neighbor and consequently decides to stonith the peer
before starting resources. But the resources still cannot run until the
stonith'd node has completely rebooted, because the drbd volumes became
outdated at "shutdown -r now" time.
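
To rule out the links themselves, the carrier state on the surviving node
can be compared against the tcpdump output. A minimal sketch, assuming the
standard Linux sysfs attribute /sys/class/net/<iface>/carrier; the function
names read_carrier and link_report are made up for illustration and are not
part of corosync:

```python
from pathlib import Path


def read_carrier(iface, sysfs_root="/sys/class/net"):
    """Return True (carrier), False (no-carrier) or None (state unreadable)."""
    try:
        # The kernel exposes "1" when the link has carrier and "0" otherwise.
        text = (Path(sysfs_root) / iface / "carrier").read_text()
    except OSError:
        # Missing interface, or interface administratively down (read fails).
        return None
    return text.strip() == "1"


def link_report(ifaces, sysfs_root="/sys/class/net"):
    """Map each interface name to its carrier state."""
    return {iface: read_carrier(iface, sysfs_root) for iface in ifaces}


if __name__ == "__main__":
    labels = {True: "carrier", False: "no-carrier", None: "unknown"}
    # em1 and bond0 are the two ring interfaces from the setup above.
    for iface, state in link_report(["em1", "bond0"]).items():
        print(f"{iface}: {labels[state]}")
```

If this reports carrier on both em1 and bond0 while tcpdump still shows no
packets from the surviving node, the links are up and the silence is on the
corosync side.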

Subsequent reboots do not show any problems. After another ~1 week of
uptime, a reboot triggers the problem again.

This happened on two different cluster installs with roughly the same
hardware (Dell PowerEdge R520 and R420 respectively, onboard Broadcom
BCM5720 (em1), 2x 2-port Intel I350 (p2p1, p1p1)).


Any ideas?

Regards
  Matthias Ferdinand
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
