Hi,

we're running a two-node cluster with redundant rings.
Ring 0 is a direct 10 Gbit/s connection; ring 1 consists of two 1 Gbit/s interfaces per node, bonded in active-backup mode and connected through two independent switches. The ring 1 network is our "normal" 1G LAN and should only be used if the direct 10G connection fails. Quite often (about once a day on average, I'd guess) I see ring 1 (and only that one) marked as FAULTY for no obvious reason:

Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Retransmit List: c76 c7a c7c c7e c80 c82 c84
Aug  2 08:56:15 node02 corosync[5752]:  [TOTEM ] Retransmit List: c82
Aug 2 08:56:15 node02 corosync[5752]: [TOTEM ] Marking seqid 568416 ringid 1 interface x.y.z.1 FAULTY - administrative intervention required.
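To get a feeling for how often this happens and what precedes it, lines like these could be tallied from syslog. The following is only a rough sketch: the function name parse_line is made up for illustration, the patterns are derived solely from the three lines quoted above, and it assumes that the "Retransmit List" entries are hexadecimal sequence numbers, which I have not confirmed.

```python
import re

# Patterns derived only from the log excerpt above; other corosync
# versions may word these messages differently.
FAULTY_RE = re.compile(
    r"Marking seqid (?P<seqid>\d+) ringid (?P<ring>\d+) "
    r"interface (?P<iface>\S+) FAULTY"
)
RETRANSMIT_RE = re.compile(r"Retransmit List:\s*(?P<ids>[0-9a-fA-F ]+)$")


def parse_line(line):
    """Return a dict describing a TOTEM log line, or None if it is unrelated."""
    line = line.rstrip()
    match = FAULTY_RE.search(line)
    if match:
        return {
            "kind": "faulty",
            "seqid": int(match.group("seqid")),
            "ring": int(match.group("ring")),
            "interface": match.group("iface"),
        }
    match = RETRANSMIT_RE.search(line)
    if match:
        # Assumption: the entries are hexadecimal sequence numbers.
        ids = [int(token, 16) for token in match.group("ids").split()]
        return {"kind": "retransmit", "ids": ids}
    return None
```

Feeding every syslog line through parse_line and counting the "faulty" results per day would show whether the markings cluster around particular times, for example backup windows on the shared LAN.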

Whenever I see this, I check whether the other node's address can be pinged (I have never seen any connectivity problems there), then re-enable the ring with "corosync-cfgtool -r", and everything looks OK for a while (i.e. hours or days).
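Written down as a script, the manual procedure looks roughly like the sketch below. The function names and the peer address argument are placeholders for illustration; the ping options assume the Linux iputils ping, and this is only what I do by hand, not a recommendation that resetting the ring automatically is safe.

```python
import subprocess


def run(cmd):
    """Run a command quietly and return its exit status (0 means success)."""
    result = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode


def reenable_if_reachable(peer, runner=run):
    """Ping the peer on the ring 1 network; reset the rings only if it answers.

    Returns True if the reset command was issued successfully, else False.
    """
    # -c 3: send three echo requests; -W 2: wait up to two seconds per reply.
    if runner(["ping", "-c", "3", "-W", "2", peer]) != 0:
        return False
    # corosync-cfgtool -r resets the redundant ring state after a fault.
    return runner(["corosync-cfgtool", "-r"]) == 0
```

The runner parameter exists only so that the logic can be exercised without a cluster; on a real node, reenable_if_reachable would be called with the partner's ring 1 address.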

How could I find out why this happens?
What do the "Retransmit List" and seqid (sequence ID, I assume?) values tell me? Is it safe to re-enable the second ring when the partner node can be pinged successfully?

The totem section of our config looks like this:

totem {
       rrp_mode:       passive
       join:   60
       max_messages:   20
       vsftype:        none
       consensus:      10000
       secauth:        on
       token_retransmits_before_loss_const:    10
       threads:        16
       token:  10000
       version:        2
       interface {
               bindnetaddr:    192.168.1.0
               mcastaddr:      239.250.1.1
               mcastport:      5405
               ringnumber:     0
       }
       interface {
               bindnetaddr:    x.y.z.0
               mcastaddr:      239.250.1.2
               mcastport:      5415
               ringnumber:     1
       }
       clear_node_high_bit:    yes
}
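The timing values in this block can be sanity-checked mechanically. The sketch below is a minimal illustration, not a full corosync.conf parser: the function names are made up, it only reads the top-level "key: value" pairs, and the single rule it checks (consensus at least 1.2 * token) is taken from my reading of corosync.conf(5). Whether that rule has anything to do with the FAULTY markings is not established.

```python
def totem_settings(text):
    """Collect the top-level 'key: value' pairs of a totem { ... } block."""
    settings = {}
    depth = 0
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("{"):
            depth += 1          # entering totem { or a nested interface {
        elif line == "}":
            depth -= 1          # leaving a block
        elif depth == 1 and ":" in line:
            key, value = line.split(":", 1)
            settings[key.strip()] = value.strip()
    return settings


def consensus_too_low(settings):
    """Return True if consensus < 1.2 * token, None if either value is unset."""
    if "token" not in settings or "consensus" not in settings:
        return None
    token = int(settings["token"])
    consensus = int(settings["consensus"])
    # Integer form of consensus < 1.2 * token, avoiding float rounding.
    return consensus * 10 < token * 12
```

With the values shown above (token 10000, consensus 10000), consensus_too_low reports True, so that pair may be worth a second look.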

--
Sebastian
