Hi Dejan, and thanks for the responses, yet several remarks below ...
Alain

> Hi,
>
> > I'm trying to clearly evaluate the risk of split brain and the risk of
> > dual fencing with pacemaker/openais in the case where I can't choose
> > anything else but having only *one* network for
>
> Oops.
>
> > the totem protocol:
> >
> > Let's say we have a two-node cluster with stonith resources:
> > - if there is a problem on one node (not a network problem):
> >   the other will become DC (if not already) and fence the failing node.
> > - if there is a network failure between one node and the eth switch:
> >   each node no longer gets any token from the other node, but only the
> >   DC has the right to take decisions in the cluster, specifically the
> >   decision to fence the other node, so the DC node should fence the
> >   other. The only problem I can see here is if the "not-DC" node
> >   declares itself the new DC before being fenced, and therefore also
> >   decides to fence the other node, which could lead to a dual-fencing
> >   situation. So the fence request from the initial DC node should
> >   happen before the DC Deadtime value (default 60s) to eliminate any
> >   risk of dual fencing.
>
> Have you ever tried this? If that indeed makes the non-DC node
> wait with fencing, then that may help.

No, it's my "on-paper understanding", but I'll try ...

> > In any case, we can't have a split-brain situation if a fence does not
> > complete successfully. Am I right?
>
> No. It is a split-brain situation as soon as nodes can't
> communicate.

OK, you're right; in fact, I wanted to talk about the risk of shared
resources mounted on both sides, which is indeed the worst thing that
could happen in a "split brain" if no fencing occurs.

> > And if we have a more-than-two-node cluster, it seems similar to me ...
>
> No, because the partition without quorum can't fence nodes.
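To make my "on-paper understanding" explicit, here is a toy shell model of the race described above. This is not Pacemaker code, just the timing argument: on a split, the DC fences immediately, while the non-DC node must first wait out the DC Deadtime before electing itself DC and shooting back, so in this model dual fencing is only possible when the DC's fence operation takes longer than the deadtime.

```shell
# Toy model of the dc-deadtime reasoning (NOT Pacemaker code).
# $1 = seconds the DC's fence operation takes to complete
# $2 = dc-deadtime in seconds (the delay before the non-DC node
#      elects itself DC and starts its own fence attempt)
dual_fencing_risk() {
    fence_duration="$1"
    dc_deadtime="${2:-60}"
    if [ "$fence_duration" -gt "$dc_deadtime" ]; then
        echo "dual-fencing possible"
    else
        echo "safe"
    fi
}

dual_fencing_risk 10 60   # fence completes well inside the deadtime
dual_fencing_risk 90 60   # slow fence device: both nodes may shoot
```

Of course this assumes the non-DC node really does wait the full deadtime before fencing, which is exactly the point that needs testing.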
> That makes things simpler and more predictable.

... what if no-quorum-policy=ignore ?

> > Am I right about all this? Or did I miss something somewhere?
>
> I'm not sure if my response helps at all. You should test this
> thoroughly. For instance, we have one bugzilla open for
> external/ipmi where nodes did shoot each other on split brain.

Could I have the bugzilla number?

It's not really easy to test whether dual fencing can happen on a network
failure. For example, I used to work with Cluster Suite for several years,
in two-node mode with no quorum-disk functionality (it did not work well in
the beginning). In that case there is a race to fence between both nodes
(there is no DC notion in CS), and RH always said that the probability of
dual fencing on a heartbeat network problem is near 0, but not 0. OK, fine,
but I have some big customer sites with hundreds of HA pairs, and on these
sites, despite the probability being near 0, it has happened several times:
not many, but several.

So we can't really test this dual-fencing risk; I think we have to rely on
the on-paper behavior alone for this specific case, and try to find the
configuration which avoids dual fencing for sure, and also avoids shared
resources being mounted on both sides. That's what I'm trying to find with
Pacemaker & openais.
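For concreteness, the kind of configuration I'm after would look roughly like the crm-shell sketch below. This is untested and the names must be checked against your Pacemaker version: the IPMI address and credentials are placeholders, and pcmk_delay_max (a random pre-fence delay on one stonith resource, staggering the shoot-out so one side fires first) only exists if your Pacemaker supports it.

```
# Two-node cluster: keep fencing on, ignore quorum loss (placeholder values)
crm configure property stonith-enabled=true
crm configure property no-quorum-policy=ignore
crm configure primitive st-node2 stonith:external/ipmi \
        params hostname=node2 ipaddr=192.168.1.2 userid=admin \
               passwd=secret pcmk_delay_max=15
```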
Thanks
Alain Moullé
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
