Hi Dejan, and thanks for the responses,
still, several remarks below ...
Alain
> Hi,
> > 
> > I'm trying to clearly evaluate the risk of split brain and the risk of 
> > dual-fencing with pacemaker/openais in
> > the case I can't choose anything else but having only *one* network for 
>   
>
> Oops.
>
>   
>> > totem protocol :
>> >  
>> >  Let's say we have a two-node cluster with stonith resources :
>> >  - if there is a problem on one node (not a network problem) :
>> >             the other will become DC (if not already) and fence the
>> >             failing node.
>> >  - if there is a network failure between one node and the eth switch :
>> >             neither node receives tokens from the other anymore, but
>> >             only the DC has the right to take a decision in the
>> >             cluster, and specifically the decision to fence the other
>> >             node, so the DC node should fence the other.
>> >             The only problem I can see here is if the "not-DC" node
>> >             declares itself as new DC before being fenced, and
>> >             therefore also decides to fence the other node, which
>> >             could lead to a dual-fencing situation. So the fence
>> >             request from the initial DC node should happen within the
>> >             DC Deadtime window (default 60s) to eliminate any risk of
>> >             dual-fencing.
>>     
>
> Have you ever tried this? If that indeed makes the non-DC node
> wait with fencing, then that may help.
>   
No, it's my "on-paper" understanding, but I'll try ...
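For reference, the timeout in question should be tunable as a cluster property; something like this (crm shell syntax; the 60s value only echoes the default quoted above, so treat both the name and the value as assumptions on my side to be checked against the installed version):

```shell
# Hypothetical sketch: set the DC election deadtime explicitly, i.e. how
# long a node waits before declaring itself DC -- the window the initial
# DC has to get its fence request out first.
crm configure property dc-deadtime=60s
```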
>   
>> > In any case, we can't have a split-brain situation if a fence does not 
>> > complete successfully. Am I right ?
>>     
>
> No. It is a split-brain situation as soon as nodes can't
> communicate.
>   
OK, you're right; in fact I meant the risk of shared resources being 
mounted on both sides, which is the worst thing that can happen in a 
"split-brain" if no fencing occurs.
>   
>> > And if we have a cluster with more than two nodes, it seems similar to me ...
>>     
>
> No, because the partition without quorum can't fence nodes. That
> makes things simpler and more predictable.
>   
... what if no-quorum-policy=ignore ?
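(i.e. with something like the following, in crm shell syntax, as a sketch of what I mean:

```shell
# Keep running resources even when the partition has no quorum --
# which, if I understand correctly, makes the >2-node case behave
# like the 2-node case again: either partition may try to fence.
crm configure property no-quorum-policy=ignore
```
)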
>   
>> > Am I right about all this ? or did I miss something somewhere ?
>>     
>
> I'm not sure if my response helps at all. You should test this
> thoroughly. For instance, we have one bugzilla open for
> external/ipmi where nodes did shoot each other on split brain.
>   
Could I have the bugzilla number ?
It's not really easy to test whether we can get dual-fencing in case of 
a network failure. For example, I worked with Cluster Suite for several 
years in two-node mode, without the quorum-disk functionality (it did 
not work well in the beginning). In that case there is a race to fence 
between both nodes (there is no DC notion in CS), and RH always said 
that the probability of dual-fencing after a heartbeat network problem 
is near 0, but not 0.
OK, fine, but I have some big customer sites with hundreds of HA pairs, 
and on those sites, despite the probability being near 0, it has 
happened several times; not many, but several. So we can't really test 
this dual-fencing risk; I think we have to rely on the on-paper 
behavior alone for this specific case, and try to find the 
configuration which avoids dual-fencing for sure, and also avoids 
shared resources being mounted on both sides.
That's what I'm trying to find with Pacemaker & openais.
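One configuration I'd like to try for that: a symmetric pair of stonith resources, each pinned away from the node it fences, so a node can never shoot itself. A sketch in crm shell syntax (the external/ipmi agent is the one mentioned above; the node names, addresses and credentials are assumptions on my part, to be adapted to the real fencing hardware; some fence agents also accept a delay parameter, which could be set on only one side to break the fencing race, but that needs to be checked per agent):

```shell
# Hypothetical two-node layout: each node runs the stonith resource
# that fences the *other* node.
crm configure primitive st-node1 stonith:external/ipmi \
    params hostname=node1 ipaddr=10.0.0.1 userid=admin passwd=secret
crm configure primitive st-node2 stonith:external/ipmi \
    params hostname=node2 ipaddr=10.0.0.2 userid=admin passwd=secret
# forbid each stonith resource from running on its own target
crm configure location l-st-node1 st-node1 -inf: node1
crm configure location l-st-node2 st-node2 -inf: node2
```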

Thanks
Alain Moullé
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
