Hi guys,

I'll keep you posted on my issue...


Luis Freitas wrote:
Laurent,

What you need is to be able to decide which node still has network connectivity. If both have connectivity, you could fence either of them. If both lost connectivity (someone turned the switch off), then you are in trouble.

You will need to plug the back-end network into a switch and monitor the interface status, so that when one machine is shut down or you disconnect its network cable, you still see the link as up on the other machine. If you don't want to use two switches, plug them into the same switch and use different VLANs.
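A minimal sketch of that kind of link monitoring, assuming the back-end NIC is named eth1 and using the kernel's /sys carrier flag; the decision helper is a hypothetical illustration for a cluster manager, not part of OCFS2:

```shell
#!/bin/sh
# Sketch of a link-state check a cluster manager could run.
# Assumes the back-end interface is eth1 (adjust to your setup).
IFACE=${IFACE:-eth1}

# link_up <iface>: succeeds if the kernel reports carrier on the NIC
link_up() {
    [ "$(cat "/sys/class/net/$1/carrier" 2>/dev/null)" = "1" ]
}

# choose_victim <local_link_state>: pure decision helper, kept separate
# so it can be exercised without real hardware; "1" means our own link
# is up.
choose_victim() {
    if [ "$1" = "1" ]; then
        echo "peer"   # we still have connectivity: fence the peer
    else
        echo "self"   # we lost our own link: we are the one to go
    fi
}

if link_up "$IFACE"; then
    choose_victim 1
else
    choose_victim 0
fi
```

With two switches (or two VLANs), a down carrier on the surviving node's NIC distinguishes "my cable/NIC died" from "the peer died".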

Yes, I managed to do that. My cluster manager is able to tell which node is still up before the OCFS2 timers fence every node but the lowest-numbered one, even when it is node0 that is off the network and node1 that is still connected.

To deal with OCFS2, I think the easiest approach is to increase its timeouts so your cluster manager can decide which node will survive before the OCFS2 heartbeat fences the node. I wouldn't mess with its inner workings, YMMV...
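For reference, the disk-heartbeat timeout is driven by O2CB_HEARTBEAT_THRESHOLD in the o2cb init configuration; a sketch, assuming the usual /etc/sysconfig/o2cb path (per the OCFS2 FAQ, the fence delay is roughly (threshold - 1) * 2 seconds):

```shell
# /etc/sysconfig/o2cb (path may differ by distribution)
O2CB_ENABLED=true
O2CB_BOOTCLUSTER=ocfs2
# Raise the disk heartbeat threshold so a node self-fences only after
# roughly (61 - 1) * 2 = 120 seconds, giving the cluster manager time
# to act first:
O2CB_HEARTBEAT_THRESHOLD=61
```

Restarting the o2cb service (with the volume unmounted) is needed for the new threshold to take effect, and all nodes must use the same value.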

I think I managed to give my cluster manager enough time to decide without having to increase the OCFS2 timeouts.

But my problem is not there.
It is _HOW_ to cancel OCFS2 self-fencing on node1 once I have worked out that node0, not node1, has to be fenced.

Here is what I tried:
node0 and node1 are both OK, in the OCFS2 cluster, the shared disk is mounted, all is fine. I assume both of them are writing their timestamps every two seconds to their blocks in the "heartbeat system file", as mentioned in the FAQ.

By the way, what/where is this "heartbeat system file"?
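In case it helps, the heartbeat file is one of OCFS2's hidden system files; assuming a mainline debugfs.ocfs2, you should be able to list them and dump the heartbeat region (the device name here is just an example from this thread):

```shell
# List the system directory (//) where OCFS2 keeps its hidden files,
# including the heartbeat file each node writes its block to:
debugfs.ocfs2 -R "ls //" /dev/drbd0

# Dump the heartbeat blocks (per-node sequence/generation counters):
debugfs.ocfs2 -R "hb" /dev/drbd0
```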

When I unplug node0's network link, both nodes report that they lost network communication with their peer. Within the first five seconds, my cluster manager works out that node0 is off the network and node1 is OK. So it decides to fence node0 and to cancel the fencing of node1 (whereas, per OCFS2's rule of fencing the higher-numbered node and keeping the lower one alive, node1 would be the one fenced).

So cluster manager runs "ocfs2_hb_ctl -K -d /dev/drbd0", which stops heartbeat on node1.

But this does not prevent node1 from self-fencing 28 seconds after the network communication was lost, nor node0 from staying alive with its dead card. My entire cluster is down: no service and no data access remain available.

In hindsight that is logical: heartbeat was stopped, but the timers kept counting down and nothing reset them.

*/Sunil Mushran <[EMAIL PROTECTED]>/* wrote:

    Each of those pings will require a timeout - short timeouts. So short
    that you may not even be able to distinguish between errors and an
    overloaded run-queue, transmit queue, router, etc.


Once more, I think I achieved that. My problem is to cancel node1's self-fencing,
not to decide to do so.


I'm sorry to bother you; you may find this trivial, but I have probably missed something.

You wrote "one does not have to have 3 nodes when one only wants 2 nodes".
Great, that suits me, as I don't (and can't) have a SAN, and DRBD allows at most 2 nodes
for disk sharing.

I also read that fencing all nodes but the lowest-numbered one is the intended behavior
of OCFS2.

So let me rephrase my question:

How can I make a 2-node cluster work with high availability, i.e. still have access to the remaining node in the event of _ANY_ single-node failure? The cluster will be degraded, with only one node remaining until we repair and power up the failed node, but with no
loss of service.
Even if node0 fails, node1 keeps running the tasks instead of self-fencing.

Once more thanks a lot for your help.

Have a good day,

best regards,

Laurent.




_______________________________________________
Ocfs2-users mailing list
[email protected]
http://oss.oracle.com/mailman/listinfo/ocfs2-users
