[Ocfs2-users] Help ! OCFS2 unstable on Disparate Hardware

Jorge Adrian Salaices Thu, 26 Jan 2012 17:52:21 -0800

I have been trying to convince management at work that we should move from NFS to OCFS2 for sharing the application layer of our Oracle EBS (E-Business Suite), and for a general "Backup Share", but general instability in my setup has dissuaded me from recommending it.

I have a mixture of 1.4.7 (EL 5.3) and 1.6.3 (EL 5.7 + UEK), and something as simple as an umount has triggered random node reboots, even on nodes that have other OCFS2 mounts not shared by the rebooting nodes. The problem is that I have disparate hardware, and some of these servers are even VMs.

Several documents state that nodes have to be roughly equal in power and specs, and in my case that will never be. Unfortunately, I have had several other random fencing events that common checks have not explained. For example, my network has never been the problem, yet one server may see another one go away while all of the other services on that node are running perfectly fine. I can only surmise that an elevated load on the server starved the heartbeat process and prevented it from sending network packets to the other nodes.

My config has about 40 nodes. I have 4 or 5 different shared LUNs on our SAN, and not all servers share all mounts: only 10 or 12 share one LUN, 8 or 9 share another, and 2 or 3 share a third. Unfortunately, the complexity is such that a server may intersect with some of the servers but not all. Perhaps changing my config to create separate clusters would be the solution, but only if a node can be part of multiple clusters:


node:
        ip_port = 7777
        ip_address = 172.20.16.151
        number = 1
        name = txri-oprdracdb-1.tomkinsbp.com
        cluster = ocfs2-back

node:
        ip_port = 7777
        ip_address = 172.20.16.152
        number = 2
        name = txri-oprdracdb-2.tomkinsbp.com
        cluster = ocfs2-back

node:
        ip_port = 7777
        ip_address = 10.30.12.172
        number = 4
        name = txri-util01.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back

node:
        ip_port = 7777
        ip_address = 10.30.12.94
        number = 5
        name = txri-util02.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back

cluster:
        node_count = 2
        name = ocfs2-back

cluster:
        node_count = 2
        name = ocfs2-util

Is this even legal, or can it be done some other way?

Or is this done based on the different domains that are created once a mount is done?
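
To make the question concrete, here is a minimal Python sketch (not an O2CB tool) that reads the stanzas from the config above and reports, for each cluster, how many nodes list it versus the declared node_count, plus which nodes name more than one cluster. Treating "cluster = a, b" as a comma-separated list is purely an assumption of this sketch; whether O2CB accepts that is exactly what is being asked. The ip_port and ip_address lines are left out for brevity, and the function names are made up for illustration.

```python
from collections import defaultdict

# Trimmed copy of the config above (ip_port / ip_address omitted for brevity).
CONF = """\
node:
        number = 1
        name = txri-oprdracdb-1.tomkinsbp.com
        cluster = ocfs2-back

node:
        number = 2
        name = txri-oprdracdb-2.tomkinsbp.com
        cluster = ocfs2-back

node:
        number = 4
        name = txri-util01.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back

node:
        number = 5
        name = txri-util02.tomkinsbp.com
        cluster = ocfs2-util, ocfs2-back

cluster:
        node_count = 2
        name = ocfs2-back

cluster:
        node_count = 2
        name = ocfs2-util
"""


def parse_stanzas(text):
    """Split stanza-style text into a list of (kind, {key: value}) pairs."""
    stanzas = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.endswith(":") and "=" not in line:
            # A header such as "node:" or "cluster:" starts a new stanza.
            current = (line[:-1], {})
            stanzas.append(current)
        elif "=" in line and current is not None:
            key, value = line.split("=", 1)
            current[1][key.strip()] = value.strip()
    return stanzas


def cluster_members(stanzas):
    """Map cluster name -> node names that list it (comma list is an assumption)."""
    members = defaultdict(list)
    for kind, fields in stanzas:
        if kind != "node":
            continue
        for cluster in fields.get("cluster", "").split(","):
            if cluster.strip():
                members[cluster.strip()].append(fields["name"])
    return dict(members)


stanzas = parse_stanzas(CONF)
members = cluster_members(stanzas)
declared = {f["name"]: int(f["node_count"]) for k, f in stanzas if k == "cluster"}
multi = sorted(f["name"] for k, f in stanzas
               if k == "node" and "," in f.get("cluster", ""))

for name, count in declared.items():
    print(f"{name}: node_count = {count}, nodes listing it = {len(members.get(name, []))}")
print("nodes naming more than one cluster:", multi)
```

Run against the config as posted, this shows four nodes listing ocfs2-back while its node_count is 2, and txri-util01 and txri-util02 as the two nodes that would need to belong to both clusters.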

How can I make the cluster more stable? And why does a node fence itself on the cluster even if it does not have any locks on the shared LUN? It seems that a node may be "fenceable" simply by having the OCFS2 services turned on, without a mount.

Is this correct?

Another question I have: can the fencing method be something other than panic or restart? Can a third-party or userland event be triggered to recover from what the "heartbeat" or "network tests" may construe as a downed node?


Thanks for any help you can give me.


--
Jorge Adrian Salaices
Sr. Linux Engineer
Tomkins Building Products

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
http://oss.oracle.com/mailman/listinfo/ocfs2-users
