It's a two-node cluster with online updates. grpSAPINSTresources is a group that runs on node2, while grpSAPDBresources is another group that runs on node1. clone-grpDlmO2cb and clone-grpSharedFS are clone resources that run on both nodes and are prerequisites for grpSAPDBresources and grpSAPINSTresources (via colocation and order rules).
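(For context, the dependency wiring I mean is roughly of the following shape in the crm shell. This is a sketch only: the constraint IDs below are made up for illustration, not copied from my actual CIB.)

```
# Sketch: DLM/O2CB clone must run before the shared-FS clone,
# and both groups need the shared FS on their node.
order ord-dlm-before-fs inf: clone-grpDlmO2cb clone-grpSharedFS
colocation col-fs-with-dlm inf: clone-grpSharedFS clone-grpDlmO2cb
order ord-fs-before-db inf: clone-grpSharedFS grpSAPDBresources
colocation col-db-with-fs inf: grpSAPDBresources clone-grpSharedFS
order ord-fs-before-inst inf: clone-grpSharedFS grpSAPINSTresources
colocation col-inst-with-fs inf: grpSAPINSTresources clone-grpSharedFS
```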
When both nodes are online this cluster behaves normally, but when I reboot (or `killall -9 corosync`) either node, say node2 (which runs grpSAPINSTresources), the cluster crashes. I opened a support ticket with Novell, and according to the Support Engineer the cluster misbehaves due to the OCFS2 file system, i.e. OCFS2 has issues. Novell Support provided me with the following RPMs:

libdlm3-3.00.01-0.10.6.3527.1.PTF.707049.x86_64
libdlm-3.00.01-0.10.6.3527.1.PTF.707049.x86_64
ocfs2-kmp-default-1.6_2.6.32.45_0.3-0.4.2.1.3431.10.PTF.712568.x86_64

and asked me to run fsck on the OCFS2 devices. Per their instructions, I installed the RPMs and ran fsck on the OCFS2 file systems... but no luck ;( I am copying the Support Reply below for your kind consideration. Please suggest/recommend.

<snip>
>
> As I already informed you, 'grpSAPINSTresources' is a group of the following primitive resources:
>   i   IP-for-SAPInst-ASCS  ocf:heartbeat:IPaddr2
>   ii  IP-for-SAPInst-CI    ocf:heartbeat:IPaddr2
>   iii SAPInst-ASCS00       ocf:heartbeat:SAPInstance
>   iv  SAPInst-DVEBMGS01    ocf:heartbeat:SAPInstance
>   v   SAPInst-D02          ocf:heartbeat:SAPInstance
>
> and 'grpSAPDBresources' is a group of the following primitive resources:
>   i   IP-for-ORADB   ocf:heartbeat:IPaddr2
>   ii  FS-ORACLE      ocf:heartbeat:Filesystem
>   iii SAPDBInstance  ocf:heartbeat:SAPDatabase
>
> Due to the following location rules, 'grpSAPDBresources' always starts on node sap-prd1, while 'grpSAPINSTresources' always starts on node sap-prd2:
>   location PrimaryLoc-of-grpSAPDBresources grpSAPDBresources +inf: sap-prd1
>   location PrimaryLoc-of-grpSAPINSTresources grpSAPINSTresources +inf: sap-prd2
>
> The cluster was running fine today, i.e. grpSAPDBresources was running on node 'sap-prd1' and grpSAPINSTresources on node 'sap-prd2'. Then, as a test, I rebooted node 'sap-prd2'; within 50 seconds grpSAPINSTresources started on node 'sap-prd1', where 'grpSAPDBresources' was already running.
> When sap-prd2 came back (after the reboot), grpSAPINSTresources left sap-prd1 and moved to its preferred location, i.e. node 'sap-prd2'.

ok, this one seems to have worked.

> After doing the above exercise I once again rebooted node sap-prd2, and this time, as usual, grpSAPINSTresources tried to start on node 'sap-prd1'... but failed. Only the IP resources of the group (IP-for-SAPInst-ASCS and IP-for-SAPInst-CI) started, while SAPInst-ASCS00, SAPInst-DVEBMGS01 & SAPInst-D02 failed to start.
> Logs from /var/log/messages, where you can see that SAPInst-ASCS00 was not started by the cluster, are attached as when_SAPInst-ASCS00_failed.pdf.

the ocfs2 error message again

> Then I ran the cleanup:
>   crm resource cleanup SAPInst-ASCS00
> SAPInst-ASCS00 still remained stopped, and I got the messages attached as crm_resource_cleanup_SAPInst-ASCS00 part1.pdf.
> After 60 seconds I ran the cleanup on ASCS00 once again, without luck; SAPInst-ASCS00 remained stopped, and I got the messages attached as crm_resource_cleanup_SAPInst-ASCS00 part2.pdf.

I only see the successful start operation being issued, not that the start was successful.

> Then I tried to stop IP-for-SAPInst-CI, without luck, and got the messages attached as crm_resource_stop_IP-for-SAPInst-CI.pdf. Running the following command then stopped IP-for-SAPInst-CI:
>   crm resource cleanup IP-for-SAPInst-CI

Now I see the failed start of ASCS00, which is probably a follow-up from the ocfs2 error above.

> Likewise, I ran crm resource stop IP-for-SAPInst-ASCS without luck; IP-for-SAPInst-ASCS only stopped after running crm resource cleanup IP-for-SAPInst-ASCS.
>
> And when all member resources of grpSAPINSTresources were stopped, I ran the following command:
>   crm resource cleanup grpSAPINSTresources
>
> Running the above command started SAPInst-DVEBMGS01, which is not acceptable: it is the fourth resource in the group, and when the first three resources (IP-for-SAPInst-ASCS, IP-for-SAPInst-CI & SAPInst-ASCS00) were stopped, how could SAPInst-DVEBMGS01 be started?
>
> Then, I don't know how, I got 'grpSAPINSTresources' stopped, but when I subsequently tried to stop 'grpSAPDBresources' via crm resource stop grpSAPDBresources, it didn't stop, and I had to run crm resource cleanup to stop 'grpSAPDBresources'.
>
> Likewise, crm resource stop on clone-grpSharedFS, clone-grpDlmO2cb and the sbd_stonith resource didn't work until I ran crm resource cleanup on them.
>
> And when all of the resources were stopped, I tried to stop the openais service, and it took 20 minutes: I ran 'rcopenais stop' at 19:27:35, and it completed at 19:47:36.
>
> The hb_report output is attached. Also, for convenience, I have attached the 'issue.pdf' file that I already attached/uploaded to the SR website when I opened this SR.

initial test, ok:

Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-ASCS (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-CI (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-ASCS00 (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-DVEBMGS01 (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-D02 (Started sap-prd1 -> sap-prd2)

another switch off:

Dec  4 19:00:57 sap-prd1 corosync[7898]: [TOTEM ] A processor failed, forming new configuration.
Dec  4 19:01:00 sap-prd1 cib: [7904]: info: cib_stats: Processed 207 operations (1835.00us average, 0% utilization) in the last 10min
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] CLM CONFIGURATION CHANGE
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] New Configuration:
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] 	r(0) ip(192.168.10.216)
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] Members Left:
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] 	r(0) ip(192.168.10.217)

cluster moves resources:

Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-ASCS (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-CI (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-ASCS00 (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-DVEBMGS01 (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-D02 (Started sap-prd2 -> sap-prd1)

cluster fences:

Dec  4 19:01:01 sap-prd1 sbd: [22746]: info: sap-prd2 owns slot 1
Dec  4 19:01:01 sap-prd1 sbd: [22746]: info: Writing reset to node slot sap-prd2

ocfs2 problem hits PRD-ASCS00:

Dec  4 19:01:16 sap-prd1 SAPInstance[23036]: INFO: Starting SAP Instance PRD-ASCS00: 04.12.2011 19:01:16 Start OK
Dec  4 19:01:16 sap-prd1 kernel: [162440.684637] (sapstart,23439,0):ocfs2_truncate_file:457 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
Dec  4 19:01:16 sap-prd1 kernel: [162440.684643] (sapstart,23439,0):ocfs2_truncate_file:457 ERROR: Inode 558723, inode i_size = 878 != di i_size = 730, i_flags = 0x1

this won't end well:

Dec  4 19:06:12 sap-prd1 lrmd: [7905]: WARN: SAPInst-ASCS00:start process (PID 23036) timed out (try 1). Killing with signal SIGTERM (15).
another attempt after cleanup:

Dec  4 19:08:59 sap-prd1 lrmd: [7905]: info: rsc:SAPInst-ASCS00 start[85] (pid 31322)
Dec  4 19:08:59 sap-prd1 SAPInstance[31322]: INFO: Starting SAP Instance PRD-ASCS00: 04.12.2011 19:08:59 Start OK

fails again, probably because of the ocfs2 issue:

Dec  4 19:13:59 sap-prd1 lrmd: [7905]: WARN: SAPInst-ASCS00:start process (PID 31322) timed out (try 1). Killing with signal SIGTERM (15).
Dec  4 19:13:59 sap-prd1 lrmd: [7905]: WARN: operation start[85] on SAPInst-ASCS00 for client 7908: pid 31322 timed out

the cluster tries to force ASCS00 away again, but it has nowhere to go:

Dec  4 19:14:59 sap-prd1 pengine: [7907]: WARN: unpack_rsc_op: Processing failed op SAPInst-ASCS00_start_0 on sap-prd1: unknown error (1)
Dec  4 19:14:59 sap-prd1 pengine: [7907]: WARN: common_apply_stickiness: Forcing SAPInst-ASCS00 away from sap-prd1 after 1000000 failures (max=1000000)

the issue arises with SAPInst-D02:

Dec  4 19:16:49 sap-prd1 lrmd: [7905]: info: rsc:SAPInst-D02 probe[94] (pid 7627)
Dec  4 19:16:49 sap-prd1 SAPInstance[7627]: ERROR: SAP instance service disp+work is not running with status GRAY !

failed stop, which is fatal for that node:

Dec  4 19:21:49 sap-prd1 lrmd: [7905]: WARN: SAPInst-DVEBMGS01:stop process (PID 7710) timed out (try 1). Killing with signal SIGTERM (15).
Dec  4 19:21:49 sap-prd1 lrmd: [7905]: WARN: operation stop[95] on SAPInst-DVEBMGS01 for client 7908: pid 7710 timed out

and the cluster would like to fence, but the node cannot commit suicide:

Dec  4 19:22:49 sap-prd1 pengine: [7907]: WARN: stage6: Scheduling Node sap-prd1 for STONITH
Dec  4 19:22:49 sap-prd1 pengine: [7907]: WARN: native_stop_constraints: Stop of failed resource SAPInst-DVEBMGS01 is implicit after sap-prd1 is fenced

and that's why it takes until the Shutdown Escalation finally kicks in to stop:

Dec  4 19:47:35 sap-prd1 crmd: [7908]: ERROR: crm_timer_popped: Shutdown Escalation (I_STOP) just popped! (1200000ms)

So in theory the cluster behaved fine, but the resources failed.
If there is any failure, please stop the cluster stack on both nodes and install the PTF RPMs I sent you previously. Then run ldconfig again. Start the cluster stack and stop the Filesystems. Run fsck.ocfs2 again. Start the Filesystems and check again. If it still does not work, please provide another hb_report from that test.
</snip>

I ran fsck on the OCFS2 file systems and also installed the PTF RPMs, but still no luck. Please help/suggest.

--
Regards,
Muhammad Sharfuddin
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
