It's a two-node cluster with online updates. grpSAPINSTresources is a group that runs on node2, while grpSAPDBresources is another group that runs on node1. clone-grpDlmO2cb and clone-grpSharedFS are clone resources that run on both nodes and are prerequisites for grpSAPDBresources and grpSAPINSTresources (via colocation and order rules).
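(For context, the dependency wiring I mean is roughly of the following shape in the crm shell. This is a sketch only: the constraint IDs below are made up for illustration, not copied from my actual CIB.)

```
# Sketch: DLM/O2CB clone must run before the shared-FS clone,
# and both groups need the shared FS on their node.
order ord-dlm-before-fs inf: clone-grpDlmO2cb clone-grpSharedFS
colocation col-fs-with-dlm inf: clone-grpSharedFS clone-grpDlmO2cb
order ord-fs-before-db inf: clone-grpSharedFS grpSAPDBresources
colocation col-db-with-fs inf: grpSAPDBresources clone-grpSharedFS
order ord-fs-before-inst inf: clone-grpSharedFS grpSAPINSTresources
colocation col-inst-with-fs inf: grpSAPINSTresources clone-grpSharedFS
```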
When both nodes are online this cluster behaves normally, but when I reboot (or `killall -9 corosync`) either node, say node2 (which runs grpSAPINSTresources), the cluster crashes. I opened a support ticket with Novell, and according to the Support Engineer the cluster misbehaves due to the OCFS2 file system, i.e. OCFS2 has issues. Novell Support provided me with the following RPMs:

libdlm3-3.00.01-0.10.6.3527.1.PTF.707049.x86_64
libdlm-3.00.01-0.10.6.3527.1.PTF.707049.x86_64
ocfs2-kmp-default-1.6_2.6.32.45_0.3-0.4.2.1.3431.10.PTF.712568.x86_64

and asked me to run fsck on the OCFS2 devices. Per their instructions, I installed the RPMs and ran fsck on the OCFS2 file systems... but no luck ;( I am copying the Support Reply below for your kind consideration. Please suggest/recommend.

<snip>
>
> As I already informed you, 'grpSAPINSTresources' is a group of the following primitive resources:
>   i   IP-for-SAPInst-ASCS  ocf:heartbeat:IPaddr2
>   ii  IP-for-SAPInst-CI    ocf:heartbeat:IPaddr2
>   iii SAPInst-ASCS00       ocf:heartbeat:SAPInstance
>   iv  SAPInst-DVEBMGS01    ocf:heartbeat:SAPInstance
>   v   SAPInst-D02          ocf:heartbeat:SAPInstance
>
> and 'grpSAPDBresources' is a group of the following primitive resources:
>   i   IP-for-ORADB   ocf:heartbeat:IPaddr2
>   ii  FS-ORACLE      ocf:heartbeat:Filesystem
>   iii SAPDBInstance  ocf:heartbeat:SAPDatabase
>
> Due to the following location rules, 'grpSAPDBresources' always starts on node sap-prd1, while 'grpSAPINSTresources' always starts on node sap-prd2:
>   location PrimaryLoc-of-grpSAPDBresources grpSAPDBresources +inf: sap-prd1
>   location PrimaryLoc-of-grpSAPINSTresources grpSAPINSTresources +inf: sap-prd2
>
> The cluster was running fine today, i.e. grpSAPDBresources was running on node 'sap-prd1' and grpSAPINSTresources on node 'sap-prd2'. Then, as a test, I rebooted node 'sap-prd2'; within 50 seconds grpSAPINSTresources started on node 'sap-prd1', where 'grpSAPDBresources' was already running.
> When sap-prd2 came back (after the reboot), grpSAPINSTresources left sap-prd1 and moved to its preferred location, i.e. node 'sap-prd2'.

ok, this one seems to have worked.

> After doing the above exercise I once again rebooted node sap-prd2, and this time, as usual, grpSAPINSTresources tried to start on node 'sap-prd1'... but failed. Only the IP resources of the group (IP-for-SAPInst-ASCS and IP-for-SAPInst-CI) started, while SAPInst-ASCS00, SAPInst-DVEBMGS01 & SAPInst-D02 failed to start.
> Logs from /var/log/messages, where you can see that SAPInst-ASCS00 was not started by the cluster, are attached as when_SAPInst-ASCS00_failed.pdf.

the ocfs2 error message again

> Then I ran the cleanup:
>   crm resource cleanup SAPInst-ASCS00
> SAPInst-ASCS00 still remained stopped, and I got the messages attached as crm_resource_cleanup_SAPInst-ASCS00 part1.pdf.
> After 60 seconds I ran the cleanup on ASCS00 once again, without luck; SAPInst-ASCS00 remained stopped, and I got the messages attached as crm_resource_cleanup_SAPInst-ASCS00 part2.pdf.

I only see the successful start operation being issued, not that the start was successful.

> Then I tried to stop IP-for-SAPInst-CI, without luck, and got the messages attached as crm_resource_stop_IP-for-SAPInst-CI.pdf. Running the following command then stopped IP-for-SAPInst-CI:
>   crm resource cleanup IP-for-SAPInst-CI

Now I see the failed start of ASCS00, which is probably a follow-up from the ocfs2 error above.

> Likewise, I ran crm resource stop IP-for-SAPInst-ASCS without luck; IP-for-SAPInst-ASCS only stopped after running crm resource cleanup IP-for-SAPInst-ASCS.
>
> And when all member resources of grpSAPINSTresources were stopped, I ran the following command:
>   crm resource cleanup grpSAPINSTresources
>
> Running the above command started SAPInst-DVEBMGS01, which is not acceptable: it is the fourth resource in the group, and when the first three resources (IP-for-SAPInst-ASCS, IP-for-SAPInst-CI & SAPInst-ASCS00) were stopped, how could SAPInst-DVEBMGS01 be started?
>
> Then, I don't know how, I got 'grpSAPINSTresources' stopped, but when I subsequently tried to stop 'grpSAPDBresources' via crm resource stop grpSAPDBresources, it didn't stop, and I had to run crm resource cleanup to stop 'grpSAPDBresources'.
>
> Likewise, crm resource stop on clone-grpSharedFS, clone-grpDlmO2cb and the sbd_stonith resource didn't work until I ran crm resource cleanup on them.
>
> And when all of the resources were stopped, I tried to stop the openais service, and it took 20 minutes: I ran 'rcopenais stop' at 19:27:35, and it completed at 19:47:36.
>
> The hb_report output is attached. Also, for convenience, I have attached the 'issue.pdf' file that I already attached/uploaded to the SR website when I opened this SR.

initial test, ok:

Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-ASCS (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-CI (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-ASCS00 (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-DVEBMGS01 (Started sap-prd1 -> sap-prd2)
Dec  4 18:56:31 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-D02 (Started sap-prd1 -> sap-prd2)

another switch off:

Dec  4 19:00:57 sap-prd1 corosync[7898]: [TOTEM ] A processor failed, forming new configuration.
Dec  4 19:01:00 sap-prd1 cib: [7904]: info: cib_stats: Processed 207 operations (1835.00us average, 0% utilization) in the last 10min
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] CLM CONFIGURATION CHANGE
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] New Configuration:
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] 	r(0) ip(192.168.10.216)
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] Members Left:
Dec  4 19:01:01 sap-prd1 corosync[7898]: [CLM ] 	r(0) ip(192.168.10.217)

cluster moves resources:

Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-ASCS (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move IP-for-SAPInst-CI (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-ASCS00 (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-DVEBMGS01 (Started sap-prd2 -> sap-prd1)
Dec  4 19:01:01 sap-prd1 pengine: [7907]: notice: LogActions: Move SAPInst-D02 (Started sap-prd2 -> sap-prd1)

cluster fences:

Dec  4 19:01:01 sap-prd1 sbd: [22746]: info: sap-prd2 owns slot 1
Dec  4 19:01:01 sap-prd1 sbd: [22746]: info: Writing reset to node slot sap-prd2

ocfs2 problem hits PRD-ASCS00:

Dec  4 19:01:16 sap-prd1 SAPInstance[23036]: INFO: Starting SAP Instance PRD-ASCS00: 04.12.2011 19:01:16 Start OK
Dec  4 19:01:16 sap-prd1 kernel: [162440.684637] (sapstart,23439,0):ocfs2_truncate_file:457 ERROR: bug expression: le64_to_cpu(fe->i_size) != i_size_read(inode)
Dec  4 19:01:16 sap-prd1 kernel: [162440.684643] (sapstart,23439,0):ocfs2_truncate_file:457 ERROR: Inode 558723, inode i_size = 878 != di i_size = 730, i_flags = 0x1

this won't end well:

Dec  4 19:06:12 sap-prd1 lrmd: [7905]: WARN: SAPInst-ASCS00:start process (PID 23036) timed out (try 1). Killing with signal SIGTERM (15).
another attempt after cleanup:

Dec  4 19:08:59 sap-prd1 lrmd: [7905]: info: rsc:SAPInst-ASCS00 start[85] (pid 31322)
Dec  4 19:08:59 sap-prd1 SAPInstance[31322]: INFO: Starting SAP Instance PRD-ASCS00: 04.12.2011 19:08:59 Start OK

fails again, probably because of the ocfs2 issue:

Dec  4 19:13:59 sap-prd1 lrmd: [7905]: WARN: SAPInst-ASCS00:start process (PID 31322) timed out (try 1). Killing with signal SIGTERM (15).
Dec  4 19:13:59 sap-prd1 lrmd: [7905]: WARN: operation start[85] on SAPInst-ASCS00 for client 7908: pid 31322 timed out

the cluster tries to force ASCS00 away again, but it has nowhere to go:

Dec  4 19:14:59 sap-prd1 pengine: [7907]: WARN: unpack_rsc_op: Processing failed op SAPInst-ASCS00_start_0 on sap-prd1: unknown error (1)
Dec  4 19:14:59 sap-prd1 pengine: [7907]: WARN: common_apply_stickiness: Forcing SAPInst-ASCS00 away from sap-prd1 after 1000000 failures (max=1000000)

the issue arises with SAPInst-D02:

Dec  4 19:16:49 sap-prd1 lrmd: [7905]: info: rsc:SAPInst-D02 probe[94] (pid 7627)
Dec  4 19:16:49 sap-prd1 SAPInstance[7627]: ERROR: SAP instance service disp+work is not running with status GRAY !

failed stop, which is fatal for that node:

Dec  4 19:21:49 sap-prd1 lrmd: [7905]: WARN: SAPInst-DVEBMGS01:stop process (PID 7710) timed out (try 1). Killing with signal SIGTERM (15).
Dec  4 19:21:49 sap-prd1 lrmd: [7905]: WARN: operation stop[95] on SAPInst-DVEBMGS01 for client 7908: pid 7710 timed out

and the cluster would like to fence, but the node cannot commit suicide:

Dec  4 19:22:49 sap-prd1 pengine: [7907]: WARN: stage6: Scheduling Node sap-prd1 for STONITH
Dec  4 19:22:49 sap-prd1 pengine: [7907]: WARN: native_stop_constraints: Stop of failed resource SAPInst-DVEBMGS01 is implicit after sap-prd1 is fenced

and that's why it takes until the Shutdown Escalation finally kicks in to stop:

Dec  4 19:47:35 sap-prd1 crmd: [7908]: ERROR: crm_timer_popped: Shutdown Escalation (I_STOP) just popped! (1200000ms)

So in theory the cluster behaved fine, but the resources failed.
If there is any failure, please stop the cluster stack on both nodes and install the PTF RPMs I sent you previously. Then run ldconfig again. Start the cluster stack and stop the Filesystems. Run fsck.ocfs2 again. Start the Filesystems and check again. If it still does not work, please provide another hb_report from that test.
</snip>

I ran fsck on the OCFS2 file systems and also installed the PTF RPMs, but still no luck. Please help/suggest.

--
Regards,
Muhammad Sharfuddin
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
