Hi, Thanks. So I have a robustness pb with Pacemaker/corosync ... you'll tell me if it seems normal or not , if I miss something or not : for example on two nodes, I make a script which configure 60 resnames with ocf pacemaker Dummy script, (30 with INFINITY location on node1 and 30 with INFINITY location on node2) and my test consists in a loop of [ 60 "crm resource start <resname>", sleep 60, 60 "crm resource stop <resname>", sleep 60 ] and so on ...
After 2 or 3 successful loops, pacemaker systematically fails on one node (meaning crm_mon does does connect anymore) and this node is fenced by the other one. I have tried several times and it is systematic. On the alive node, my script test displays : STOP LOOP : Number 3 Stop resname1 *Call cib_replace failed (-41): Remote node did not respond* <null> ERROR on Stop resname1 and when the fenced node is rebooted and I look in syslog, I only have these lines before the first boot Linux line : 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [4557]: debug: rsc:resname22:335: monitor 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [12818]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [4557]: debug: rsc:resname20:334: monitor 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [4557]: debug: rsc:resname18:333: monitor 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [12819]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER 1291896079 2010 Dec 9 13:01:19 node2 daemon debug lrmd [12820]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER 1291896079 2010 Dec 9 13:01:19 node2 daemon debug Dummy DEBUG: resname22 monitor : 0 1291896079 2010 Dec 9 13:01:19 node2 daemon debug Dummy DEBUG: resname18 monitor : 0 1291896079 2010 Dec 9 13:01:19 node2 daemon debug Dummy DEBUG: resname20 monitor : 0 1291896080 2010 Dec 9 13:01:20 node2 daemon debug lrmd [4557]: debug: rsc:resname60:373: monitor 1291896080 2010 Dec 9 13:01:20 node2 daemon debug lrmd [12839]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER 1291896080 2010 Dec 9 13:01:20 node2 daemon debug Dummy DEBUG: resname60 monitor : 0 1291896082 2010 Dec 9 13:01:22 node2 daemon err corosync [TOTEM ] FAILED TO RECEIVE 1291896082 2010 Dec 9 13:01:22 node2 daemon debug corosync [TOTEM ] entering GATHER state from 6. 1291896328 2010 Dec 9 13:05:28 node2 syslog notice syslog-ng syslog-ng starting up; version='3.0.3' 1291896328 2010 Dec 9 13:05:28 node2 kern info kernel Initializing cgroup subsys cpuset 1291896328 2010 Dec 9 13:05:28 node2 kern info kernel Initializing cgroup subsys cpu 1291896328 2010 Dec 9 13:05:28 node2 kern notice kernel Linux version 2.6.32-30.el6. .... etc. Releases : corosync-1.2.3-21.el6.x86_64 pacemaker-1.1.2-2.el6.x86_64 Alain _______________________________________________ Linux-HA mailing list [email protected] http://lists.linux-ha.org/mailman/listinfo/linux-ha See also: http://linux-ha.org/ReportingProblems
