Hi,

I have what I believe is a pretty standard pacemaker/drbd setup (crm config pasted at the end), with a single resource group for the rabbitmq service, IP and filesystem, and a master-slave drbd resource. General failover tests seemed to work, with one exception: when I kill the rabbitmq service on the active node and prevent it from restarting, the service group migrates successfully from one node but not from the other. The problem appears to be caused by the following pair of location constraints on the group:

location tomtest_LOC_RULE_0 tomtest 10000: stratus17
location tomtest_LOC_RULE_1 tomtest 9999: stratus18

When the service is active on stratus18, the test passes and everything migrates successfully onto stratus17. When the service is active on stratus17, the service group is stopped and drbd is demoted on stratus17, but nothing is started on stratus18, and syslog shows a loop of tearing down and bringing up the drbd device connections. I am basically left with a secondary/secondary drbd setup and no service.

I can also see in syslog that no attempt was made to promote drbd on stratus18, and ptest shows the following score allocations when the cluster is in this state (full score allocation pasted below):

drbd_tomtest:1 promotion score on stratus18: -1
drbd_tomtest:0 promotion score on stratus17: -1

So I think this explains why a promotion never occurs at this point, until some manual intervention happens (a resource cleanup operation on the rabbitmq service resource, for example, clears the issue and brings everything back up).

I can fix this by setting the same score on the above location constraints, e.g. setting both to 10000. Once this is changed, migration occurs happily between both nodes for the restart failure scenario.

Is there documentation on the pacemaker score allocation algorithm, or can someone explain why the above occurs? I think I understand some of the basics, e.g. the STONITH resource scores below are pretty straightforward, and I understand that colocation and presumably group colour contribute to the final score. However, I don't know how the final set of native_color scores is arrived at.

Is there something fundamentally wrong with the way I have configured the constraints? You might wonder why there are location constraints for both the resource group and its members, but should that be causing issues?

Any insight appreciated.

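For reference, a sketch of the workaround described above, assuming nothing else in the configuration is touched. Only the score on the second group constraint differs from the config pasted below:

```
location tomtest_LOC_RULE_0 tomtest 10000: stratus17
location tomtest_LOC_RULE_1 tomtest 10000: stratus18
```

With equal scores the group no longer prefers stratus17 over stratus18. I have only observed that this makes the failover work in both directions; I do not know why it does.
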
Thanks,
Tom

crm configure show:

node stratus17
node stratus18
node stratus20
primitive STONITH0-stratus17 stonith:external/ipmi \
    params interface="lanplus" ipaddr="192.168.11.176" hostname="stratus17" userid="exds" passwd="[REDACTED]" \
    op start interval="0" timeout="300" \
    op stop interval="0" timeout="120" \
    op monitor interval="10" timeout="20"
primitive STONITH0-stratus18 stonith:external/ipmi \
    params interface="lanplus" ipaddr="192.168.11.177" hostname="stratus18" userid="exds" passwd="[REDACTED]" \
    op start interval="0" timeout="300" \
    op stop interval="0" timeout="120" \
    op monitor interval="10" timeout="20"
primitive STONITH0-stratus20 stonith:external/ipmi \
    params interface="lanplus" ipaddr="192.168.11.179" hostname="stratus20" userid="exds" passwd="[REDACTED]" \
    op start interval="0" timeout="300" \
    op stop interval="0" timeout="120" \
    op monitor interval="10" timeout="20"
primitive drbd_tomtest ocf:linbit:drbd \
    params drbd_resource="tomtest" \
    op start interval="0" timeout="240" \
    op stop interval="0" timeout="100" \
    op monitor interval="10" timeout="20" \
    op monitor interval="20" timeout="20"
primitive tomtest_FS ocf:heartbeat:Filesystem \
    params device="/dev/drbd/by-res/tomtest" directory="/mnt/cinder-rmq-mnt" fstype="ext4" \
    op start interval="0" timeout="60" \
    op stop interval="0" timeout="60" \
    op monitor interval="10" timeout="20"
primitive tomtest_VIP ocf:heartbeat:IPaddr2 \
    params ip="192.168.185.150" \
    op start interval="0" timeout="20" \
    op stop interval="0" timeout="20" \
    op monitor interval="10" timeout="20"
primitive tomtest_rabbitmq_SERVICE ocf:rabbitmq:rabbitmq-server \
    params ip="192.168.185.150" mnesia_base="/mnt/cinder-rmq-mnt" \
    op start interval="0" timeout="600" \
    op stop interval="0" timeout="200" \
    op monitor interval="20" timeout="30" \
    meta target-role="Started"
group tomtest tomtest_VIP tomtest_FS tomtest_rabbitmq_SERVICE
ms ms_drbd_tomtest drbd_tomtest \
    meta master-node-max="1" clone-max="2" clone-node-max="1" master-max="1" notify="true" target-role="Master"
location STONITH0-stratus17_LOC_RULE_0 STONITH0-stratus17 10000: stratus18
location STONITH0-stratus17_LOC_RULE_1 STONITH0-stratus17 9999: stratus20
location STONITH0-stratus18_LOC_RULE_0 STONITH0-stratus18 10000: stratus17
location STONITH0-stratus18_LOC_RULE_1 STONITH0-stratus18 9999: stratus20
location STONITH0-stratus20_LOC_RULE_0 STONITH0-stratus20 10000: stratus17
location STONITH0-stratus20_LOC_RULE_1 STONITH0-stratus20 9999: stratus18
location ms_drbd_tomtest_LOC_RULE_0 ms_drbd_tomtest 10000: stratus17
location ms_drbd_tomtest_LOC_RULE_1 ms_drbd_tomtest 9999: stratus18
location tomtest_FS_LOC_RULE_0 tomtest_FS 10000: stratus17
location tomtest_FS_LOC_RULE_1 tomtest_FS 9999: stratus18
location tomtest_LOC_RULE_0 tomtest 10000: stratus17
location tomtest_LOC_RULE_1 tomtest 9999: stratus18
location tomtest_VIP_LOC_RULE_0 tomtest_VIP 10000: stratus17
location tomtest_VIP_LOC_RULE_1 tomtest_VIP 9999: stratus18
location tomtest_rabbitmq_SERVICE_LOC_RULE_0 tomtest_rabbitmq_SERVICE 10000: stratus17
location tomtest_rabbitmq_SERVICE_LOC_RULE_1 tomtest_rabbitmq_SERVICE 9999: stratus18
colocation tomtest_COLOCATION inf: tomtest_VIP tomtest_FS tomtest_rabbitmq_SERVICE
colocation tomtest_on_ms_drbd_tomtest_COLOCATION inf: tomtest ms_drbd_tomtest:Master
order tomtest_ORDER inf: tomtest_VIP tomtest_FS tomtest_rabbitmq_SERVICE
order tomtest_after_ms_drbd_tomtest_ORDER inf: ms_drbd_tomtest:promote tomtest:start
property $id="cib-bootstrap-options" \
    dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="3" \
    stonith-enabled="true" \
    symmetric-cluster="false" \
    last-lrm-refresh="1376434483"
rsc_defaults $id="rsc-options" \

[Allocation scores when cluster is in bad state, failing to promote drbd anywhere]:

Allocation scores:
native_color: STONITH0-stratus17 allocation score on stratus18: 10100
native_color: STONITH0-stratus17 allocation score on stratus20: 9999
native_color: STONITH0-stratus18 allocation score on stratus17: 10100
native_color: STONITH0-stratus18 allocation score on stratus20: 9999
native_color: STONITH0-stratus20 allocation score on stratus17: 10100
native_color: STONITH0-stratus20 allocation score on stratus18: 9999
group_color: tomtest allocation score on stratus17: 10000
group_color: tomtest allocation score on stratus18: 9999
group_color: tomtest_VIP allocation score on stratus17: 20000
group_color: tomtest_VIP allocation score on stratus18: 19998
group_color: tomtest_FS allocation score on stratus17: 10000
group_color: tomtest_FS allocation score on stratus18: 9999
group_color: tomtest_rabbitmq_SERVICE allocation score on stratus17: -INFINITY
group_color: tomtest_rabbitmq_SERVICE allocation score on stratus18: 9999
group_color: tomtest_rabbitmq_SERVICE allocation score on stratus20: -INFINITY
clone_color: ms_drbd_tomtest allocation score on stratus17: 1
clone_color: ms_drbd_tomtest allocation score on stratus18: 89991
clone_color: ms_drbd_tomtest allocation score on stratus20: -INFINITY
clone_color: drbd_tomtest:0 allocation score on stratus17: 10000
clone_color: drbd_tomtest:0 allocation score on stratus18: 11099
clone_color: drbd_tomtest:0 allocation score on stratus20: -INFINITY
clone_color: drbd_tomtest:1 allocation score on stratus17: 11100
clone_color: drbd_tomtest:1 allocation score on stratus18: 9999
clone_color: drbd_tomtest:1 allocation score on stratus20: -INFINITY
native_color: drbd_tomtest:1 allocation score on stratus17: -INFINITY
native_color: drbd_tomtest:1 allocation score on stratus18: 89991
native_color: drbd_tomtest:1 allocation score on stratus20: -INFINITY
native_color: drbd_tomtest:0 allocation score on stratus17: 50000
native_color: drbd_tomtest:0 allocation score on stratus18: -INFINITY
native_color: drbd_tomtest:0 allocation score on stratus20: -INFINITY
drbd_tomtest:2 promotion score on none: 0
drbd_tomtest:1 promotion score on stratus18: -1
drbd_tomtest:0 promotion score on stratus17: -1
native_color: tomtest_VIP allocation score on stratus17: -INFINITY
native_color: tomtest_VIP allocation score on stratus18: -INFINITY
native_color: tomtest_FS allocation score on stratus17: -INFINITY
native_color: tomtest_FS allocation score on stratus18: -INFINITY
native_color: tomtest_rabbitmq_SERVICE allocation score on stratus17: -INFINITY
native_color: tomtest_rabbitmq_SERVICE allocation score on stratus18: -INFINITY
native_color: tomtest_rabbitmq_SERVICE allocation score on stratus20: -INFINITY

[Allocation scores when cluster is in good state, stratus18 is currently active]:

Allocation scores:
native_color: STONITH0-stratus17 allocation score on stratus18: 10100
native_color: STONITH0-stratus17 allocation score on stratus20: 9999
native_color: STONITH0-stratus18 allocation score on stratus17: 10100
native_color: STONITH0-stratus18 allocation score on stratus20: 9999
native_color: STONITH0-stratus20 allocation score on stratus17: 10100
native_color: STONITH0-stratus20 allocation score on stratus18: 9999
group_color: tomtest allocation score on stratus17: 10000
group_color: tomtest allocation score on stratus18: 10000
group_color: tomtest_VIP allocation score on stratus17: 20000
group_color: tomtest_VIP allocation score on stratus18: 20099
group_color: tomtest_FS allocation score on stratus17: 10000
group_color: tomtest_FS allocation score on stratus18: 10099
group_color: tomtest_rabbitmq_SERVICE allocation score on stratus17: -INFINITY
group_color: tomtest_rabbitmq_SERVICE allocation score on stratus18: 10099
group_color: tomtest_rabbitmq_SERVICE allocation score on stratus20: -INFINITY
clone_color: ms_drbd_tomtest allocation score on stratus17: 1
clone_color: ms_drbd_tomtest allocation score on stratus18: 90692
clone_color: ms_drbd_tomtest allocation score on stratus20: -INFINITY
clone_color: drbd_tomtest:0 allocation score on stratus17: 10000
clone_color: drbd_tomtest:0 allocation score on stratus18: 20099
clone_color: drbd_tomtest:0 allocation score on stratus20: -INFINITY
clone_color: drbd_tomtest:1 allocation score on stratus17: 20100
clone_color: drbd_tomtest:1 allocation score on stratus18: 9999
clone_color: drbd_tomtest:1 allocation score on stratus20: -INFINITY
native_color: drbd_tomtest:0 allocation score on stratus17: -INFINITY
native_color: drbd_tomtest:0 allocation score on stratus18: 100792
native_color: drbd_tomtest:0 allocation score on stratus20: -INFINITY
native_color: drbd_tomtest:1 allocation score on stratus17: 60100
native_color: drbd_tomtest:1 allocation score on stratus18: -INFINITY
native_color: drbd_tomtest:1 allocation score on stratus20: -INFINITY
drbd_tomtest:0 promotion score on stratus18: 181385
drbd_tomtest:1 promotion score on stratus17: 1
drbd_tomtest:2 promotion score on none: 0
native_color: tomtest_VIP allocation score on stratus17: -INFINITY
native_color: tomtest_VIP allocation score on stratus18: 181485
native_color: tomtest_FS allocation score on stratus17: -INFINITY
native_color: tomtest_FS allocation score on stratus18: 30297
native_color: tomtest_rabbitmq_SERVICE allocation score on stratus17: -INFINITY
native_color: tomtest_rabbitmq_SERVICE allocation score on stratus18: 10099
native_color: tomtest_rabbitmq_SERVICE allocation score on stratus20: -INFINITY