Hi,

I'm once again experiencing what is, in my humble opinion, strange behaviour, or rather strange decision-making, by Pacemaker. I hope that someone can either enlighten me a little about it (its intention and/or a possible misconfiguration on my side) or confirm it as a possible bug.

Basically, I have a two-node cluster with cloned DLM, O2CB, DRBD and mount resources, plus a MySQL resource (grouped with an IPaddr resource) running on top of them. The MySQL resource (or group) depends on the mount resource, which depends equally on both the DRBD and the O2CB resources; the O2CB resource in turn depends on the DLM resource.
cloneDlm -> cloneO2cb -\
                       }-> cloneMountMysql -> mysql / grpMysql( mysql -> ipMysql )
msDrbdMysql -----------/
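To make my expectation explicit, here is a small sketch (POSIX sh + awk; the helper name `dependents` is hypothetical) that walks the dependency chain drawn above and lists everything that depends, directly or transitively, on a given resource. It only models the chain as drawn, not Pacemaker's actual placement logic.

```shell
# Hypothetical helper: reads "first then" dependency pairs on stdin
# (one pair per line) and prints every resource that depends, directly
# or transitively, on resource $1 (breadth-first walk).
dependents() {
    awk -v root="$1" '
        { succ[$1] = succ[$1] " " $2 }      # adjacency list: first -> then
        END {
            seen[root] = 1
            queue[1] = root; head = 1; tail = 1
            while (head <= tail) {
                cur = queue[head++]
                n = split(succ[cur], nxt, " ")
                for (i = 1; i <= n; i++) {
                    if (!(nxt[i] in seen)) {
                        seen[nxt[i]] = 1
                        queue[++tail] = nxt[i]
                        print nxt[i]
                    }
                }
            }
        }'
}

# The chain from the diagram above.
edges='cloneDlm cloneO2cb
cloneO2cb cloneMountMysql
msDrbdMysql cloneMountMysql
cloneMountMysql grpMysql'
```

With these edges, `printf '%s\n' "$edges" | dependents grpMysql` prints nothing, while `dependents msDrbdMysql` prints cloneMountMysql and grpMysql; i.e. by the dependencies alone, a failure of mysql gives no reason to touch the DRBD resource.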
Furthermore, for the MySQL resource (or group) I set the meta attributes "migration-threshold=1" and "failure-timeout=90" (later I also tried the values "3" and "130", respectively).

Now I injected a failure into mysql using "crm_resource -F -r mysql -H <node>", expecting that only mysql, or rather its group (I tested both configurations, with the same result), would be stopped and moved over to the other node. But in fact, not only mysql/grpMysql was stopped, but also the mount and even the DRBD resources. Upon restarting them, the DRBD resource was left as slave (so, of course, the mount wasn't allowed to restart either) and, back before I set cluster-recheck-interval=2m, didn't even seem to try to promote back to master (I didn't wait for cluster-recheck-interval's default of 15m).
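To make such test rounds repeatable, a minimal sketch of a helper (the name `fail_round_cmds` is hypothetical) that only prints, and does not execute, the commands for one round against a resource on a node, so they can be reviewed and then pasted one at a time. `crm_resource -F` is the injection used above; `crm_mon -1 -f` (one-shot status including fail counts) and `crm_resource -C` (cleanup) are, as far as I know, available in these Pacemaker versions, but check `--help` on yours.

```shell
# Hypothetical helper: prints (does not execute) the commands for one
# failure-injection round against resource $1 on node $2.
fail_round_cmds() {
    rsc=$1 node=$2
    echo "crm_resource -F -r $rsc -H $node"   # inject a failure, as above
    echo "crm_mon -1 -f"                      # one-shot status with fail counts
    echo "crm_resource -C -r $rsc -H $node"   # clean up the resource's status on that node
}
```

For example, `fail_round_cmds mysql nde35` prints the injection command used in the test below as its first line.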

Through a lot of testing, I have found out that:
a) the underlying resources are stopped/restarted only when the fail-count reaches the limit set by migration-threshold; i.e. with migration-threshold=3, on the first 2 failures only mysql/grpMysql is restarted on the same node, and only on the 3rd one are the underlying resources left in a mess (while mysql/grpMysql migrates). This is reproducible for DRBD; I'm unsure about the DLM/O2CB side, but there is sometimes serious trouble there too after a failure was injected into mysql, I just couldn't definitively link it yet.
b) upon causing mysql/grpMysql's migration, the score for msDrbdMysql:promote changes from 10020 to -INFINITY and stays there for the duration of mysql/grpMysql's failure-timeout (confirmed by also setting it to 130) before it rises back up to 10000.
c) msDrbdMysql remains slave until the next cluster recheck after its promotion score has gone back up to 10000.
d) I also have the impression that fail-counts don't get reset after their failure-timeout: with migration-threshold=3, these issues occur upon every(!) subsequent injected failure, even when I've waited for nearly 5 minutes (with failure-timeout=90) without touching the cluster at all.
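One possible reading of c), assuming the documented semantics as I understand them: once the fail-count on a node reaches migration-threshold, the resource is forced away from that node; a failure only counts as expired after failure-timeout seconds; and since the cluster is event-driven, an expired failure is only acted upon when the policy engine next runs, which without any other event happens at cluster-recheck-interval. The following POSIX sh sketch models just that arithmetic (function names are hypothetical; this is not Pacemaker's policy-engine code).

```shell
# Simplified model of migration-threshold / failure-timeout semantics
# (an assumption for illustration, NOT Pacemaker's actual code).

# Prints "move" once the fail-count on a node reaches migration-threshold,
# otherwise "stay" (the resource is only restarted in place).
#   $1 = fail-count on the node, $2 = migration-threshold (0 = unset)
placement() {
    if [ "$2" -gt 0 ] && [ "$1" -ge "$2" ]; then
        echo move
    else
        echo stay
    fi
}

# Prints "expired" if at least failure-timeout seconds lie between the last
# failure and the moment the policy engine evaluates it, else "active".
#   $1 = time of last failure, $2 = time of evaluation (seconds),
#   $3 = failure-timeout in seconds (0 = failures never expire)
failure_state() {
    if [ "$3" -gt 0 ] && [ $(($2 - $1)) -ge "$3" ]; then
        echo expired
    else
        echo active
    fi
}

# Seconds after the failure until the cluster itself first evaluates an
# expired failure, ASSUMING only the cluster-recheck-interval timer
# triggers the policy engine after the failure.
#   $1 = failure-timeout, $2 = cluster-recheck-interval (both in seconds)
first_effective_expiry() {
    k=$(( ($1 + $2 - 1) / $2 ))   # ceil(failure-timeout / recheck-interval)
    echo $((k * $2))
}
```

For example, `first_effective_expiry 90 120` prints 120 (failure-timeout=90 with cluster-recheck-interval=2m), whereas with the 5m interval from the first configuration below, `first_effective_expiry 90 300` prints 300.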

I experienced this on both test clusters, a SLES 11 HAE SP1 with Pacemaker 1.1.2 and a Debian Squeeze with Pacemaker 1.0.9. When migration-threshold is removed from mysql/grpMysql, everything is fine (except that there is no migration, of course). I can't remember this happening with SLES 11 HAE SP0's Pacemaker 1.0.6.

I'd really appreciate any comments on, or enlightenment about, what's going on here. (-;


P.S.: Just for fun (and for testing and proof), I also constrained grpLdirector to cloneMountShared... and could reproduce the problem perfectly with its then-underlying resources too.

================================================================================

2) mysql: meta migration-threshold=1 failure-timeout=130 -> drbd:promote becomes possible again (score-wise) only after 130 seconds
nde34:~ # nd=nde35;cl=1;failcmd="crm_resource -F -r mysql -H $nd" ; date ; ptest -sL | grep "drbdMysql:$cl promotion score on $nd" ; date ; echo $failcmd; $failcmd ; date ; ptest -sL | grep "drbdMysql:$cl promotion score on $nd" ; sleep 85 ; while [ true ]; do date ; ptest -sL | grep "drbdMysql:$cl promotion score on $nd" ; sleep 5; done
Wed Aug 11 15:33:04 CEST 2010
drbdMysql:1 promotion score on nde35: 10020
drbdMysql:1 promotion score on nde35: INFINITY
drbdMysql:1 promotion score on nde35: INFINITY
Wed Aug 11 15:33:04 CEST 2010
crm_resource -F -r mysql -H nde35
Wed Aug 11 15:33:05 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:34:31 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
[...]
Wed Aug 11 15:35:11 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:35:16 CEST 2010
drbdMysql:1 promotion score on nde35: 10000
drbdMysql:1 promotion score on nde35: INFINITY
drbdMysql:1 promotion score on nde35: INFINITY
^C
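The loop above prints several score lines on every pass, although only the first one changes (10020, then -INFINITY, then 10000). A quieter variant, sketched in POSIX sh + awk and assuming the `ptest -sL` line format shown above: a hypothetical filter `promotion_score` that extracts only the first reported score for one clone instance on one node, plus a commented-out loop (it needs a live cluster) that logs only changes.

```shell
# Hypothetical filter: reads "ptest -sL" output on stdin and prints the
# first promotion score reported for clone instance $1 (e.g. drbdMysql:1)
# on node $2 (e.g. nde35); prints nothing if there is no such line.
promotion_score() {
    awk -v prefix="$1 promotion score on $2: " '
        index($0, prefix) == 1 { print $NF; exit }'
}

# Hypothetical usage on a live cluster: log only when the score changes.
#   prev=
#   while true; do
#       cur=$(ptest -sL | promotion_score drbdMysql:1 nde35)
#       [ "$cur" != "$prev" ] && echo "$(date) $cur"
#       prev=$cur
#       sleep 5
#   done
```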


node nde34 \
        attributes standby="off"
node nde35 \
        attributes standby="off"
primitive apache ocf:cj:apache \
        params monitor_url="http://localhost:8080/opencms/opencms/test/cluster.html" \
                log_level="warn" agent_timebuffer="1000" stopith_killall_enabled="1" \
        op monitor interval="10" timeout="15" start-delay="15" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="120"
primitive dlm ocf:pacemaker:controld \
        op monitor interval="10" timeout="20" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
primitive drbdMysql ocf:linbit:drbd \
        params drbd_resource="mysql" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op notify interval="0" timeout="90"
primitive drbdOpencms ocf:linbit:drbd \
        params drbd_resource="opencms" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op notify interval="0" timeout="90"
primitive drbdShared ocf:linbit:drbd \
        params drbd_resource="wt-cluster" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op notify interval="0" timeout="90"
primitive ipLdirector ocf:heartbeat:IPaddr2 \
        params lvs_support="true" ip="192.168.103.73" cidr_netmask="24" \
                broadcast="2.255.255.255" \
        op monitor interval="5"
primitive ipMysql ocf:heartbeat:IPaddr \
        params ip="192.168.103.74" cidr_netmask="255.255.255.0" \
        op monitor interval="2" timeout="20" \
        op start interval="0" timeout="90"
primitive ldirector ocf:heartbeat:ldirectord \
        params configfile="/etc/ha.d/ldirectord.cf" \
                ldirectord="/usr/sbin/ldirectord" \
        op monitor interval="20" timeout="10" \
        op start interval="0" timeout="15" \
        op stop interval="0" timeout="15"
primitive mountMysql ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ocfs2" \
        op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive mountOpencms ocf:heartbeat:Filesystem \
        params device="/dev/drbd1" directory="/srv/tomcat6/webapps/opencms" \
                fstype="ocfs2" \
        op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive mountShared ocf:heartbeat:Filesystem \
        params device="/dev/drbd2" directory="/opt/wt-cluster" fstype="ocfs2" \
        op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive mysql ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/var/lib/mysql/my.cnf" \
                pid="/var/run/mysql/mysqld.pid" socket="/var/lib/mysql/mysql.sock" \
                test_table="test.HA_checkAvailability" test_user="HAmonUser" \
                test_passwd="HAmonPW" \
        op monitor interval="10" timeout="30" OCF_CHECK_LEVEL="1" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120"
primitive o2cb ocf:ocfs2:o2cb \
        op monitor interval="10" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
primitive tomcat ocf:cj:tomcat \
        params monitor_url="http://localhost:8080/opencms/opencms/test/cluster.html" \
                log_level="warn" agent_timebuffer="1000" stopith_killall_enabled="1" \
        op monitor interval="10" timeout="15" start-delay="15" \
        op start interval="0" timeout="30" \
        op stop interval="0" timeout="120"
group grpLdirector ldirector ipLdirector \
        meta migration-threshold="1" failure-timeout="60"
group grpMysql mysql ipMysql \
        meta migration-threshold="2" failure-timeout="90"
ms msDrbdMysql drbdMysql \
        meta resource-stickiness="100" notify="true" master-max="2"
ms msDrbdOpencms drbdOpencms \
        meta resource-stickiness="100" notify="true" master-max="2"
ms msDrbdShared drbdShared \
        meta resource-stickiness="100" notify="true" master-max="2"
clone cloneApache apache
clone cloneDlm dlm \
        meta globally-unique="false" interleave="true"
clone cloneMountMysql mountMysql \
        meta interleave="true" globally-unique="false" target-role="Started"
clone cloneMountOpencms mountOpencms \
        meta interleave="true" globally-unique="false" target-role="Started"
clone cloneMountShared mountShared \
        meta interleave="true" globally-unique="false" target-role="Started"
clone cloneO2cb o2cb \
        meta globally-unique="false" interleave="true"
clone cloneTomcat tomcat \
        meta target-role="Stopped"
colocation colocApache inf: cloneApache cloneTomcat
colocation colocGrpLdirector inf: grpLdirector cloneMountShared
colocation colocGrpMysql inf: grpMysql cloneMountMysql
colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
colocation colocMountMysql_o2cb inf: cloneMountMysql cloneO2cb
colocation colocMountOpencms_drbd inf: cloneMountOpencms msDrbdOpencms:Master
colocation colocMountOpencms_o2cb inf: cloneMountOpencms cloneO2cb
colocation colocMountShared_drbd inf: cloneMountShared msDrbdShared:Master
colocation colocMountShared_o2cb inf: cloneMountShared cloneO2cb
colocation colocO2cb inf: cloneO2cb cloneDlm
colocation colocTomcat inf: cloneTomcat cloneMountOpencms
order orderApache inf: cloneTomcat cloneApache
order orderGrpLdirector inf: cloneMountShared grpLdirector
order orderGrpMysql inf: cloneMountMysql grpMysql
order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start
order orderMountMysql_o2cb inf: cloneO2cb cloneMountMysql
order orderMountOpencms_drbd inf: msDrbdOpencms:promote cloneMountOpencms:start
order orderMountOpencms_o2cb inf: cloneO2cb cloneMountOpencms
order orderMountShared_drbd inf: msDrbdShared:promote cloneMountShared:start
order orderMountShared_o2cb inf: cloneO2cb cloneMountShared
order orderO2cb inf: cloneDlm cloneO2cb
order orderTomcat inf: cloneMountOpencms cloneTomcat
property $id="cib-bootstrap-options" \
        dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore" \
        start-failure-is-fatal="false" \
        cluster-recheck-interval="5m" \
        shutdown-escalation="5m" \
        last-lrm-refresh="1281543643"
rsc_defaults $id="rsc-options" \
        resource-stickiness="5"

node alpha \
        attributes standby="off"
node beta \
        attributes standby="off"
primitive dlm ocf:pacemaker:controld \
        op monitor interval="10" timeout="20" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
primitive drbdShared ocf:linbit:drbd \
        params drbd_resource="shared" \
        op monitor interval="10" role="Master" timeout="20" \
        op monitor interval="20" role="Slave" timeout="20" \
        op start interval="0" timeout="240" \
        op stop interval="0" timeout="100" \
        op promote interval="0" timeout="90" \
        op demote interval="0" timeout="90" \
        op notify interval="0" timeout="90"
primitive ipMysql ocf:heartbeat:IPaddr \
        params ip="192.168.135.67" cidr_netmask="255.255.0.0" \
        op monitor interval="2" timeout="20" \
        op start interval="0" timeout="90"
primitive mountShared ocf:heartbeat:Filesystem \
        params device="/dev/drbd0" directory="/shared" fstype="ocfs2" \
        op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60"
primitive mysql ocf:heartbeat:mysql \
        params binary="/usr/bin/mysqld_safe" config="/var/lib/mysql/my.cnf" \
                pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysqld.sock" \
                test_table="ha.check" test_user="HAuser" test_passwd="HApass" \
        op monitor interval="10" timeout="30" OCF_CHECK_LEVEL="0" \
        op start interval="0" timeout="120" \
        op stop interval="0" timeout="120"
primitive o2cb ocf:pacemaker:o2cb \
        op monitor interval="10" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100"
group grpMysql mysql ipMysql \
        meta migration-threshold="3" failure-timeout="30"
ms msDrbdShared drbdShared \
        meta resource-stickiness="100" notify="true" master-max="2"
clone cloneDlm dlm \
        meta globally-unique="false" interleave="true"
clone cloneMountShared mountShared \
        meta interleave="true" globally-unique="false" target-role="Started"
clone cloneO2cb o2cb \
        meta globally-unique="false" interleave="true" target-role="Started"
colocation colocMountShared_drbd inf: cloneMountShared msDrbdShared:Master
colocation colocMountShared_o2cb inf: cloneMountShared cloneO2cb
colocation colocMysql inf: grpMysql cloneMountShared
colocation colocO2cb inf: cloneO2cb cloneDlm
order orderMountShared_drbd inf: msDrbdShared:promote cloneMountShared:start
order orderMountShared_o2cb inf: cloneO2cb cloneMountShared
order orderMysql inf: cloneMountShared grpMysql
order orderO2cb inf: cloneDlm cloneO2cb
property $id="cib-bootstrap-options" \
        dc-version="1.0.9-unknown" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        start-failure-is-fatal="false" \
        last-lrm-refresh="1281577809" \
        cluster-recheck-interval="4m" \
        shutdown-escalation="5m"
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
