Hi,
I'm once again seeing (imho) strange behaviour, or rather strange
decision-making, by Pacemaker. I hope someone can either enlighten me
about it and its intention, point out a possible misconfiguration, or
confirm it as a possible bug.
Basically I have a 2-node cluster with cloned DLM, O2CB, DRBD and mount
resources, and a MySQL resource (grouped with an IPaddr resource)
running on top of them.
The MySQL resource (or its group) depends on the mount resource, which
depends equally on both the DRBD and the O2CB resources, and the O2CB
resource depends on the DLM resource.
cloneDlm -> cloneO2cb -\
                        }-> cloneMountMysql -> mysql / grpMysql( mysql -> ipMysql )
msDrbdMysql -----------/
Furthermore, for the MySQL resource (or its group) I set the
meta-attributes "migration-threshold=1" and "failure-timeout=90" (later
I also tried "3" and "130" for these).
I then provoked failures of mysql using
"crm_resource -F -r mysql -H <node>", expecting that only mysql or its
group (I tested both configurations, with the same result) would be
stopped (and moved over to the other node).
But in fact not only mysql/grpMysql was stopped: the mount resources
and even the DRBD resources were stopped as well. When they were
restarted, the DRBD resource was left as slave (so the mount of course
wasn't allowed to restart either) and, back before I set
cluster-recheck-interval=2m, it didn't even seem to try to promote back
to master (though I didn't wait out the default cluster-recheck-interval
of 15m).
After a lot of testing I found out that:
a) the stops/restarts of the underlying resources happen only when the
fail-counter hits the limit set by migration-threshold; i.e. when it is
set to 3, on the first 2 failures only mysql/grpMysql is restarted on
the same node, and only on the 3rd one are the underlying resources left
in a mess (while mysql/grpMysql migrates). This is reproducible for
DRBD; I'm unsure about the DLM/O2CB side, but there is sometimes serious
trouble there too after provoking mysql failures; I just couldn't
definitively link it yet.
b) upon causing mysql/grpMysql's migration, the score for
msDrbdMysql:promote changes from 10020 to -inf and stays there for the
duration of mysql/grpMysql's failure-timeout (verified by also setting
it to 130), before it rises back up to 10000
c) msDrbdMysql remains slave until the next cluster-recheck after its
promote score went back up to 10000
d) I also have the impression that fail-counters don't get reset after
their failure-timeout, because with migration-threshold=3 set, these
issues occur on every(!) subsequent provoked failure, even when I've
waited for nearly 5 minutes (with failure-timeout=90) without touching
the cluster at all
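To make observations (a) to (c) concrete, here is a minimal shell sketch
of what they amount to. It only models what I observed; it is not
Pacemaker's actual logic. The function names expected_action and
max_slave_seconds are made up for illustration, and the bound assumes
that nothing other than the periodic cluster-recheck triggers a
re-evaluation.

```shell
#!/bin/sh
# Model of observation (a): what happens to mysql/grpMysql on a failure.
# $1 = fail count after the new failure, $2 = migration-threshold
# (hypothetical helper, not a Pacemaker tool)
expected_action() {
    if [ "$1" -ge "$2" ]; then
        echo "migrate"            # threshold reached: resource leaves the node
    else
        echo "restart-in-place"   # below threshold: restarted on the same node
    fi
}

# Upper bound implied by (b) and (c): the promote score stays at -inf for
# failure-timeout seconds, and promotion then waits for the next recheck.
# $1 = failure-timeout in seconds, $2 = cluster-recheck-interval in seconds
# (hypothetical helper; assumes only the recheck timer triggers promotion)
max_slave_seconds() {
    echo $(( $1 + $2 ))
}

expected_action 2 3        # prints: restart-in-place
expected_action 3 3        # prints: migrate
max_slave_seconds 90 120   # prints: 210
```

So with failure-timeout=90 and cluster-recheck-interval=2m, DRBD could
stay slave for up to about 3.5 minutes after the failure if this model
holds.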
I experienced this on both test clusters: SLES 11 HAE SP1 with
Pacemaker 1.1.2, and Debian Squeeze with Pacemaker 1.0.9. When
migration-threshold for mysql/grpMysql is removed, everything is fine
(except that there is no migration, of course). I can't remember this
happening with SLES 11 HAE SP0's Pacemaker 1.0.6.
I'd really appreciate any comments and/or enlightenment about what's
going on here. (-;
p.s.: Just for fun / testing / proof I also constrained grpLdirector to
cloneMountShared... and could perfectly reproduce the problem with its
then-underlying resources too.
================================================================================
2) mysql: meta migration-threshold=1 failure-timeout=130 ->
drbd:promote only becomes possible again, score-wise, after 130 s
nde34:~ # nd=nde35; cl=1; failcmd="crm_resource -F -r mysql -H $nd"; \
    date; ptest -sL | grep "drbdMysql:$cl promotion score on $nd"; \
    date; echo $failcmd; $failcmd; \
    date; ptest -sL | grep "drbdMysql:$cl promotion score on $nd"; \
    sleep 85; \
    while true; do date; ptest -sL | grep "drbdMysql:$cl promotion score on $nd"; sleep 5; done
Wed Aug 11 15:33:04 CEST 2010
drbdMysql:1 promotion score on nde35: 10020
drbdMysql:1 promotion score on nde35: INFINITY
drbdMysql:1 promotion score on nde35: INFINITY
Wed Aug 11 15:33:04 CEST 2010
crm_resource -F -r mysql -H nde35
Wed Aug 11 15:33:05 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:34:31 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
[...]
Wed Aug 11 15:35:11 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:35:16 CEST 2010
drbdMysql:1 promotion score on nde35: 10000
drbdMysql:1 promotion score on nde35: INFINITY
drbdMysql:1 promotion score on nde35: INFINITY
^C
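To read the -INFINITY window off a transcript like the one above without
counting by hand, something along these lines could be used. It is only
a sketch: neg_inf_duration is a made-up helper name, it looks only at
the first score line after each timestamp, it assumes date(1) output in
the format shown above, and it does not handle runs that cross midnight.

```shell
#!/bin/sh
# Print how many seconds the promotion score stayed at -INFINITY,
# reading a date/score transcript from stdin.
# (hypothetical helper, not part of Pacemaker)
neg_inf_duration() {
    awk '
        # Timestamp line from date(1), e.g. "Wed Aug 11 15:33:05 CEST 2010".
        $4 ~ /^[0-9][0-9]:[0-9][0-9]:[0-9][0-9]$/ {
            split($4, hms, ":")
            now = hms[1] * 3600 + hms[2] * 60 + hms[3]
            fresh = 1
            next
        }
        # Only the first score line after each timestamp is considered.
        / promotion score on / && fresh {
            fresh = 0
            if ($NF == "-INFINITY" && start == "") start = now
            else if ($NF != "-INFINITY" && start != "" && end == "") end = now
        }
        END { if (start != "" && end != "") print end - start }
    '
}

# Abbreviated copy of the run above.
neg_inf_duration <<'EOF'
Wed Aug 11 15:33:04 CEST 2010
drbdMysql:1 promotion score on nde35: 10020
Wed Aug 11 15:33:05 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:35:11 CEST 2010
drbdMysql:1 promotion score on nde35: -INFINITY
Wed Aug 11 15:35:16 CEST 2010
drbdMysql:1 promotion score on nde35: 10000
EOF
# prints: 131
```

The result is only as precise as the 5-second polling interval, so 131
seconds is consistent with failure-timeout=130.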
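================================================================================
Configuration of the SLES 11 HAE SP1 test cluster (Pacemaker 1.1.2):
node nde34 \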
attributes standby="off"
node nde35 \
attributes standby="off"
primitive apache ocf:cj:apache \
params monitor_url="http://localhost:8080/opencms/opencms/test/cluster.html" log_level="warn" agent_timebuffer="1000" stopith_killall_enabled="1" \
op monitor interval="10" timeout="15" start-delay="15" \
op start interval="0" timeout="30" \
op stop interval="0" timeout="120"
primitive dlm ocf:pacemaker:controld \
op monitor interval="10" timeout="20" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive drbdMysql ocf:linbit:drbd \
params drbd_resource="mysql" \
op monitor interval="10" role="Master" timeout="20" \
op monitor interval="20" role="Slave" timeout="20" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="100" \
op promote interval="0" timeout="90" \
op demote interval="0" timeout="90" \
op notify interval="0" timeout="90"
primitive drbdOpencms ocf:linbit:drbd \
params drbd_resource="opencms" \
op monitor interval="10" role="Master" timeout="20" \
op monitor interval="20" role="Slave" timeout="20" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="100" \
op promote interval="0" timeout="90" \
op demote interval="0" timeout="90" \
op notify interval="0" timeout="90"
primitive drbdShared ocf:linbit:drbd \
params drbd_resource="wt-cluster" \
op monitor interval="10" role="Master" timeout="20" \
op monitor interval="20" role="Slave" timeout="20" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="100" \
op promote interval="0" timeout="90" \
op demote interval="0" timeout="90" \
op notify interval="0" timeout="90"
primitive ipLdirector ocf:heartbeat:IPaddr2 \
params lvs_support="true" ip="192.168.103.73" cidr_netmask="24" broadcast="2.255.255.255" \
op monitor interval="5"
primitive ipMysql ocf:heartbeat:IPaddr \
params ip="192.168.103.74" cidr_netmask="255.255.255.0" \
op monitor interval="2" timeout="20" \
op start interval="0" timeout="90"
primitive ldirector ocf:heartbeat:ldirectord \
params configfile="/etc/ha.d/ldirectord.cf" ldirectord="/usr/sbin/ldirectord" \
op monitor interval="20" timeout="10" \
op start interval="0" timeout="15" \
op stop interval="0" timeout="15"
primitive mountMysql ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/var/lib/mysql" fstype="ocfs2" \
op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
primitive mountOpencms ocf:heartbeat:Filesystem \
params device="/dev/drbd1" directory="/srv/tomcat6/webapps/opencms" fstype="ocfs2" \
op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
primitive mountShared ocf:heartbeat:Filesystem \
params device="/dev/drbd2" directory="/opt/wt-cluster" fstype="ocfs2" \
op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
primitive mysql ocf:heartbeat:mysql \
params binary="/usr/bin/mysqld_safe" config="/var/lib/mysql/my.cnf" pid="/var/run/mysql/mysqld.pid" socket="/var/lib/mysql/mysql.sock" test_table="test.HA_checkAvailability" test_user="HAmonUser" test_passwd="HAmonPW" \
op monitor interval="10" timeout="30" OCF_CHECK_LEVEL="1" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120"
primitive o2cb ocf:ocfs2:o2cb \
op monitor interval="10" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive tomcat ocf:cj:tomcat \
params monitor_url="http://localhost:8080/opencms/opencms/test/cluster.html" log_level="warn" agent_timebuffer="1000" stopith_killall_enabled="1" \
op monitor interval="10" timeout="15" start-delay="15" \
op start interval="0" timeout="30" \
op stop interval="0" timeout="120"
group grpLdirector ldirector ipLdirector \
meta migration-threshold="1" failure-timeout="60"
group grpMysql mysql ipMysql \
meta migration-threshold="2" failure-timeout="90"
ms msDrbdMysql drbdMysql \
meta resource-stickiness="100" notify="true" master-max="2"
ms msDrbdOpencms drbdOpencms \
meta resource-stickiness="100" notify="true" master-max="2"
ms msDrbdShared drbdShared \
meta resource-stickiness="100" notify="true" master-max="2"
clone cloneApache apache
clone cloneDlm dlm \
meta globally-unique="false" interleave="true"
clone cloneMountMysql mountMysql \
meta interleave="true" globally-unique="false" target-role="Started"
clone cloneMountOpencms mountOpencms \
meta interleave="true" globally-unique="false" target-role="Started"
clone cloneMountShared mountShared \
meta interleave="true" globally-unique="false" target-role="Started"
clone cloneO2cb o2cb \
meta globally-unique="false" interleave="true"
clone cloneTomcat tomcat \
meta target-role="Stopped"
colocation colocApache inf: cloneApache cloneTomcat
colocation colocGrpLdirector inf: grpLdirector cloneMountShared
colocation colocGrpMysql inf: grpMysql cloneMountMysql
colocation colocMountMysql_drbd inf: cloneMountMysql msDrbdMysql:Master
colocation colocMountMysql_o2cb inf: cloneMountMysql cloneO2cb
colocation colocMountOpencms_drbd inf: cloneMountOpencms msDrbdOpencms:Master
colocation colocMountOpencms_o2cb inf: cloneMountOpencms cloneO2cb
colocation colocMountShared_drbd inf: cloneMountShared msDrbdShared:Master
colocation colocMountShared_o2cb inf: cloneMountShared cloneO2cb
colocation colocO2cb inf: cloneO2cb cloneDlm
colocation colocTomcat inf: cloneTomcat cloneMountOpencms
order orderApache inf: cloneTomcat cloneApache
order orderGrpLdirector inf: cloneMountShared grpLdirector
order orderGrpMysql inf: cloneMountMysql grpMysql
order orderMountMysql_drbd inf: msDrbdMysql:promote cloneMountMysql:start
order orderMountMysql_o2cb inf: cloneO2cb cloneMountMysql
order orderMountOpencms_drbd inf: msDrbdOpencms:promote cloneMountOpencms:start
order orderMountOpencms_o2cb inf: cloneO2cb cloneMountOpencms
order orderMountShared_drbd inf: msDrbdShared:promote cloneMountShared:start
order orderMountShared_o2cb inf: cloneO2cb cloneMountShared
order orderO2cb inf: cloneDlm cloneO2cb
order orderTomcat inf: cloneMountOpencms cloneTomcat
property $id="cib-bootstrap-options" \
dc-version="1.1.2-2e096a41a5f9e184a1c1537c82c6da1093698eb5" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
start-failure-is-fatal="false" \
cluster-recheck-interval="5m" \
shutdown-escalation="5m" \
last-lrm-refresh="1281543643"
rsc_defaults $id="rsc-options" \
resource-stickiness="5"
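================================================================================
Configuration of the Debian Squeeze test cluster (Pacemaker 1.0.9):
node alpha \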
attributes standby="off"
node beta \
attributes standby="off"
primitive dlm ocf:pacemaker:controld \
op monitor interval="10" timeout="20" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
primitive drbdShared ocf:linbit:drbd \
params drbd_resource="shared" \
op monitor interval="10" role="Master" timeout="20" \
op monitor interval="20" role="Slave" timeout="20" \
op start interval="0" timeout="240" \
op stop interval="0" timeout="100" \
op promote interval="0" timeout="90" \
op demote interval="0" timeout="90" \
op notify interval="0" timeout="90"
primitive ipMysql ocf:heartbeat:IPaddr \
params ip="192.168.135.67" cidr_netmask="255.255.0.0" \
op monitor interval="2" timeout="20" \
op start interval="0" timeout="90"
primitive mountShared ocf:heartbeat:Filesystem \
params device="/dev/drbd0" directory="/shared" fstype="ocfs2" \
op monitor interval="10" timeout="40" OCF_CHECK_LEVEL="10" \
op start interval="0" timeout="60" \
op stop interval="0" timeout="60"
primitive mysql ocf:heartbeat:mysql \
params binary="/usr/bin/mysqld_safe" config="/var/lib/mysql/my.cnf" pid="/var/run/mysqld/mysqld.pid" socket="/var/lib/mysql/mysqld.sock" test_table="ha.check" test_user="HAuser" test_passwd="HApass" \
op monitor interval="10" timeout="30" OCF_CHECK_LEVEL="0" \
op start interval="0" timeout="120" \
op stop interval="0" timeout="120"
primitive o2cb ocf:pacemaker:o2cb \
op monitor interval="10" \
op start interval="0" timeout="90" \
op stop interval="0" timeout="100"
group grpMysql mysql ipMysql \
meta migration-threshold="3" failure-timeout="30"
ms msDrbdShared drbdShared \
meta resource-stickiness="100" notify="true" master-max="2"
clone cloneDlm dlm \
meta globally-unique="false" interleave="true"
clone cloneMountShared mountShared \
meta interleave="true" globally-unique="false" target-role="Started"
clone cloneO2cb o2cb \
meta globally-unique="false" interleave="true" target-role="Started"
colocation colocMountShared_drbd inf: cloneMountShared msDrbdShared:Master
colocation colocMountShared_o2cb inf: cloneMountShared cloneO2cb
colocation colocMysql inf: grpMysql cloneMountShared
colocation colocO2cb inf: cloneO2cb cloneDlm
order orderMountShared_drbd inf: msDrbdShared:promote cloneMountShared:start
order orderMountShared_o2cb inf: cloneO2cb cloneMountShared
order orderMysql inf: cloneMountShared grpMysql
order orderO2cb inf: cloneDlm cloneO2cb
property $id="cib-bootstrap-options" \
dc-version="1.0.9-unknown" \
cluster-infrastructure="openais" \
expected-quorum-votes="2" \
stonith-enabled="false" \
start-failure-is-fatal="false" \
last-lrm-refresh="1281577809" \
cluster-recheck-interval="4m" \
shutdown-escalation="5m"